Building Statistical Models in Python: Develop useful models for regression, classification, time series, and survival analysis
About this ebook
The ability to proficiently perform statistical modeling is a fundamental skill for data scientists and essential for businesses reliant on data insights. Building Statistical Models in Python is a comprehensive guide that will empower you to leverage mathematical and statistical principles in data assessment, understanding, and inference generation.
This book not only equips you with skills to navigate the complexities of statistical modeling, but also provides practical guidance for immediate implementation through illustrative examples. Through emphasis on application and code examples, you’ll understand the concepts while gaining hands-on experience. With the help of Python and its essential libraries, you’ll explore key statistical models, including hypothesis testing, regression, time series analysis, classification, and more.
By the end of this book, you’ll gain fluency in statistical modeling while harnessing the full potential of Python's rich ecosystem for data analysis.
Building Statistical Models in Python - Huy Hoang Nguyen
BIRMINGHAM—MUMBAI
Building Statistical Models in Python
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Ali Abidi
Publishing Product Manager: Sanjana Gupta
Senior Editor: Sushma Reddy
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Book Project Manager: Kirti Pisat
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Prashant Ghare
Marketing Coordinator: Nivedita Singh
First published: August 2023
Production reference: 3310823
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul's Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80461-428-0
www.packtpub.com
To my parents, Thieu and Tang, for their enormous support and faith in me.
To my wife, Tam, for her endless love, dedication, and courage.
- Huy Hoang Nguyen
To my daughter, Lydie, for demonstrating how work and dedication regenerate inspiration and creativity. To my wife, Helene, for her love and support.
– Paul Adams
To my partner, Kate, who has always supported my endeavors.
– Stuart Miller
Contributors
About the authors
Huy Hoang Nguyen is a mathematician and data scientist with extensive experience in advanced mathematics, strategic leadership, and applied machine learning research. He holds a PhD in Mathematics, as well as two Master’s degrees in Applied Mathematics and Data Science. His previous work focused on Partial Differential Equations, Functional Analysis, and their applications in Fluid Mechanics. After transitioning from academia to the healthcare industry, he has undertaken a variety of data science projects, ranging from traditional machine learning to deep learning.
Paul Adams is a Data Scientist with a background primarily in the healthcare industry. Paul applies statistics and machine learning in multiple areas of industry, focusing on projects in process engineering, process improvement, metrics and business rules development, anomaly detection, forecasting, clustering, and classification. Paul holds an MSc in Data Science from Southern Methodist University.
Stuart Miller is a Machine Learning Engineer with a wide range of experience. Stuart has applied machine learning methods to various projects in industries ranging from insurance to semiconductor manufacturing. Stuart holds degrees in data science, electrical engineering, and physics.
About the reviewers
Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, ranging from C++ and Java to Python, data warehousing, and big data tools and technologies.
When not working, Krishnan likes to spend time with his wife and daughter, reading fiction and nonfiction as well as technical books. Krishnan tries to give back to the community by being part of the GDG Pune Volunteer Group, helping the team organize events. Currently, he is unsuccessfully trying to learn how to play the guitar.
You can connect with Krishnan via LinkedIn.
I would like to thank my wife Anita and daughter Ananya for giving me the time and space to review this book.
Karthik Dulam is a Principal Data Scientist at EDB. He is passionate about all things data with a particular focus on data engineering, statistical modeling, and machine learning. He has a diverse background delivering machine learning solutions for the healthcare, IT, automotive, telecom, tax, and advisory industries. He actively engages with students as a guest speaker at esteemed universities delivering insightful talks on machine learning use cases.
I would like to thank my wife, Sruthi Anem, for her unwavering support and patience. I also want to thank my family, friends, and colleagues who have played an instrumental role in shaping the person I am today. Their unwavering support, encouragement, and belief in me have been a constant source of inspiration.
Table of Contents
Preface
Part 1: Introduction to Statistics
1
Sampling and Generalization
Software and environment setup
Population versus sample
Population inference from samples
Randomized experiments
Observational study
Sampling strategies – random, systematic, stratified, and clustering
Probability sampling
Non-probability sampling
Summary
2
Distributions of Data
Technical requirements
Understanding data types
Nominal data
Ordinal data
Interval data
Ratio data
Visualizing data types
Measuring and describing distributions
Measuring central tendency
Measuring variability
Measuring shape
The normal distribution and central limit theorem
The Central Limit Theorem
Bootstrapping
Confidence intervals
Standard error
Correlation coefficients (Pearson’s correlation)
Permutations
Permutations and combinations
Permutation testing
Transformations
Summary
References
3
Hypothesis Testing
The goal of hypothesis testing
Overview of a hypothesis test for the mean
Scope of inference
Hypothesis test steps
Type I and Type II errors
Type I errors
Type II errors
Basics of the z-test – the z-score, z-statistic, critical values, and p-values
The z-score and z-statistic
A z-test for means
z-test for proportions
Power analysis for a two-population pooled z-test
Summary
4
Parametric Tests
Assumptions of parametric tests
Normally distributed population data
Equal population variance
T-test – a parametric hypothesis test
T-test for means
Two-sample t-test – pooled t-test
Two-sample t-test – Welch’s t-test
Paired t-test
Tests with more than two groups and ANOVA
Multiple tests for significance
ANOVA
Pearson’s correlation coefficient
Power analysis examples
Summary
References
5
Non-Parametric Tests
When parametric test assumptions are violated
Permutation tests
The Rank-Sum test
The test statistic procedure
Normal approximation
Rank-Sum example
The Signed-Rank test
The Kruskal-Wallis test
Chi-square distribution
Chi-square goodness-of-fit
Chi-square test of independence
Chi-square goodness-of-fit test power analysis
Spearman’s rank correlation coefficient
Summary
Part 2: Regression Models
6
Simple Linear Regression
Simple linear regression using OLS
Coefficients of correlation and determination
Coefficients of correlation
Coefficients of determination
Required model assumptions
A linear relationship between the variables
Normality of the residuals
Homoscedasticity of the residuals
Sample independence
Testing for significance and validating models
Model validation
Summary
7
Multiple Linear Regression
Multiple linear regression
Adding categorical variables
Evaluating model fit
Interpreting the results
Feature selection
Statistical methods for feature selection
Performance-based methods for feature selection
Recursive feature elimination
Shrinkage methods
Ridge regression
LASSO regression
Elastic Net
Dimension reduction
PCA – a hands-on introduction
PCR – a hands-on salary prediction study
Summary
Part 3: Classification Models
8
Discrete Models
Probit and logit models
Multinomial logit model
Poisson model
The Poisson distribution
Modeling count data
The negative binomial regression model
Negative binomial distribution
Summary
9
Discriminant Analysis
Bayes’ theorem
Probability
Conditional probability
Discussing Bayes’ Theorem
Linear Discriminant Analysis
Supervised dimension reduction
Quadratic Discriminant Analysis
Summary
Part 4: Time Series Models
10
Introduction to Time Series
What is a time series?
Goals of time series analysis
Statistical measurements
Mean
Variance
Autocorrelation
Cross-correlation
The white-noise model
Stationarity
Summary
References
11
ARIMA Models
Technical requirements
Models for stationary time series
Autoregressive (AR) models
Moving average (MA) models
Autoregressive moving average (ARMA) models
Models for non-stationary time series
ARIMA models
Seasonal ARIMA models
More on model evaluation
Summary
References
12
Multivariate Time Series
Multivariate time series
Time-series cross-correlation
ARIMAX
Preprocessing the exogenous variables
Fitting the model
Assessing model performance
VAR modeling
Step 1 – visual inspection
Step 2 – selecting the order of AR(p)
Step 3 – assessing cross-correlation
Step 4 – building the VAR(p,q) model
Step 5 – testing the forecast
Step 6 – building the forecast
Summary
References
Part 5: Survival Analysis
13
Time-to-Event Variables – An Introduction
What is censoring?
Left censoring
Right censoring
Interval censoring
Type I and Type II censoring
Survival data
Survival Function, Hazard and Hazard Ratio
Summary
14
Survival Models
Technical requirements
Kaplan-Meier model
Model definition
Model example
Exponential model
Model example
Cox Proportional Hazards regression model
Step 1
Step 2
Step 3
Step 4
Step 5
Summary
Index
Other Books You May Enjoy
Preface
Statistics is the discipline of applying analytical methods to answer questions and solve problems with data, in both academic and industry settings. Many of its methods have been around for centuries, while others are much more recent. Statistical analyses and results are fairly straightforward to present to both technical and non-technical audiences. Furthermore, producing results with statistical analysis does not necessarily require large amounts of data or compute resources and can be done fairly quickly, especially when using programming languages such as Python, which is relatively easy to learn and work with.
While artificial intelligence (AI) and advanced machine learning (ML) tools have become more prominent and popular in recent years with the increased accessibility of compute power, performing statistical analysis as a precursor to larger-scale AI and ML projects enables a practitioner to assess feasibility and practicality before committing larger compute resources and project architecture development to those projects.
This book provides a wide variety of tools that are commonly used to test hypotheses and provide basic predictive capabilities to analysts and data scientists alike. You will walk through the basic concepts and terminology required to understand the statistical tools in this book before exploring the different tests and the conditions under which they are applicable. You will also learn how to assess the performance of these tests. Throughout, examples are provided in the Python programming language to get you started with understanding your data using the tools presented, which are applicable to some of the most common questions faced in the data analytics industry. The topics we will walk through include:
An introduction to statistics
Regression models
Classification models
Time series models
Survival analysis
Understanding the tools provided in these sections will provide the reader with a firm foundation from which further independent growth in the statistics domain can more easily be achieved.
Who this book is for
Professionals in most industries can benefit from the tools in this book. The tools provided are useful primarily at a higher level of inferential analysis, but can be applied to deeper levels depending on the industry in which the practitioner wishes to apply them. The target audiences of this book are:
Industry professionals with limited statistical or programming knowledge who would like to learn to use data for testing hypotheses they have in their business domain
Data analysts and scientists who wish to broaden their statistical knowledge and find a set of tools and their implementations for performing various data-oriented tasks
The ground-up approach of this book seeks to provide entry into the knowledge base for a wide audience and therefore should neither discourage novice-level practitioners nor exclude advanced-level practitioners from the benefits of the materials presented.
What this book covers
Chapter 1, Sampling and Generalization, describes the concepts of sampling and generalization. The discussion of sampling covers several common methods for sampling data from a population and discusses the implications for generalization. This chapter also discusses how to set up the software required for this book.
Chapter 2, Distributions of Data, provides a detailed introduction to types of data, common distributions used to describe data, and statistical measures. This chapter also covers common transformations used to change distributions.
Chapter 3, Hypothesis Testing, introduces the concept of statistical tests as a method for answering questions of interest. This chapter covers the steps to perform a test, the types of errors encountered in testing, and how to perform power analysis using the z-test.
Chapter 4, Parametric Tests, further discusses statistical tests, providing detailed descriptions of common parametric statistical tests, the assumptions of parametric tests, and how to assess the validity of parametric tests. This chapter also introduces the concept of multiple tests and provides details on corrections for multiple tests.
Chapter 5, Non-parametric Tests, discusses how to perform statistical tests when the assumptions of parametric tests are violated, using a class of tests that make fewer assumptions, called non-parametric tests.
Chapter 6, Simple Linear Regression, introduces the concept of a statistical model with the simple linear regression model. This chapter begins by discussing the theoretical foundations of simple linear regression and then discusses how to interpret the results of the model and assess the validity of the model.
Chapter 7, Multiple Linear Regression, builds on the previous chapter by extending the simple linear regression model into additional dimensions. This chapter also discusses issues that occur when modeling with multiple explanatory variables, including multicollinearity, feature selection, and dimension reduction.
Chapter 8, Discrete Models, introduces the concept of classification and develops a model for classifying variables into discrete levels of a categorical response variable. This chapter starts by developing the model for binary classification and then extends the model to multinomial classification. Finally, the Poisson and negative binomial models are covered.
Chapter 9, Discriminant Analysis, discusses several additional models for classification, including linear discriminant analysis and quadratic discriminant analysis. This chapter also introduces Bayes’ Theorem.
Chapter 10, Introduction to Time Series, introduces time series data, discussing the time series concept of autocorrelation and the statistical measures for time series. This chapter also introduces the white noise model and stationarity.
Chapter 11, ARIMA Models, discusses models for univariate time series. This chapter starts by discussing models for stationary time series and then extends the discussion to non-stationary time series. Finally, this chapter provides a detailed discussion on model evaluation.
Chapter 12, Multivariate Time Series, builds on the previous two chapters by introducing the concept of a multivariate time series and extends ARIMA models to multiple explanatory variables. This chapter also discusses time series cross-correlation.
Chapter 13, Time-to-Event Variables – An Introduction, introduces survival data, also called time-to-event data. This chapter discusses the concept of censoring and its impact on survival data. Finally, the chapter discusses the survival function, hazard, and hazard ratio.
Chapter 14, Survival Models, building on the previous chapter, provides an overview of several models for survival data, including the Kaplan-Meier model, the Exponential model, and the Cox Proportional Hazards model.
To get the most out of this book
You will need access to download and install open source packages implemented in the Python programming language and accessible through PyPi.org or the Anaconda Python distribution. A background in statistics is helpful but not necessary; this book does assume you have a decent background in basic algebra. Each part of this book is independent of the other parts, but the chapters within each part build upon each other. Thus, we advise you to begin each part with its first chapter to understand the content.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Building-Statistical-Models-in-Python. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.
A block of code is set as follows:
A = [3,5,4]
B = [43,41,56,78,54]
permutation_testing(A,B,n_iter=10000)
Any command-line input or output is written as follows:
pip install SomePackage
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: Select System info from the Administration panel.
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share Your Thoughts
Once you’ve read Building Statistical Models in Python, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/978-1-80461-428-0
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly.
Part 1: Introduction to Statistics
This part will cover the statistical concepts that are foundational to statistical modeling.
It includes the following chapters:
Chapter 1, Sampling and Generalization
Chapter 2, Distributions of Data
Chapter 3, Hypothesis Testing
Chapter 4, Parametric Tests
Chapter 5, Non-Parametric Tests
1
Sampling and Generalization
In this chapter, we will describe the concept of populations and sampling from populations, including some common strategies for sampling. The discussion of sampling will lead to a section describing generalization, as it relates to using samples to draw conclusions about their respective populations. When modeling for statistical inference, it is necessary to ensure that samples can be generalized to populations. We will provide an in-depth overview of this bridge between sample and population through the subjects in this chapter.
We will cover the following main topics:
Software and environment setup
Population versus sample
Population inference from samples
Sampling strategies – random, systematic, stratified, and clustering
Software and environment setup
Python is one of the most popular programming languages for data science and machine learning, thanks to the large open source community that has driven the development of its data science libraries. Python’s ease of use and flexible nature have made it a prime candidate in the data science world, where experimentation and iteration are key features of the development cycle. While new languages for data science applications are in development, such as Julia, Python currently remains the key language for data science due to its wide breadth of open source projects, supporting applications from statistical modeling to deep learning. We have chosen to use Python in this book due to its positioning as an important language for data science and its demand in the job market.
Python is available for all major operating systems: Microsoft Windows, macOS, and Linux. Additionally, the installer and documentation can be found at the official website: https://www.python.org/.
This book is written for Python version 3.8 (or higher). It is recommended that you use the most recent version of Python available. The code in this book is unlikely to be compatible with Python 2.7, and most active libraries have dropped support for Python 2.7 since official support ended in 2020.
The libraries used in this book can be installed with the Python package manager, pip, which is included with contemporary versions of Python. More information about pip can be found here: https://docs.python.org/3/installing/index.html. After pip is installed, packages can be installed using pip on the command line. Here is basic usage at a glance:
Install a new package using the latest version:
pip install SomePackage
Install the package with a specific version, version 2.1 in this example:
pip install SomePackage==2.1
A package that is already installed can be upgraded with the --upgrade flag:
pip install SomePackage --upgrade
In general, it is recommended to use Python virtual environments to keep the dependencies of different projects separate from each other and from system directories. Python provides a virtual environment utility, venv, which, like pip, is part of the standard library in contemporary versions of Python. Virtual environments allow you to create isolated Python environments, each with its own set of installed dependencies. Using virtual environments can prevent package version issues and conflicts when working on multiple Python projects. Details on setting up and using virtual environments can be found here: https://docs.python.org/3/library/venv.html.
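For example, a virtual environment can be created, activated, and populated from the command line as follows (the environment name .venv is an arbitrary choice):

python -m venv .venv
source .venv/bin/activate
pip install SomePackage

On Windows, the activation command is .venv\Scripts\activate instead.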
While we recommend the use of Python and Python’s virtual environments for environment setups, a highly recommended alternative is Anaconda. Anaconda is a free (enterprise-ready) analytics-focused distribution of Python by Anaconda Inc. (previously Continuum Analytics). Anaconda distributions come with many of the core data science packages, common IDEs (such as Jupyter and Visual Studio Code), and a graphical user interface for managing environments. Anaconda can be installed using the installer found at the Anaconda website here: https://www.anaconda.com/products/distribution.
Anaconda comes with its own package manager, conda, which can be used to install new packages similarly to pip.
Install a new package using the latest version:
conda install SomePackage
Upgrade a package that is already installed:
conda upgrade SomePackage
Throughout this book, we will make use of several core libraries in the Python data science ecosystem, such as NumPy for array manipulations, pandas for higher-level data manipulations, and matplotlib for data visualization. The package versions used for this book are contained in the following list. Please ensure that the versions installed in your environment are equal to or greater than the versions listed. This will help ensure that the code examples run correctly:
statsmodels 0.13.2
Matplotlib 3.5.2
NumPy 1.23.0
SciPy 1.8.1
scikit-learn 1.1.1
pandas 1.4.3
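If you prefer to install all of these minimum versions in one step (a sketch; adjust the pins as needed for your environment), a single pip command will do it:

pip install "statsmodels>=0.13.2" "matplotlib>=3.5.2" "numpy>=1.23.0" "scipy>=1.8.1" "scikit-learn>=1.1.1" "pandas>=1.4.3"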
The packages used for the code in this book are shown in Figure 1.1. The __version__ attribute can be used to print a package’s version in code.
Figure 1.1 – Package versions used in this book
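As a quick sketch of such a check (assuming the packages listed previously are installed), the following Python snippet prints the name and installed version of each core package:

import matplotlib
import numpy
import pandas
import scipy
import sklearn
import statsmodels

# Print the installed version of each core package
for pkg in (statsmodels, matplotlib, numpy, scipy, sklearn, pandas):
    print(pkg.__name__, pkg.__version__)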
Having set up the technical environment for the book, let’s get into the statistics. In the next sections, we will discuss the concepts of population and sampling. We will demonstrate sampling strategies with code implementations.
Population versus sample
In general, the goal of statistical modeling is to answer a question about a group by making an inference about that group. The group we are making an inference on could be machines in a production factory, people voting in an election, or plants on different plots of land. The entire group, every individual item or entity, is referred to as the population. In most cases, the population of interest is so large that it is not practical or even possible to collect data on every entity in the population. For instance, using the voting example, it would probably not be possible to poll every person who voted in an election. Even if it were possible to reach all the voters for the election of interest, many voters may not consent to polling, which would prevent data collection on the entire population. An additional consideration would be the expense of polling such a large group. These factors make it practically impossible to collect population statistics in our example of vote polling. Such prohibitive factors exist in many cases where we may want to assess a population-level attribute. Fortunately, we do not need to collect data on the entire population of interest. Inferences about a population can be made using a subset of the population. This subset of the population is called a sample. This is the main idea of statistical modeling: a model will be created using a sample, and inferences will be made about the population.
In order to make valid inferences about the population of interest using a sample, the sample must be representative of the population of interest, meaning that the sample should contain the variation found in the population. For example, if we were interested in making an inference about plants in a field, it is unlikely that samples from one corner of the field would be sufficient for inferences about the larger population. There would likely be variations in plant characteristics over the entire field. We could think of various reasons why there might be variation. For this example, we will consider some examples from Figure 1.2.
Figure 1.2 – Field of plants
The figure shows that Sample A is near a forest. This sample area may be affected by the presence of the forest; for example, some of the plants in that sample may receive less sunlight than plants in the other samples. Sample B is shown to be in between the main irrigation lines. It’s conceivable that this sample receives more water on average than the other two samples, which may have an effect on the plants in this sample. The final sample, Sample C, is near a road. This sample may see other effects that are not seen in Sample A or B.
If samples were only taken from one of those sections, the inferences from those samples would be biased and would not be valid for the population. Thus, samples would need to be taken from across the entire field to create a sample that is more likely to be representative of the population of plants. When taking samples from populations, it is critical to ensure the sampling method is robust to possible issues, such as the influence of irrigation and shade in the previous example. Whenever taking a sample from a population, it’s important to identify and mitigate possible sources of bias because biases in data will affect your model and skew your conclusions.
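As a minimal sketch of this idea (the plant measurements below are fabricated purely for illustration), pandas can draw an equal number of plants from each section of the field so that every area is represented:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical plant heights, labeled by field section
field = pd.DataFrame({
    "section": np.repeat(["A", "B", "C"], 300),
    "height_cm": rng.normal(loc=50, scale=5, size=900),
})

# Sample 20 plants from each section so all areas contribute to the sample
sample = field.groupby("section").sample(n=20, random_state=0)
print(sample["section"].value_counts())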
In the next section, various methods for sampling from a dataset will be discussed. An additional consideration is the sample size, which impacts the types of statistical tools we can use, the distributional assumptions that can be made about the sample, and the confidence of inferences and predictions. The impact of sample size will be explored in depth in Chapter 2, Distributions of Data, and Chapter 3, Hypothesis Testing.
Population inference from samples
When using a statistical model to make inferential conclusions about a population from a sample subset of that population, the study design must account for similar degrees of uncertainty in its variables as those in the population. This is the variation mentioned earlier in this chapter. To appropriately draw inferential conclusions about a population, any statistical model must be structured around a chance mechanism. Studies structured around these chance mechanisms are called randomized experiments and provide an understanding of both correlation and causation.
Randomized experiments
There are two primary characteristics of a randomized experiment:
Random sampling, colloquially referred to as random selection
Random assignment of treatments, which is the nature of the study
Random sampling
Random sampling (also called random selection) is designed with the intent of creating a sample that is representative of the overall population, so that statistical models generalize to the population well enough to assign cause-and-effect outcomes. In order for random sampling to be successful, the population of interest must be well defined, and every member of the population must have a chance of being selected. In the example of polling voters, this means all voters must be willing to be polled; once all voters are entered into a lottery, random sampling can be used to subset voters for modeling. If instead only some voters are willing to participate, sampling is restricted to those volunteers, which introduces sampling bias into statistical modeling and can lead to skewed results. This sampling method is called self-selection. Any information obtained and modeled from self-selected samples – or any non-random samples – cannot be used for inference.
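As a minimal sketch of random sampling (the voter IDs below are hypothetical), NumPy can draw a simple random sample without replacement:

import numpy as np

rng = np.random.default_rng(seed=42)  # seeded only for reproducibility

# Hypothetical population of 10,000 voter IDs
population = np.arange(10_000)

# Draw a simple random sample of 500 voters; every voter has an equal chance
sample = rng.choice(population, size=500, replace=False)
print(sample[:10])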
Random assignment of treatments
The random assignment of treatments refers to two motivators:
The first motivator is to gain an understanding of specific input variables and their influence on the response – for example, understanding whether assigning treatment A to a specific individual may produce more favorable outcomes than a placebo.
The second motivator is to remove the impact of external variables on the outcomes of a study. These external variables, called confounding variables (or confounders), are important to remove as they often prove difficult to control. They may have unpredictable values or even be unknown to the researcher. The consequence of including confounders is that the outcomes of a study may not be replicable, which can be costly. While confounders can influence outcomes, they can also influence input variables, as well as the relationships between those variables.
Referring back to the example in the earlier section, Population versus sample, consider a farmer who decides to start using pesticides on his crops and wants to test two different brands. The farmer knows there are three distinct areas of the land: plot A, plot B, and plot C. To determine the success of the pesticides and prevent damage to the crops, the farmer randomly chooses 60 plants from each plot for testing (this is called stratified random sampling, where random sampling is performed separately within each plot). This