From Algorithms to Z-Scores:
Probabilistic and Statistical Modeling in
Computer Science
Norm Matloff, University of California, Davis

The multivariate normal density shown on the cover, in plain-text form:

fX(t) = c e^{−0.5 (t−µ)′ Σ^{−1} (t−µ)}

[Cover figure: perspective plot of a bivariate normal density z = fX(x1, x2) over axes x1 and x2, generated in R via library(MASS) and x <- mvrnorm(mu, sgm).]
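The cover graphic can be reproduced with a short R sketch. The names mu and sgm follow the code fragment on the cover; the specific values used here (zero means, unit variances, correlation 0.5) are illustrative assumptions, not taken from the book, and the surface is computed directly from the density formula rather than from mvrnorm() draws.

```r
library(MASS)  # for mvrnorm()

# hypothetical parameters: standard bivariate normal with correlation 0.5
mu  <- c(0, 0)
sgm <- rbind(c(1.0, 0.5),
             c(0.5, 1.0))

# density fX(t) = c * exp(-0.5 (t-mu)' Sigma^{-1} (t-mu)),
# with c the usual bivariate normal normalizing constant
dens <- function(t1, t2) {
  t <- c(t1, t2) - mu
  cnst <- 1 / (2 * pi * sqrt(det(sgm)))
  cnst * exp(-0.5 * t %*% solve(sgm) %*% t)
}

# evaluate on a grid and draw the perspective plot, as on the cover
x1 <- seq(-10, 10, length.out = 50)
x2 <- seq(-10, 10, length.out = 50)
z  <- outer(x1, x2, Vectorize(dens))
persp(x1, x2, z, theta = 30, phi = 20)

# random draws, as in the cover's code fragment
x <- mvrnorm(n = 100, mu, sgm)
```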

See Creative Commons license at


http://heather.cs.ucdavis.edu/~matloff/probstatbook.html
The author has striven to minimize the number of errors, but no guarantee is made as to accuracy
of the contents of this book.

Author’s Biographical Sketch

Dr. Norm Matloff is a professor of computer science at the University of California at Davis, and
was formerly a professor of statistics at that university. He is a former database software developer
in Silicon Valley, and has been a statistical consultant for firms such as the Kaiser Permanente
Health Plan.
Dr. Matloff was born in Los Angeles, and grew up in East Los Angeles and the San Gabriel Valley.
He has a PhD in pure mathematics from UCLA, specializing in probability theory and statistics. He
has published numerous papers in computer science and statistics, with current research interests
in parallel processing, statistical computing, and regression methodology.
Prof. Matloff is a former appointed member of IFIP Working Group 11.3, an international committee concerned with database software security, established under UNESCO. He was a founding
member of the UC Davis Department of Statistics, and participated in the formation of the UCD
Computer Science Department as well. He is a recipient of the campuswide Distinguished Teaching
Award and Distinguished Public Service Award at UC Davis.
Dr. Matloff is the author of two published textbooks, and of a number of widely-used Web tutorials
on computer topics, such as the Linux operating system and the Python programming language.
He and Dr. Peter Salzman are authors of The Art of Debugging with GDB, DDD, and Eclipse.
Prof. Matloff’s book on the R programming language, The Art of R Programming, was published in
2011. His book, Parallel Computation for Data Science, will come out in early 2015. He is also the
author of several open-source textbooks, including From Algorithms to Z-Scores: Probabilistic and
Statistical Modeling in Computer Science (http://heather.cs.ucdavis.edu/probstatbook), and
Programming on Parallel Machines (http://heather.cs.ucdavis.edu/~matloff/ParProcBook.pdf).
Contents

1 Time Waste Versus Empowerment 1

2 Basic Probability Models 3


2.1 ALOHA Network Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 The Crucial Notion of a Repeatable Experiment . . . . . . . . . . . . . . . . . . . . 5
2.3 Our Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 “Mailing Tubes” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Basic Probability Computations: ALOHA Network Example . . . . . . . . . . . . . 10
2.6 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 ALOHA in the Notebook Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Solution Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Example: Divisibility of Random Integers . . . . . . . . . . . . . . . . . . . . . . . . 17
2.10 Example: A Simple Board Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.11 Example: Bus Ridership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.12 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.12.1 Example: Rolling Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.12.2 Improving the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.12.2.1 Simulation of Conditional Probability in Dice Problem . . . . . . . 24
2.12.3 Simulation of the ALOHA Example . . . . . . . . . . . . . . . . . . . . . . . 25


2.12.4 Example: Bus Ridership, cont’d. . . . . . . . . . . . . . . . . . . . . . . . . . 26


2.12.5 Back to the Board Game Example . . . . . . . . . . . . . . . . . . . . . . . . 27
2.12.6 How Long Should We Run the Simulation? . . . . . . . . . . . . . . . . . . . 27
2.13 Combinatorics-Based Probability Computation . . . . . . . . . . . . . . . . . . . . . 27
2.13.1 Which Is More Likely in Five Cards, One King or Two Hearts? . . . . . . . . 27
2.13.2 Example: Random Groups of Students . . . . . . . . . . . . . . . . . . . . . . 29
2.13.3 Example: Lottery Tickets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.13.4 “Association Rules” in Data Mining . . . . . . . . . . . . . . . . . . . . . . . 30
2.13.5 Multinomial Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.13.6 Example: Probability of Getting Four Aces in a Bridge Hand . . . . . . . . . 31

3 Discrete Random Variables 37


3.1 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Example: The Monty Hall Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.1 Generality—Not Just for Discrete Random Variables . . . . . . . . . . . . . . 40
3.5.1.1 What Is It? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.3 Existence of the Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.4 Computation and Properties of Expected Value . . . . . . . . . . . . . . . . . 41
3.5.5 “Mailing Tubes” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5.6 Casinos, Insurance Companies and “Sum Users,” Compared to Others . . . . 46
3.6 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6.2 Central Importance of the Concept of Variance . . . . . . . . . . . . . . . . . 51

3.6.3 Intuition Regarding the Size of Var(X) . . . . . . . . . . . . . . . . . . . . . . 51


3.6.3.1 Chebychev’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.6.3.2 The Coefficient of Variation . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 A Useful Fact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.9 Indicator Random Variables, and Their Means and Variances . . . . . . . . . . . . . 54
3.9.1 Example: Return Time for Library Books . . . . . . . . . . . . . . . . . . . . 55
3.9.2 Example: Indicator Variables in a Committee Problem . . . . . . . . . . . . . 55
3.10 Expected Value, Etc. in the ALOHA Example . . . . . . . . . . . . . . . . . . . . . 57
3.11 Example: Measurements at Different Ages . . . . . . . . . . . . . . . . . . . . . . . . 58
3.12 Example: Bus Ridership Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.13 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.13.1 Example: Toss Coin Until First Head . . . . . . . . . . . . . . . . . . . . . . 60
3.13.2 Example: Sum of Two Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.13.3 Example: Watts-Strogatz Random Graph Model . . . . . . . . . . . . . . . . 60
3.13.3.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.13.3.2 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.14 Parametric Families of pmfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.14.1 Parametric Families of Functions . . . . . . . . . . . . . . . . . . . . . . . . 62
3.14.2 The Case of Importance to Us: Parametric Families of pmfs . . . . . . . . . 63
3.14.3 The Geometric Family of Distributions . . . . . . . . . . . . . . . . . . . . . . 63
3.14.3.1 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.14.3.2 Example: a Parking Space Problem . . . . . . . . . . . . . . . . . . 66
3.14.4 The Binomial Family of Distributions . . . . . . . . . . . . . . . . . . . . . . 68
3.14.4.1 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.14.4.2 Example: Flipping Coins with Bonuses . . . . . . . . . . . . . . . . 70
3.14.4.3 Example: Analysis of Social Networks . . . . . . . . . . . . . . . . . 71

3.14.5 The Negative Binomial Family of Distributions . . . . . . . . . . . . . . . . . 72


3.14.5.1 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.14.5.2 Example: Backup Batteries . . . . . . . . . . . . . . . . . . . . . . . 74
3.14.6 The Poisson Family of Distributions . . . . . . . . . . . . . . . . . . . . . . . 74
3.14.6.1 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.14.7 The Power Law Family of Distributions . . . . . . . . . . . . . . . . . . . . . 75
3.14.7.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.14.7.2 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.15 Recognizing Some Parametric Distributions When You See Them . . . . . . . . . . . 77
3.15.1 Example: a Coin Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.15.2 Example: Tossing a Set of Four Coins . . . . . . . . . . . . . . . . . . . . . . 79
3.15.3 Example: the ALOHA Example Again . . . . . . . . . . . . . . . . . . . . . . 79
3.16 Example: the Bus Ridership Problem Again . . . . . . . . . . . . . . . . . . . . . . . 80
3.17 Multivariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.18 Iterated Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.18.1 The Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.18.2 Example: Coin and Die Game . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.19 A Cautionary Tale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.19.1 Trick Coins, Tricky Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.19.2 Intuition in Retrospect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.19.3 Implications for Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.20 Why Not Just Do All Analysis by Simulation? . . . . . . . . . . . . . . . . . . . . . 86
3.21 Proof of Chebychev’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.22 Reconciliation of Math and Intuition (optional section) . . . . . . . . . . . . . . . . . 88

4 Introduction to Discrete Markov Chains 95


4.1 Matrix Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.2 Example: Die Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96


4.3 Long-Run State Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3.1 Calculation of π . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 Example: 3-Heads-in-a-Row Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5 Example: ALOHA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.6 Example: Bus Ridership Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.7 Example: an Inventory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5 Continuous Probability Models 105


5.1 A Random Dart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Continuous Random Variables Are “Useful Unicorns” . . . . . . . . . . . . . . . . . 106
5.3 But Now We Have a Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.4 Density Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.4.1 Motivation, Definition and Interpretation . . . . . . . . . . . . . . . . . . . . 110
5.4.2 Properties of Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.3 A First Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4.4 The Notion of Support in the Continuous Case . . . . . . . . . . . . . . . . . 115
5.5 Famous Parametric Families of Continuous Distributions . . . . . . . . . . . . . . . . 116
5.5.1 The Uniform Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5.1.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5.1.2 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5.1.3 Example: Modeling of Disk Performance . . . . . . . . . . . . . . . 116
5.5.1.4 Example: Modeling of Denial-of-Service Attack . . . . . . . . . . . . 117
5.5.2 The Normal (Gaussian) Family of Continuous Distributions . . . . . . . . . . 117
5.5.2.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5.3 The Chi-Squared Family of Distributions . . . . . . . . . . . . . . . . . . . . 118
5.5.3.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 118

5.5.3.2 Example: Error in Pin Placement . . . . . . . . . . . . . . . . . . . 119


5.5.3.3 Importance in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.5.4 The Exponential Family of Distributions . . . . . . . . . . . . . . . . . . . . . 120
5.5.4.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.5.4.2 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.5.4.3 Example: Refunds on Failed Components . . . . . . . . . . . . . . . 121
5.5.4.4 Example: Garage Parking Fees . . . . . . . . . . . . . . . . . . . . . 121
5.5.4.5 Importance in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5.5 The Gamma Family of Distributions . . . . . . . . . . . . . . . . . . . . . . . 122
5.5.5.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5.5.2 Example: Network Buffer . . . . . . . . . . . . . . . . . . . . . . . . 123
5.5.5.3 Importance in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.5.6 The Beta Family of Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.5.6.1 Density Etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.5.6.2 Importance in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.6 Choosing a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.7 A General Method for Simulating a Random Variable . . . . . . . . . . . . . . . . . 127
5.8 Example: Writing a Set of R Functions for a Certain Power Family . . . . . . . . . . 128
5.9 Multivariate Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.10 “Hybrid” Continuous/Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . 130
5.11 Iterated Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.11.1 The Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.11.2 Example: Another Coin Game . . . . . . . . . . . . . . . . . . . . . . . . . . 131

6 The Normal Family of Distributions 135


6.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.1.1 Closure Under Affine Transformation . . . . . . . . . . . . . . . . . . . . . . . 135

6.1.2 Closure Under Independent Summation . . . . . . . . . . . . . . . . . . . . . 136


6.1.3 Evaluating Normal cdfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2 Example: Network Intrusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.3 Example: Class Enrollment Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.4 More on the Jill Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.5 Example: River Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.6 Example: Upper Tail of a Light Bulb Distribution . . . . . . . . . . . . . . . . . . . 141
6.7 The Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.8 Example: Cumulative Roundoff Error . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.9 Example: R Evaluation of a Central Limit Theorem Approximation . . . . . . . . . 142
6.10 Example: Bug Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.11 Example: Coin Tosses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.12 Museum Demonstration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.13 Importance in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.14 The Multivariate Normal Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.15 Optional Topic: Precise Statement of the CLT . . . . . . . . . . . . . . . . . . . . . 146
6.15.1 Convergence in Distribution, and the Precisely-Stated CLT . . . . . . . . . . 147

7 The Exponential Distributions 149


7.1 Connection to the Poisson Distribution Family . . . . . . . . . . . . . . . . . . . . . 149
7.2 Memoryless Property of Exponential Distributions . . . . . . . . . . . . . . . . . . . 151
7.2.1 Derivation and Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2.2 Uniquely Memoryless . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.2.3 Example: “Nonmemoryless” Light Bulbs . . . . . . . . . . . . . . . . . . . . . 153
7.3 Example: Minima of Independent Exponentially Distributed Random Variables . . . 153
7.3.1 Example: Computer Worm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.3.2 Example: Electronic Components . . . . . . . . . . . . . . . . . . . . . . . . . 157

7.4 A Cautionary Tale: the Bus Paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . 157


7.4.1 Length-Biased Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.4.2 Probability Mass Functions and Densities in Length-Biased Sampling . . . . 159

8 Stop and Review: Probability Structures 161

9 Covariance and Random Vectors 167


9.1 Measuring Co-variation of Random Variables . . . . . . . . . . . . . . . . . . . . . . 167
9.1.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
9.1.2 Example: Variance of Sum of Nonindependent Variables . . . . . . . . . . . . 169
9.1.3 Example: the Committee Example Again . . . . . . . . . . . . . . . . . . . . 169
9.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
9.2.1 Example: a Catchup Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.3 Sets of Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.3.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.3.1.1 Expected Values Factor . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.3.1.2 Covariance Is 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.3.1.3 Variances Add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.3.2 Examples Involving Sets of Independent Random Variables . . . . . . . . . . 173
9.3.2.1 Example: Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.3.2.2 Example: Variance of a Product . . . . . . . . . . . . . . . . . . . . 174
9.3.2.3 Example: Ratio of Independent Geometric Random Variables . . . 174
9.4 Matrix Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.4.1 Properties of Mean Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
9.4.2 Covariance Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
9.4.3 Covariance Matrices of Linear Combinations of Random Vectors . . . . . . . 177
9.4.4 Example: (X,S) Dice Example Again . . . . . . . . . . . . . . . . . . . . . . . 178

9.4.5 Example: Easy Sum Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178


9.5 The Multivariate Normal Family of Distributions . . . . . . . . . . . . . . . . . . . . 179
9.5.1 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
9.5.2 Special Case: New Variable Is a Single Linear Combination of a Random Vector 180
9.6 Indicator Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.7 Example: Dice Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.7.1 Correlation Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
9.7.2 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

10 Statistics: Prologue 187


10.1 Sampling Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
10.1.1 Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
10.1.2 The Sample Mean—a Random Variable . . . . . . . . . . . . . . . . . . . . . 189
10.1.3 Sample Means Are Approximately Normal—No Matter What the Population
Distribution Is . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
10.1.4 The Sample Variance—Another Random Variable . . . . . . . . . . . . . . . 191
10.1.4.1 Intuitive Estimation of σ² . . . . . . . . . . . . . . . . . . . . . . . 192
10.1.4.2 Easier Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.1.4.3 To Divide by n or n-1? . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.2 A Good Time to Stop and Review! . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

11 Introduction to Confidence Intervals 195


11.1 The “Margin of Error” and Confidence Intervals . . . . . . . . . . . . . . . . . . . . 195
11.2 Confidence Intervals for Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
11.2.1 Basic Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
11.2.2 Example: Simulation Output . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
11.3 Meaning of Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
11.3.1 A Weight Survey in Davis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

11.3.2 More About Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199


11.4 Confidence Intervals for Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
11.4.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
11.4.2 That n vs. n-1 Thing Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
11.4.3 Simulation Example Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
11.4.4 Example: Davis Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
11.4.5 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
11.4.6 (Non-)Effect of the Population Size . . . . . . . . . . . . . . . . . . . . . . . . 204
11.4.7 Inferring the Number Polled . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
11.4.8 Planning Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
11.5 General Formation of Confidence Intervals from Approximately Normal Estimators . 205
11.5.1 Basic Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
11.5.2 Standard Errors of Combined Estimators . . . . . . . . . . . . . . . . . . . . 207
11.6 Confidence Intervals for Differences of Means or Proportions . . . . . . . . . . . . . . 207
11.6.1 Independent Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
11.6.2 Example: Network Security Application . . . . . . . . . . . . . . . . . . . . . 209
11.6.3 Dependent Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
11.6.4 Example: Machine Classification of Forest Covers . . . . . . . . . . . . . . . . 211
11.7 And What About the Student-t Distribution? . . . . . . . . . . . . . . . . . . . . . . 212
11.8 R Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.9 Example: Pro Baseball Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.9.1 R Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.9.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
11.10 Example: UCI Bank Marketing Dataset . . . . . . . . . . . . . . . . . . . . . . . . 217
11.11 Example: Amazon Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
11.12 Example: Master’s Degrees in CS/EE . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.13 Other Confidence Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

11.14 One More Time: Why Do We Use Confidence Intervals? . . . . . . . . . . . . . . . 220

12 Introduction to Significance Tests 223


12.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
12.2 General Testing Based on Normally Distributed Estimators . . . . . . . . . . . . . . 225
12.3 Example: Network Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
12.4 The Notion of “p-Values” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
12.5 Example: Bank Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
12.6 One-Sided H_A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
12.7 Exact Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
12.7.1 Example: Test for Biased Coin . . . . . . . . . . . . . . . . . . . . . . . . . . 228
12.7.2 Example: Improved Light Bulbs . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.7.3 Example: Test Based on Range Data . . . . . . . . . . . . . . . . . . . . . . . 230
12.7.4 Exact Tests under a Normal Distribution Assumption . . . . . . . . . . . . . 231
12.8 Don’t Speak of “the Probability That H_0 Is True” . . . . . . . . . . . . . . . . . . 231
12.9 R Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
12.10 The Power of a Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
12.10.1 Example: Coin Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
12.10.2 Example: Improved Light Bulbs . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.11 What’s Wrong with Significance Testing—and What to Do Instead . . . . . . . . . 233
12.11.1 History of Significance Testing, and Where We Are Today . . . . . . . . . . . 234
12.11.2 The Basic Fallacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.11.3 You Be the Judge! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
12.11.4 What to Do Instead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
12.11.5 Decide on the Basis of “the Preponderance of Evidence” . . . . . . . . . . . . 237
12.11.6 Example: the Forest Cover Data . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.11.7 Example: Assessing Your Candidate’s Chances for Election . . . . . . . . . . 238

13 General Statistical Estimation and Inference 239


13.1 General Methods of Parametric Estimation . . . . . . . . . . . . . . . . . . . . . . . 239
13.1.1 Example: Guessing the Number of Raffle Tickets Sold . . . . . . . . . . . . . 239
13.1.2 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
13.1.3 Method of Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 241
13.1.4 Example: Estimating the Parameters of a Gamma Distribution . . . . . . . . 242
13.1.4.1 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
13.1.4.2 MLEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
13.1.4.3 R’s mle() Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
13.1.5 More Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
13.1.6 What About Confidence Intervals? . . . . . . . . . . . . . . . . . . . . . . . . 247
13.2 Bias and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
13.2.1 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
13.2.2 Why Divide by n-1 in s²? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
13.2.2.1 But in This Book, We Divide by n, not n-1 Anyway . . . . . . . . . 251
13.2.3 Example of Bias Calculation: Max from U(0,c) . . . . . . . . . . . . . . . . . 252
13.2.4 Example of Bias Calculation: Gamma Family . . . . . . . . . . . . . . . . . . 252
13.2.5 Tradeoff Between Variance and Bias . . . . . . . . . . . . . . . . . . . . . . . 253
13.3 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
13.3.1 How It Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
13.3.1.1 Empirical Bayes Methods . . . . . . . . . . . . . . . . . . . . . . . . 256
13.3.2 Extent of Usage of Subjective Priors . . . . . . . . . . . . . . . . . . . . . . . 257
13.3.3 Arguments Against Use of Subjective Priors . . . . . . . . . . . . . . . . . . . 257
13.3.4 What Would You Do? A Possible Resolution . . . . . . . . . . . . . . . . . . 259
13.3.5 The Markov Chain Monte Carlo Method . . . . . . . . . . . . . . . . . . . . . 259
13.3.6 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

14 Histograms and Beyond: Nonparametric Density Estimation 263


14.1 Basic Ideas in Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
14.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
14.3 Kernel-Based Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
14.4 Example: Baseball Player Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
14.5 More on Density Estimation in ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . . 266
14.6 Bias, Variance and Aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
14.6.1 Bias vs. Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
14.6.2 Aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
14.7 Nearest-Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
14.8 Estimating a cdf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
14.9 Hazard Function Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
14.10 For Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

15 Simultaneous Inference Methods 275


15.1 The Bonferroni Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
15.2 Scheffe’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
15.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
15.4 Other Methods for Simultaneous Inference . . . . . . . . . . . . . . . . . . . . . . . . 279

16 Linear Regression 281


16.1 The Goals: Prediction and Description . . . . . . . . . . . . . . . . . . . . . . . . . . 281
16.2 Example Applications: Software Engineering, Networks, Text Mining . . . . . . . . . 282
16.3 Adjusting for Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
16.4 What Does “Relationship” Really Mean? . . . . . . . . . . . . . . . . . . . . . . . . 284
16.4.1 Precise Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
16.4.2 (Rather Artificial) Example: Marble Problem . . . . . . . . . . . . . . . . . . 285

16.5 Estimating That Relationship from Sample Data . . . . . . . . . . . . . . . . . . . . 286


16.5.1 Parametric Models for the Regression Function m() . . . . . . . . . . . . . . 286
16.5.2 Estimation in Parametric Regression Models . . . . . . . . . . . . . . . . . . 287
16.5.3 More on Parametric vs. Nonparametric Models . . . . . . . . . . . . . . . . . 288
16.6 Example: Baseball Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
16.6.1 R Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
16.6.2 A Look through the Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
16.7 Multiple Regression: More Than One Predictor Variable . . . . . . . . . . . . . . . . 292
16.8 Example: Baseball Data (cont’d.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
16.9 Interaction Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
16.10 Parametric Estimation of Linear Regression Functions . . . . . . . . . . . . . . . 295
16.10.1 Meaning of “Linear” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
16.10.2 Random-X and Fixed-X Regression . . . . . . . . . . . . . . . . . . . . . . . 296
16.10.3 Point Estimates and Matrix Formulation . . . . . . . . . . . . . . . . . . . . 296
16.10.4 Approximate Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . 298
16.11 Example: Baseball Data (cont’d.) . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
16.12 Dummy Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
16.13 Example: Baseball Data (cont’d.) . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
16.14 What Does It All Mean?—Effects of Adding Predictors . . . . . . . . . . . . . . . 304
16.15 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
16.15.1 The Overfitting Problem in Regression . . . . . . . . . . . . . . . . . . . . . . 307
16.15.2 Relation to the Bias-vs.-Variance Tradeoff . . . . . . . . . . . . . . . . . . . 308
16.15.3 Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
16.15.4 Methods for Predictor Variable Selection . . . . . . . . . . . . . . . . . . . . 308
16.15.4.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
16.15.4.2 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
16.15.4.3 Predictive Ability Indicators . . . . . . . . . . . . . . . . . . . . . . 310

16.15.4.4 The LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311


16.15.5 Rough Rules of Thumb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
16.16 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
16.16.1 Height/Weight Age Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
16.16.2 R’s predict() Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
16.17 Example: Turkish Teaching Evaluation Data . . . . . . . . . . . . . . . . . . . . 313
16.17.1 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
16.17.2 Data Prep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
16.17.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
16.18 What About the Assumptions? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
16.18.1 Exact Confidence Intervals and Tests . . . . . . . . . . . . . . . . . . . . . . . 317
16.18.2 Is the Homoscedasticity Assumption Important? . . . . . . . . . . . . . . . . 318
16.18.3 Regression Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
16.19 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
16.19.1 Example: Prediction of Network RTT . . . . . . . . . . . . . . . . . . . . . . 319
16.19.2 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
16.19.3 Example: OOP Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320

17 Classification 325
17.1 Classification = Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
17.1.1 What Happens with Regression in the Case Y = 0,1? . . . . . . . . . . . . . 326
17.2 Logistic Regression: a Common Parametric Model for the Regression Function in
Classification Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
17.2.1 The Logistic Model: Motivations . . . . . . . . . . . . . . . . . . . . . . . . . 327
17.2.2 Estimation and Inference for Logit Coefficients . . . . . . . . . . . . . . . . 329
17.3 Example: Forest Cover Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
17.3.0.1 R Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330

17.3.1 Analysis of the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331


17.4 Example: Turkish Teaching Evaluation Data . . . . . . . . . . . . . . . . . . . . . . 333
17.5 The Multiclass Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
17.6 Model Selection in Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
17.7 What If Y Doesn’t Have a Marginal Distribution? . . . . . . . . . . . . . . . . . . . 333
17.8 Optimality of the Regression Function for 0-1-Valued Y (optional section) . . . . . . 334

18 Nonparametric Estimation of Regression and Classification Functions 337


18.1 Methods Based on Estimating mY ;X (t) . . . . . . . . . . . . . . . . . . . . . . . . . . 337
18.1.1 Nearest-Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
18.1.2 Kernel-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
18.1.3 The Naive Bayes Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
18.2 Methods Based on Estimating Classification Boundaries . . . . . . . . . . . . . . . . 342
18.2.1 Support Vector Machines (SVMs) . . . . . . . . . . . . . . . . . . . . . . . . 342
18.2.2 CART . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
18.3 Comparison of Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345

A R Quick Start 347


A.1 Correspondences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
A.2 Starting R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
A.3 First Sample Programming Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
A.4 Second Sample Programming Session . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
A.5 Third Sample Programming Session . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
A.6 Default Argument Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
A.7 The R List Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
A.7.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
A.7.2 The Reduce() Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356

A.7.3 S3 Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357


A.7.4 Handy Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
A.8 Data Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
A.9 Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
A.10 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
A.11 Other Sources for Learning R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
A.12 Online Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
A.13 Debugging in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
A.14 Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
A.15 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364

B Review of Matrix Algebra 365


B.1 Terminology and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
B.1.1 Matrix Addition and Multiplication . . . . . . . . . . . . . . . . . . . . . . . 366
B.2 Matrix Transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
B.3 Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
B.4 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
B.5 Matrix Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
B.6 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
B.7 Matrix Algebra in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370

C Introduction to the ggplot2 Graphics Package 373


C.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
C.2 Installation and Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
C.3 Basic Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
C.4 Example: Simple Line Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
C.5 Example: Census Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377

C.6 Function Plots, Density Estimates and Smoothing . . . . . . . . . . . . . . . . . . . 384


C.7 What’s Going on Inside . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
C.8 For Further Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Preface

Why is this book different from all other books on mathematical probability and statistics? The key
aspect is the book’s consistently applied approach, especially important for engineering students.
The applied nature is manifested in a number of ways. First, there is a strong emphasis
on intuition, with less mathematical formalism. In my experience, defining probability via sample
spaces, the standard approach, is a major impediment to doing good applied work. The same holds
for defining expected value as a weighted average. Instead, I use the intuitive, informal approach
of long-run frequency and long-run average. I believe this is especially helpful when explaining
conditional probability and expectation, concepts that students tend to have trouble with. (They
often think they understand until they actually have to work a problem using the concepts.)
On the other hand, in spite of the relative lack of formalism, all models and so on are described
precisely in terms of random variables and distributions. And the material is actually somewhat
more mathematical than most at this level in the sense that it makes extensive usage of linear
algebra.
Second, the book stresses real-world applications. Many similar texts, notably the elegant and
interesting book for computer science students by Mitzenmacher, focus on probability, in fact
discrete probability. Their intended class of “applications” is the theoretical analysis of algorithms.
I instead focus on the actual use of the material in the real world, which tends to be more continuous
than discrete, and more in the realm of statistics than probability. This should prove especially
valuable, as “big data” and machine learning now play a significant role in applications of computers.
Third, there is a strong emphasis on modeling. Considerable emphasis is placed on questions such
as: What do probabilistic models really mean, in real-life terms? How does one choose a model?
How do we assess the practical usefulness of models? This aspect is so important that there is
a separate chapter for this, titled Introduction to Model Building. Throughout the text, there is
considerable discussion of the real-world meaning of probabilistic concepts. For instance, when
probability density functions are introduced, there is an extended discussion regarding the intuitive
meaning of densities in light of the inherently-discrete nature of real data, due to the finite precision
of measurement.


Finally, the R statistical/data analysis language is used throughout. Again, several excellent texts
on probability and statistics have been written that feature R, but this book, by virtue of having a
computer science audience, uses R in a more sophisticated manner. My open source tutorial on R
programming, R for Programmers (http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf),
can be used as a supplement. (More advanced R programming is covered in my book, The Art of
R Programming, No Starch Press, 2011.)
There is a large amount of material here. For my one-quarter undergraduate course, I usually
cover Chapters 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13 and 16. My lecture style is conversational,
referring to material in the book and making lots of supplementary remarks (“What if we changed
the assumption here to such-and-such?” etc.). Students read the details on their own. For my
one-quarter graduate course, I cover Chapters 8, ??, ??, ??, ??, 14, ??, 16, 17, 18 and ??.
As prerequisites, the student must know calculus, basic matrix algebra, and have some skill in
programming. As with any text in probability and statistics, it is also necessary that the student
has a good sense of math intuition, and does not treat mathematics as simply memorization of
formulas.
The LaTeX source .tex files for this book are in http://heather.cs.ucdavis.edu/~matloff/132/
PLN, so readers can copy the R code and experiment with it. (It is not recommended to copy-and-
paste from the PDF file, as hidden characters may be copied.) The PDF file is searchable.
The following, among many, provided valuable feedback for which I am very grateful: Ahmed
Ahmedin; Stuart Ambler; Earl Barr; Benjamin Beasley; Matthew Butner; Michael Clifford; Dipak
Ghosal; Noah Gift; Laura Matloff; Nelson Max; Connie Nguyen; Jack Norman; Richard Oehrle;
Yingkang Xie; and Ivana Zetko.
Many of the data sets used in the book are from the UC Irvine Machine Learning Repository, http:
//archive.ics.uci.edu/ml/. Thanks to UCI for making available this very valuable resource.
The book contains a number of references for further reading. Since the audience includes a number
of students at my institution, the University of California, Davis, I often refer to work by current
or former UCD faculty, so that students can see what their professors do in research.
This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 United States
License. The details may be viewed at http://creativecommons.org/licenses/by-nd/3.0/us/,
but in essence it states that you are free to use, copy and distribute the work, but you must
attribute the work to me and not “alter, transform, or build upon” it. If you are using the book,
either in teaching a class or for your own learning, I would appreciate your informing me. I retain
copyright in all non-U.S. jurisdictions, but permission to use these materials in teaching is still
granted, provided the licensing information here is displayed.
Chapter 1

Time Waste Versus Empowerment

I took a course in speed reading, and read War and Peace in 20 minutes. It’s about Russia—
comedian Woody Allen
I learned very early the difference between knowing the name of something and knowing something—
Richard Feynman, Nobel laureate in physics
The main goal [of this course] is self-actualization through the empowerment of claiming your
education—UCSC (and former UCD) professor Marc Mangel, in the syllabus for his calculus course
What does this really mean? Hmm, I’ve never thought about that—UCD PhD student in statistics,
in answer to a student who asked the actual meaning of a very basic concept
You have a PhD in mechanical engineering. You may have forgotten technical details like d/dt sin(t) =
cos(t), but you should at least understand the concepts of rates of change—the author, gently chiding
a friend who was having trouble following a simple quantitative discussion of trends in California’s
educational system

The field of probability and statistics (which, for convenience, I will refer to simply as “statistics”
below) impacts many aspects of our daily lives—business, medicine, the law, government and so
on. Consider just a few examples:

• The statistical models used on Wall Street made the “quants” (quantitative analysts) rich—
but also contributed to the worldwide financial crash of 2008.

• In a court trial, large sums of money or the freedom of an accused may hinge on whether the
judge and jury understand some statistical evidence presented by one side or the other.

• Wittingly or unconsciously, you are using probability every time you gamble in a casino—and


every time you buy insurance.

• Statistics is used to determine whether a new medical treatment is safe/effective for you.

• Statistics is used to flag possible terrorists—but sometimes unfairly singling out innocent
people while other times missing ones who really are dangerous.

Clearly, statistics matters. But it only has value when one really understands what it means and
what it does. Indeed, blindly plugging into statistical formulas can be not only valueless but in
fact highly dangerous, say if a bad drug goes onto the market.
Yet most people view statistics as exactly that—mindless plugging into boring formulas. If even
the statistics graduate student quoted above thinks this, how can the students taking the course
be blamed for taking that attitude?
I once had a student who had an unusually good understanding of probability. It turned out that
this was due to his being highly successful at playing online poker, winning lots of cash. No blind
formula-plugging for him! He really had to understand how probability works.
Statistics is not just a bunch of formulas. On the contrary, it can be mathematically deep, for those
who like that kind of thing. (Much of statistics can be viewed as the Pythagorean Theorem in
n-dimensional or even infinite-dimensional space.) But the key point is that anyone who has taken
a calculus course can develop true understanding of statistics, of real practical value. As Professor
Mangel says, that’s empowering.
So as you make your way through this book, always stop to think, “What does this equation really
mean? What is its goal? Why are its ingredients defined in the way they are? Might there be a
better way? How does this relate to our daily lives?” Now THAT is empowering.
Chapter 2

Basic Probability Models

This chapter will introduce the general notions of probability. Most of it will seem intuitive to you,
but pay careful attention to the general principles which are developed; in more complex settings
intuition may not be enough, and the tools discussed here will be very useful.

2.1 ALOHA Network Example

Throughout this book, we will be discussing both “classical” probability examples involving coins,
cards and dice, and also examples involving applications to computer science. The latter will involve
diverse fields such as data mining, machine learning, computer networks, software engineering and
bioinformatics.
In this section, an example from computer networks is presented which will be used at a number
of points in this chapter. Probability analysis is used extensively in the development of new, faster
types of networks.
Today’s Ethernet evolved from an experimental network developed at the University of Hawaii,
called ALOHA. A number of network nodes would occasionally try to use the same radio channel to
communicate with a central computer. The nodes couldn’t hear each other, due to the obstruction
of mountains between them. If only one of them made an attempt to send, it would be successful,
and it would receive an acknowledgement message in response from the central computer. But if
more than one node were to transmit, a collision would occur, garbling all the messages. The
sending nodes would timeout after waiting for an acknowledgement which never came, and try
sending again later. To avoid having too many collisions, nodes would engage in random backoff,
meaning that they would refrain from sending for a while even though they had something to send.
One variation is slotted ALOHA, which divides time into intervals which I will call “epochs.” Each


epoch will have duration 1.0, so epoch 1 extends from time 0.0 to 1.0, epoch 2 extends from 1.0 to
2.0 and so on. In the version we will consider here, in each epoch, if a node is active, i.e. has a
message to send, it will either send or refrain from sending, with probability p and 1-p. The value
of p is set by the designer of the network. (Real Ethernet hardware does something like this, using
a random number generator inside the chip.)
The other parameter q in our model is the probability that a node which had been inactive generates
a message during an epoch, i.e. the probability that the user hits a key, and thus becomes “active.”
Think of what happens when you are at a computer. You are not typing constantly, and when you
are not typing, the time until you hit a key again will be random. Our parameter q models that
randomness.
Let n be the number of nodes, which we’ll assume for simplicity is two. Assume also for simplicity
that the timing is as follows. Arrival of a new message happens in the middle of an epoch, and the
decision as to whether to send versus back off is made near the end of an epoch, say 90% into the
epoch.
For example, say that at the beginning of the epoch which extends from time 15.0 to 16.0, node A
has something to send but node B does not. At time 15.5, node B will either generate a message
to send or not, with probability q and 1-q, respectively. Suppose B does generate a new message.
At time 15.9, node A will either try to send or refrain, with probability p and 1-p, and node B will
do the same. Suppose A refrains but B sends. Then B’s transmission will be successful, and at the
start of epoch 16 B will be inactive, while node A will still be active. On the other hand, suppose
both A and B try to send at time 15.9; both will fail, and thus both will be active at time 16.0,
and so on.
Be sure to keep in mind that in our simple model here, during the time a node is active, it won’t
generate any additional new messages.
(Note: The definition of this ALOHA model is summarized concisely on page 10.)
Let’s observe the network for two epochs, epoch 1 and epoch 2. Assume that the network consists
of just two nodes, called node 1 and node 2, both of which start out active. Let X1 and X2 denote
the numbers of active nodes at the very end of epochs 1 and 2, after possible transmissions. We’ll
take p to be 0.4 and q to be 0.8 in this example.
Let’s find P (X1 = 2), the probability that X1 = 2, and then get to the main point, which is to ask
what we really mean by this probability.
How could X1 = 2 occur? There are two possibilities:

• both nodes try to send; this has probability p²

• neither node tries to send; this has probability (1 − p)²



1,1 1,2 1,3 1,4 1,5 1,6


2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6

Table 2.1: Sample Space for the Dice Example

Thus

P(X1 = 2) = p² + (1 − p)² = 0.52 (2.1)
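As a sanity check on (2.1), we can fill in the notional notebook by simulation. Here is a minimal sketch in R (the function name alohasim is mine, not from the book; since both nodes start out active and active nodes generate no new messages, the parameter q plays no role in the first epoch):

```r
# Estimate P(X1 = 2) in the slotted ALOHA model, with both nodes
# starting out active and p = 0.4, by simulating many "notebook lines."
alohasim <- function(nreps, p = 0.4) {
   count2 <- 0  # number of lines in which X1 = 2
   for (i in 1:nreps) {
      # each of the 2 active nodes independently sends with probability p
      numsend <- rbinom(1, 2, p)
      # exactly one sender means a success, leaving X1 = 1; otherwise
      # (collision or silence) both nodes remain active, so X1 = 2
      if (numsend != 1) count2 <- count2 + 1
   }
   count2 / nreps  # long-run fraction of "Yes" lines
}

set.seed(9999)
alohasim(100000)  # should be near p^2 + (1-p)^2 = 0.52
```

Each iteration of the loop is one line of the notebook described in the next section; the returned fraction approaches 0.52 as nreps grows.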

2.2 The Crucial Notion of a Repeatable Experiment

It’s crucial to understand what that 0.52 figure really means in a practical sense. To this end, let’s
put the ALOHA example aside for a moment, and consider the “experiment” consisting of rolling
two dice, say a blue one and a yellow one. Let X and Y denote the number of dots we get on the
blue and yellow dice, respectively, and consider the meaning of P(X + Y = 6) = 5/36.
In the mathematical theory of probability, we talk of a sample space, which (in simple cases)
consists of the possible outcomes (X, Y ), seen in Table 2.1. In a theoretical treatment, we place
weights of 1/36 on each of the points in the space, reflecting the fact that each of the 36 points is
equally likely, and then say, “What we mean by P(X + Y = 6) = 5/36 is that the outcomes (1,5),
(2,4), (3,3), (4,2), (5,1) have total weight 5/36.”
Unfortunately, the notion of sample space becomes mathematically tricky when developed for more
complex probability models. Indeed, it requires graduate-level math. And much worse, one loses all
the intuition. In any case, most probability computations do not rely on explicitly writing down a
sample space. In this particular example it is useful for us as a vehicle for explaining the concepts,
but we will NOT use it much. Those who wish to get a more theoretical grounding can get a start
in Section 3.22.
But the intuitive notion—which is FAR more important—of what P(X + Y = 6) = 5/36 means is
the following. Imagine doing the experiment many, many times, recording the results in a large
notebook:

notebook line outcome blue+yellow = 6?


1 blue 2, yellow 6 No
2 blue 3, yellow 1 No
3 blue 1, yellow 1 No
4 blue 4, yellow 2 Yes
5 blue 1, yellow 1 No
6 blue 3, yellow 4 No
7 blue 5, yellow 1 Yes
8 blue 3, yellow 6 No
9 blue 2, yellow 5 No

Table 2.2: Notebook for the Dice Problem

• Roll the dice the first time, and write the outcome on the first line of the notebook.

• Roll the dice the second time, and write the outcome on the second line of the notebook.

• Roll the dice the third time, and write the outcome on the third line of the notebook.

• Roll the dice the fourth time, and write the outcome on the fourth line of the notebook.

• Imagine you keep doing this, thousands of times, filling thousands of lines in the notebook.

The first 9 lines of the notebook might look like Table 2.2. Here 2/9 of these lines say Yes. But
after many, many repetitions, approximately 5/36 of the lines will say Yes. For example, after
doing the experiment 720 times, approximately (5/36) × 720 = 100 lines will say Yes.
This is what probability really is: In what fraction of the lines does the event of interest happen?
It sounds simple, but if you always think about this “lines in the notebook” idea,
probability problems are a lot easier to solve. And it is the fundamental basis of computer
simulation.
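The notebook scheme is exactly what a computer simulation does. Here is a minimal sketch in R (the function name dicesim is mine, not from the book):

```r
# Fill in many "notebook lines" for the two-dice experiment, and
# report the fraction of lines in which blue + yellow = 6.
dicesim <- function(nreps) {
   count6 <- 0  # number of "Yes" lines so far
   for (i in 1:nreps) {
      blue <- sample(1:6, 1)    # roll the blue die
      yellow <- sample(1:6, 1)  # roll the yellow die
      if (blue + yellow == 6) count6 <- count6 + 1
   }
   count6 / nreps  # long-run fraction of Yes lines
}

set.seed(9999)
dicesim(100000)  # should be near 5/36 = 0.1389
```

Each pass through the loop writes one line of the notebook; the returned fraction is the simulation's estimate of P(X + Y = 6).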

2.3 Our Definitions

These definitions are intuitive, rather than rigorous math, but intuition is what we need. Keep in
mind that we are making definitions below, not listing properties.

• We assume an “experiment” which is (at least in concept) repeatable. The experiment of


rolling two dice is repeatable, and even the ALOHA experiment is so. (We simply watch the
network for a long time, collecting data on pairs of consecutive epochs in which there are
two active stations at the beginning.) On the other hand, the econometricians, in forecasting
2009, cannot “repeat” 2008. Yet all of the econometricians’ tools assume that events in 2008
were affected by various sorts of randomness, and we think of repeating the experiment in a
conceptual sense.

• We imagine performing the experiment a large number of times, recording the result of each
repetition on a separate line in a notebook.

• We say A is an event for this experiment if it is a possible boolean (i.e. yes-or-no) outcome
of the experiment. In the above example, here are some events:

* X+Y = 6
* X=1
* Y=3
* X-Y = 4

• A random variable is a numerical outcome of the experiment, such as X and Y here, as


well as X+Y, 2XY and even sin(XY).

• For any event of interest A, imagine a column on A in the notebook. The kth line in the
notebook, k = 1,2,3,..., will say Yes or No, depending on whether A occurred or not during
the kth repetition of the experiment. For instance, we have such a column in our table above,
for the event {A = blue+yellow = 6}.

• For any event of interest A, we define P(A) to be the long-run fraction of lines with Yes
entries.

• For any events A, B, imagine a new column in our notebook, labeled “A and B.” In each line,
this column will say Yes if and only if there are Yes entries for both A and B. P(A and B) is
then the long-run fraction of lines with Yes entries in the new column labeled “A and B.” [1]

• For any events A, B, imagine a new column in our notebook, labeled “A or B.” In each line,
this column will say Yes if and only if at least one of the entries for A and B says Yes. [2]

• For any events A, B, imagine a new column in our notebook, labeled “A | B” and pronounced
“A given B.” In each line:
[1] In most textbooks, what we call “A and B” here is written A ∩ B, indicating the intersection of two sets in the
sample space. But again, we do not take a sample space point of view here.
[2] In the sample space approach, this is written A ∪ B.
as creation of derivative works, reports, performances and
research. Project Gutenberg eBooks may be modified and
printed and given away—you may do practically ANYTHING in
the United States with eBooks not protected by U.S. copyright
law. Redistribution is subject to the trademark license, especially
commercial redistribution.

START: FULL LICENSE


THE FULL PROJECT GUTENBERG LICENSE
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the


free distribution of electronic works, by using or distributing this
work (or any other work associated in any way with the phrase
“Project Gutenberg”), you agree to comply with all the terms of
the Full Project Gutenberg™ License available with this file or
online at www.gutenberg.org/license.

Section 1. General Terms of Use and


Redistributing Project Gutenberg™
electronic works
1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand,
agree to and accept all the terms of this license and intellectual
property (trademark/copyright) agreement. If you do not agree to
abide by all the terms of this agreement, you must cease using
and return or destroy all copies of Project Gutenberg™
electronic works in your possession. If you paid a fee for
obtaining a copy of or access to a Project Gutenberg™
electronic work and you do not agree to be bound by the terms
of this agreement, you may obtain a refund from the person or
entity to whom you paid the fee as set forth in paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only


be used on or associated in any way with an electronic work by
people who agree to be bound by the terms of this agreement.
There are a few things that you can do with most Project
Gutenberg™ electronic works even without complying with the
full terms of this agreement. See paragraph 1.C below. There
are a lot of things you can do with Project Gutenberg™
electronic works if you follow the terms of this agreement and
help preserve free future access to Project Gutenberg™
electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright
law in the United States and you are located in the United
States, we do not claim a right to prevent you from copying,
distributing, performing, displaying or creating derivative works
based on the work as long as all references to Project
Gutenberg are removed. Of course, we hope that you will
support the Project Gutenberg™ mission of promoting free
access to electronic works by freely sharing Project
Gutenberg™ works in compliance with the terms of this
agreement for keeping the Project Gutenberg™ name
associated with the work. You can easily comply with the terms
of this agreement by keeping this work in the same format with
its attached full Project Gutenberg™ License when you share it
without charge with others.

1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside
the United States, check the laws of your country in addition to
the terms of this agreement before downloading, copying,
displaying, performing, distributing or creating derivative works
based on this work or any other Project Gutenberg™ work. The
Foundation makes no representations concerning the copyright
status of any work in any country other than the United States.

1.E. Unless you have removed all references to Project


Gutenberg:

1.E.1. The following sentence, with active links to, or other


immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project
Gutenberg™ work (any work on which the phrase “Project
Gutenberg” appears, or with which the phrase “Project
Gutenberg” is associated) is accessed, displayed, performed,
viewed, copied or distributed:

This eBook is for the use of anyone anywhere in the United


States and most other parts of the world at no cost and with
almost no restrictions whatsoever. You may copy it, give it
away or re-use it under the terms of the Project Gutenberg
License included with this eBook or online at
www.gutenberg.org. If you are not located in the United
States, you will have to check the laws of the country where
you are located before using this eBook.

1.E.2. If an individual Project Gutenberg™ electronic work is


derived from texts not protected by U.S. copyright law (does not
contain a notice indicating that it is posted with permission of the
copyright holder), the work can be copied and distributed to
anyone in the United States without paying any fees or charges.
If you are redistributing or providing access to a work with the
phrase “Project Gutenberg” associated with or appearing on the
work, you must comply either with the requirements of
paragraphs 1.E.1 through 1.E.7 or obtain permission for the use
of the work and the Project Gutenberg™ trademark as set forth
in paragraphs 1.E.8 or 1.E.9.

1.E.3. If an individual Project Gutenberg™ electronic work is


posted with the permission of the copyright holder, your use and
distribution must comply with both paragraphs 1.E.1 through
1.E.7 and any additional terms imposed by the copyright holder.
Additional terms will be linked to the Project Gutenberg™
License for all works posted with the permission of the copyright
holder found at the beginning of this work.

1.E.4. Do not unlink or detach or remove the full Project


Gutenberg™ License terms from this work, or any files
containing a part of this work or any other work associated with
Project Gutenberg™.
1.E.5. Do not copy, display, perform, distribute or redistribute
this electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the
Project Gutenberg™ License.

1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if
you provide access to or distribute copies of a Project
Gutenberg™ work in a format other than “Plain Vanilla ASCII” or
other format used in the official version posted on the official
Project Gutenberg™ website (www.gutenberg.org), you must, at
no additional cost, fee or expense to the user, provide a copy, a
means of exporting a copy, or a means of obtaining a copy upon
request, of the work in its original “Plain Vanilla ASCII” or other
form. Any alternate format must include the full Project
Gutenberg™ License as specified in paragraph 1.E.1.

1.E.7. Do not charge a fee for access to, viewing, displaying,


performing, copying or distributing any Project Gutenberg™
works unless you comply with paragraph 1.E.8 or 1.E.9.

1.E.8. You may charge a reasonable fee for copies of or


providing access to or distributing Project Gutenberg™
electronic works provided that:

• You pay a royalty fee of 20% of the gross profits you derive from
the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who


notifies you in writing (or by e-mail) within 30 days of receipt that
s/he does not agree to the terms of the full Project Gutenberg™
License. You must require such a user to return or destroy all
copies of the works possessed in a physical medium and
discontinue all use of and all access to other copies of Project
Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of


any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project


Gutenberg™ electronic work or group of works on different
terms than are set forth in this agreement, you must obtain
permission in writing from the Project Gutenberg Literary
Archive Foundation, the manager of the Project Gutenberg™
trademark. Contact the Foundation as set forth in Section 3
below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend


considerable effort to identify, do copyright research on,
transcribe and proofread works not protected by U.S. copyright
law in creating the Project Gutenberg™ collection. Despite
these efforts, Project Gutenberg™ electronic works, and the
medium on which they may be stored, may contain “Defects,”
such as, but not limited to, incomplete, inaccurate or corrupt
data, transcription errors, a copyright or other intellectual
property infringement, a defective or damaged disk or other
medium, a computer virus, or computer codes that damage or
cannot be read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES -


Except for the “Right of Replacement or Refund” described in
paragraph 1.F.3, the Project Gutenberg Literary Archive
Foundation, the owner of the Project Gutenberg™ trademark,
and any other party distributing a Project Gutenberg™ electronic
work under this agreement, disclaim all liability to you for
damages, costs and expenses, including legal fees. YOU
AGREE THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE,
STRICT LIABILITY, BREACH OF WARRANTY OR BREACH
OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH
1.F.3. YOU AGREE THAT THE FOUNDATION, THE
TRADEMARK OWNER, AND ANY DISTRIBUTOR UNDER
THIS AGREEMENT WILL NOT BE LIABLE TO YOU FOR
ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE
OR INCIDENTAL DAMAGES EVEN IF YOU GIVE NOTICE OF
THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If


you discover a defect in this electronic work within 90 days of
receiving it, you can receive a refund of the money (if any) you
paid for it by sending a written explanation to the person you
received the work from. If you received the work on a physical
medium, you must return the medium with your written
explanation. The person or entity that provided you with the
defective work may elect to provide a replacement copy in lieu
of a refund. If you received the work electronically, the person or
entity providing it to you may choose to give you a second
opportunity to receive the work electronically in lieu of a refund.
If the second copy is also defective, you may demand a refund
in writing without further opportunities to fix the problem.

1.F.4. Except for the limited right of replacement or refund set


forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’,
WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR
ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied


warranties or the exclusion or limitation of certain types of
damages. If any disclaimer or limitation set forth in this
agreement violates the law of the state applicable to this
agreement, the agreement shall be interpreted to make the
maximum disclaimer or limitation permitted by the applicable
state law. The invalidity or unenforceability of any provision of
this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the


Foundation, the trademark owner, any agent or employee of the
Foundation, anyone providing copies of Project Gutenberg™
electronic works in accordance with this agreement, and any
volunteers associated with the production, promotion and
distribution of Project Gutenberg™ electronic works, harmless
from all liability, costs and expenses, including legal fees, that
arise directly or indirectly from any of the following which you do
or cause to occur: (a) distribution of this or any Project
Gutenberg™ work, (b) alteration, modification, or additions or
deletions to any Project Gutenberg™ work, and (c) any Defect
you cause.

Section 2. Information about the Mission of


Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new
computers. It exists because of the efforts of hundreds of
volunteers and donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the


assistance they need are critical to reaching Project
Gutenberg™’s goals and ensuring that the Project Gutenberg™
collection will remain freely available for generations to come. In
2001, the Project Gutenberg Literary Archive Foundation was
created to provide a secure and permanent future for Project
Gutenberg™ and future generations. To learn more about the
Project Gutenberg Literary Archive Foundation and how your
efforts and donations can help, see Sections 3 and 4 and the
Foundation information page at www.gutenberg.org.

Section 3. Information about the Project


Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-
profit 501(c)(3) educational corporation organized under the
laws of the state of Mississippi and granted tax exempt status by
the Internal Revenue Service. The Foundation’s EIN or federal
tax identification number is 64-6221541. Contributions to the
Project Gutenberg Literary Archive Foundation are tax
deductible to the full extent permitted by U.S. federal laws and
your state’s laws.

The Foundation’s business office is located at 809 North 1500


West, Salt Lake City, UT 84116, (801) 596-1887. Email contact
links and up to date contact information can be found at the
Foundation’s website and official page at
www.gutenberg.org/contact

Section 4. Information about Donations to


the Project Gutenberg Literary Archive
Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission
of increasing the number of public domain and licensed works
that can be freely distributed in machine-readable form
accessible by the widest array of equipment including outdated
equipment. Many small donations ($1 to $5,000) are particularly
important to maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws


regulating charities and charitable donations in all 50 states of
the United States. Compliance requirements are not uniform
and it takes a considerable effort, much paperwork and many
fees to meet and keep up with these requirements. We do not
solicit donations in locations where we have not received written
confirmation of compliance. To SEND DONATIONS or
determine the status of compliance for any particular state visit
www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states


where we have not met the solicitation requirements, we know
of no prohibition against accepting unsolicited donations from
donors in such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot


make any statements concerning tax treatment of donations
received from outside the United States. U.S. laws alone swamp
our small staff.

Please check the Project Gutenberg web pages for current


donation methods and addresses. Donations are accepted in a
number of other ways including checks, online payments and
credit card donations. To donate, please visit:
www.gutenberg.org/donate.

Section 5. General Information About Project


Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could
be freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose
network of volunteer support.

Project Gutenberg™ eBooks are often created from several


printed editions, all of which are confirmed as not protected by
copyright in the U.S. unless a copyright notice is included. Thus,
we do not necessarily keep eBooks in compliance with any
particular paper edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,


including how to make donations to the Project Gutenberg
Literary Archive Foundation, how to help produce our new
eBooks, and how to subscribe to our email newsletter to hear
about new eBooks.

You might also like