From Algorithms to Z-Scores:
Probabilistic and Statistical Modeling in
Computer Science
Norm Matloff, University of California, Davis
[Cover figure: perspective plot of a bivariate normal density

f_X(t) = c e^{−0.5 (t−µ)′ Σ^{−1} (t−µ)}

generated in R with library(MASS) and x <- mvrnorm(mu, sgm).]
Dr. Norm Matloff is a professor of computer science at the University of California at Davis, and
was formerly a professor of statistics at that university. He is a former database software developer
in Silicon Valley, and has been a statistical consultant for firms such as the Kaiser Permanente
Health Plan.
Dr. Matloff was born in Los Angeles, and grew up in East Los Angeles and the San Gabriel Valley.
He has a PhD in pure mathematics from UCLA, specializing in probability theory and statistics. He
has published numerous papers in computer science and statistics, with current research interests
in parallel processing, statistical computing, and regression methodology.
Prof. Matloff is a former appointed member of IFIP Working Group 11.3, an international com-
mittee concerned with database software security, established under UNESCO. He was a founding
member of the UC Davis Department of Statistics, and participated in the formation of the UCD
Computer Science Department as well. He is a recipient of the campuswide Distinguished Teaching
Award and Distinguished Public Service Award at UC Davis.
Dr. Matloff is the author of two published textbooks, and of a number of widely-used Web tutorials
on computer topics, such as the Linux operating system and the Python programming language.
He and Dr. Peter Salzman are authors of The Art of Debugging with GDB, DDD, and Eclipse.
Prof. Matloff’s book on the R programming language, The Art of R Programming, was published in
2011. His book, Parallel Computation for Data Science, will come out in early 2015. He is also the
author of several open-source textbooks, including From Algorithms to Z-Scores: Probabilistic and
Statistical Modeling in Computer Science (http://heather.cs.ucdavis.edu/probstatbook), and
Programming on Parallel Machines (http://heather.cs.ucdavis.edu/~matloff/ParProcBook.
pdf).
Contents

17 Classification 325
  17.1 Classification = Regression . . . 326
    17.1.1 What Happens with Regression in the Case Y = 0,1? . . . 326
  17.2 Logistic Regression: a Common Parametric Model for the Regression Function in Classification Problems . . . 327
    17.2.1 The Logistic Model: Motivations . . . 327
    17.2.2 Estimation and Inference for Logit Coefficients . . . 329
  17.3 Example: Forest Cover Data . . . 330
    17.3.0.1 R Code . . . 330
Why is this book different from all other books on mathematical probability and statistics? The key
aspect is the book’s consistently applied approach, especially important for engineering students.
The applied nature is manifested in a number of ways. First, there is a strong emphasis
on intuition, with less mathematical formalism. In my experience, defining probability via sample
spaces, the standard approach, is a major impediment to doing good applied work. The same holds
for defining expected value as a weighted average. Instead, I use the intuitive, informal approach
of long-run frequency and long-run average. I believe this is especially helpful when explaining
conditional probability and expectation, concepts that students tend to have trouble with. (They
often think they understand until they actually have to work a problem using the concepts.)
On the other hand, in spite of the relative lack of formalism, all models and so on are described
precisely in terms of random variables and distributions. And the material is actually somewhat
more mathematical than most at this level in the sense that it makes extensive usage of linear
algebra.
Second, the book stresses real-world applications. Many similar texts, notably the elegant and
interesting book for computer science students by Mitzenmacher, focus on probability, in fact
discrete probability. Their intended class of “applications” is the theoretical analysis of algorithms.
I instead focus on the actual use of the material in the real world, which tends to be more continuous
than discrete, and more in the realm of statistics than probability. This should prove especially
valuable, as “big data” and machine learning now play a significant role in applications of computers.
Third, there is a strong emphasis on modeling. Considerable emphasis is placed on questions such
as: What do probabilistic models really mean, in real-life terms? How does one choose a model?
How do we assess the practical usefulness of models? This aspect is so important that there is
a separate chapter for this, titled Introduction to Model Building. Throughout the text, there is
considerable discussion of the real-world meaning of probabilistic concepts. For instance, when
probability density functions are introduced, there is an extended discussion regarding the intuitive
meaning of densities in light of the inherently-discrete nature of real data, due to the finite precision
of measurement.
Finally, the R statistical/data analysis language is used throughout. Again, several excellent texts
on probability and statistics have been written that feature R, but this book, by virtue of having a
computer science audience, uses R in a more sophisticated manner. My open source tutorial on R
programming, R for Programmers (http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf),
can be used as a supplement. (More advanced R programming is covered in my book, The Art of
R Programming, No Starch Press, 2011.)
There is a large amount of material here. For my one-quarter undergraduate course, I usually
cover Chapters 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13 and 16. My lecture style is conversational,
referring to material in the book and making lots of supplementary remarks (“What if we changed
the assumption here to such-and-such?” etc.). Students read the details on their own. For my
one-quarter graduate course, I cover Chapters 8, ??, ??, ??, ??, 14, ??, 16, 17, 18 and ??.
As prerequisites, the student must know calculus, basic matrix algebra, and have some skill in
programming. As with any text in probability and statistics, it is also necessary that the student
has a good sense of math intuition, and does not treat mathematics as simply memorization of
formulas.
The LaTeX source .tex files for this book are at http://heather.cs.ucdavis.edu/~matloff/132/
PLN, so readers can copy the R code and experiment with it. (It is not recommended to copy-and-
paste from the PDF file, as hidden characters may be copied.) The PDF file is searchable.
The following, among many, provided valuable feedback for which I am very grateful: Ahmed
Ahmedin; Stuart Ambler; Earl Barr; Benjamin Beasley; Matthew Butner; Michael Clifford; Dipak
Ghosal; Noah Gift; Laura Matloff; Nelson Max; Connie Nguyen; Jack Norman; Richard Oehrle;
Yingkang Xie; and Ivana Zetko.
Many of the data sets used in the book are from the UC Irvine Machine Learning Repository, http:
//archive.ics.uci.edu/ml/. Thanks to UCI for making available this very valuable resource.
The book contains a number of references for further reading. Since the audience includes a number
of students at my institution, the University of California, Davis, I often refer to work by current
or former UCD faculty, so that students can see what their professors do in research.
This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 United States
License. The details may be viewed at http://creativecommons.org/licenses/by-nd/3.0/us/,
but in essence it states that you are free to use, copy and distribute the work, but you must
attribute the work to me and not “alter, transform, or build upon” it. If you are using the book,
either in teaching a class or for your own learning, I would appreciate your informing me. I retain
copyright in all non-U.S. jurisdictions, but permission to use these materials in teaching is still
granted, provided the licensing information here is displayed.
Chapter 1

Time Waste Versus Empowerment
I took a course in speed reading, and read War and Peace in 20 minutes. It’s about Russia—
comedian Woody Allen
I learned very early the difference between knowing the name of something and knowing something—
Richard Feynman, Nobel laureate in physics
The main goal [of this course] is self-actualization through the empowerment of claiming your
education—UCSC (and former UCD) professor Marc Mangel, in the syllabus for his calculus course
What does this really mean? Hmm, I’ve never thought about that—UCD PhD student in statistics,
in answer to a student who asked the actual meaning of a very basic concept
You have a PhD in mechanical engineering. You may have forgotten technical details like d/dt sin(t) =
cos(t), but you should at least understand the concepts of rates of change—the author, gently chiding
a friend who was having trouble following a simple quantitative discussion of trends in California’s
educational system
The field of probability and statistics (which, for convenience, I will refer to simply as “statistics”
below) impacts many aspects of our daily lives—business, medicine, the law, government and so
on. Consider just a few examples:
• The statistical models used on Wall Street made the “quants” (quantitative analysts) rich—
but also contributed to the worldwide financial crash of 2008.
• In a court trial, large sums of money or the freedom of an accused may hinge on whether the
judge and jury understand some statistical evidence presented by one side or the other.
• Wittingly or unconsciously, you are using probability every time you gamble in a casino—and
every time you buy insurance.
• Statistics is used to determine whether a new medical treatment is safe/effective for you.
• Statistics is used to flag possible terrorists—but sometimes unfairly singling out innocent
people while other times missing ones who really are dangerous.
Clearly, statistics matters. But it only has value when one really understands what it means and
what it does. Indeed, blindly plugging into statistical formulas can be not only valueless but in
fact highly dangerous, say if a bad drug goes onto the market.
Yet most people view statistics as exactly that—mindless plugging into boring formulas. If even
the statistics graduate student quoted above thinks this, how can the students taking the course
be blamed for taking that attitude?
I once had a student who had an unusually good understanding of probability. It turned out that
this was due to his being highly successful at playing online poker, winning lots of cash. No blind
formula-plugging for him! He really had to understand how probability works.
Statistics is not just a bunch of formulas. On the contrary, it can be mathematically deep, for those
who like that kind of thing. (Much of statistics can be viewed as the Pythagorean Theorem in
n-dimensional or even infinite-dimensional space.) But the key point is that anyone who has taken
a calculus course can develop true understanding of statistics, of real practical value. As Professor
Mangel says, that’s empowering.
So as you make your way through this book, always stop to think, “What does this equation really
mean? What is its goal? Why are its ingredients defined in the way they are? Might there be a
better way? How does this relate to our daily lives?” Now THAT is empowering.
Chapter 2

Basic Probability Models
This chapter will introduce the general notions of probability. Most of it will seem intuitive to you,
but pay careful attention to the general principles which are developed; in more complex settings
intuition may not be enough, and the tools discussed here will be very useful.
Throughout this book, we will be discussing both “classical” probability examples involving coins,
cards and dice, and also examples involving applications to computer science. The latter will involve
diverse fields such as data mining, machine learning, computer networks, software engineering and
bioinformatics.
In this section, an example from computer networks is presented which will be used at a number
of points in this chapter. Probability analysis is used extensively in the development of new, faster
types of networks.
Today’s Ethernet evolved from an experimental network developed at the University of Hawaii,
called ALOHA. A number of network nodes would occasionally try to use the same radio channel to
communicate with a central computer. The nodes couldn’t hear each other, due to the obstruction
of mountains between them. If only one of them made an attempt to send, it would be successful,
and it would receive an acknowledgement message in response from the central computer. But if
more than one node were to transmit, a collision would occur, garbling all the messages. The
sending nodes would timeout after waiting for an acknowledgement which never came, and try
sending again later. To avoid having too many collisions, nodes would engage in random backoff,
meaning that they would refrain from sending for a while even though they had something to send.
One variation is slotted ALOHA, which divides time into intervals which I will call “epochs.” Each
epoch will have duration 1.0, so epoch 1 extends from time 0.0 to 1.0, epoch 2 extends from 1.0 to
2.0 and so on. In the version we will consider here, in each epoch, if a node is active, i.e. has a
message to send, it will either send or refrain from sending, with probability p and 1-p. The value
of p is set by the designer of the network. (Real Ethernet hardware does something like this, using
a random number generator inside the chip.)
The other parameter q in our model is the probability that a node which had been inactive generates
a message during an epoch, i.e. the probability that the user hits a key, and thus becomes “active.”
Think of what happens when you are at a computer. You are not typing constantly, and when you
are not typing, the time until you hit a key again will be random. Our parameter q models that
randomness.
Let n be the number of nodes, which we’ll assume for simplicity is two. Assume also for simplicity
that the timing is as follows. Arrival of a new message happens in the middle of an epoch, and the
decision as to whether to send versus back off is made near the end of an epoch, say 90% into the
epoch.
For example, say that at the beginning of the epoch which extends from time 15.0 to 16.0, node A
has something to send but node B does not. At time 15.5, node B will either generate a message
to send or not, with probability q and 1-q, respectively. Suppose B does generate a new message.
At time 15.9, node A will either try to send or refrain, with probability p and 1-p, and node B will
do the same. Suppose A refrains but B sends. Then B’s transmission will be successful, and at the
start of epoch 16 B will be inactive, while node A will still be active. On the other hand, suppose
both A and B try to send at time 15.9; both will fail, and thus both will be active at time 16.0,
and so on.
Be sure to keep in mind that in our simple model here, during the time a node is active, it won’t
generate any additional new messages.
(Note: The definition of this ALOHA model is summarized concisely on page 10.)
Let’s observe the network for two epochs, epoch 1 and epoch 2. Assume that the network consists
of just two nodes, called node 1 and node 2, both of which start out active. Let X1 and X2 denote
the numbers of active nodes at the very end of epochs 1 and 2, after possible transmissions. We’ll
take p to be 0.4 and q to be 0.8 in this example.
Let’s find P (X1 = 2), the probability that X1 = 2, and then get to the main point, which is to ask
what we really mean by this probability.
How could X1 = 2 occur? There are two possibilities: either both nodes try to send, causing a
collision, or neither node tries to send. (If exactly one node sends, its transmission succeeds and it
becomes inactive, making X1 = 1.) Thus

P(X1 = 2) = 0.4^2 + 0.6^2 = 0.16 + 0.36 = 0.52
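The 0.52 figure can be checked by simulating epoch 1 many times. Here is a minimal sketch in Python (the book's own code is in R; Python is used here purely for illustration). Note that q plays no role in epoch 1, since both nodes start out active, and active nodes generate no new messages:

```python
import random

def simulate_epoch1(n_reps, p=0.4, seed=1):
    """Estimate P(X1 = 2), where both nodes start epoch 1 active.

    Each active node sends with probability p. A lone sender succeeds
    and becomes inactive (so X1 = 1); a collision (both send) or
    silence (neither sends) leaves both nodes active (X1 = 2)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_reps):
        a_sends = rng.random() < p
        b_sends = rng.random() < p
        if a_sends == b_sends:   # collision, or neither sent
            count += 1
    return count / n_reps

print(simulate_epoch1(100000))   # should be near 0.4^2 + 0.6^2 = 0.52
```

With p = 0.4 the exact value is p^2 + (1 − p)^2 = 0.16 + 0.36 = 0.52, and the simulated fraction converges to it as the number of repetitions grows.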
It’s crucial to understand what that 0.52 figure really means in a practical sense. To this end, let’s
put the ALOHA example aside for a moment, and consider the “experiment” consisting of rolling
two dice, say a blue one and a yellow one. Let X and Y denote the number of dots we get on the
blue and yellow dice, respectively, and consider the meaning of P(X + Y = 6) = 5/36.
In the mathematical theory of probability, we talk of a sample space, which (in simple cases)
consists of the possible outcomes (X, Y ), seen in Table 2.1. In a theoretical treatment, we place
weights of 1/36 on each of the points in the space, reflecting the fact that each of the 36 points is
equally likely, and then say, “What we mean by P(X + Y = 6) = 5/36 is that the outcomes (1,5),
(2,4), (3,3), (4,2), (5,1) have total weight 5/36.”
Unfortunately, the notion of sample space becomes mathematically tricky when developed for more
complex probability models. Indeed, it requires graduate-level math. And much worse, one loses all
the intuition. In any case, most probability computations do not rely on explicitly writing down a
sample space. In this particular example it is useful for us as a vehicle for explaining the concepts,
but we will NOT use it much. Those who wish to get a more theoretical grounding can get a start
in Section 3.22.
But the intuitive notion—which is FAR more important—of what P(X + Y = 6) = 5/36 means is
the following. Imagine doing the experiment many, many times, recording the results in a large
notebook:
• Roll the dice the first time, and write the outcome on the first line of the notebook.
• Roll the dice the second time, and write the outcome on the second line of the notebook.
• Roll the dice the third time, and write the outcome on the third line of the notebook.
• Roll the dice the fourth time, and write the outcome on the fourth line of the notebook.
• Imagine you keep doing this, thousands of times, filling thousands of lines in the notebook.
The first 9 lines of the notebook might look like Table 2.2. Here 2/9 of these lines say Yes. But
after many, many repetitions, approximately 5/36 of the lines will say Yes. For example, after
doing the experiment 720 times, approximately 5/36 × 720 = 100 lines will say Yes.
This is what probability really is: In what fraction of the lines does the event of interest happen?
It sounds simple, but if you always think about this “lines in the notebook” idea,
probability problems are a lot easier to solve. And it is the fundamental basis of computer
simulation.
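The “lines in the notebook” idea is exactly how a computer simulation works. The following sketch in Python (the book's examples use R; Python appears here only for illustration) rolls the two dice many times and reports the fraction of lines on which the event X + Y = 6 occurred:

```python
import random

def simulate_dice(n_reps, seed=1):
    """Roll a blue and a yellow die n_reps times; return the fraction
    of 'notebook lines' on which the event X + Y = 6 occurred."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n_reps):
        x = rng.randint(1, 6)   # blue die
        y = rng.randint(1, 6)   # yellow die
        if x + y == 6:
            yes += 1
    return yes / n_reps

print(simulate_dice(100000))   # should be near 5/36 ≈ 0.1389
```

Running this with 720 repetitions instead of 100,000 gives roughly 100 Yes lines, matching the calculation above, though with more random fluctuation.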
2.3 Our Definitions

These definitions are intuitive, rather than rigorous math, but intuition is what we need. Keep in
mind that we are making definitions below, not listing properties.
• We imagine performing the experiment a large number of times, recording the result of each
repetition on a separate line in a notebook.
• We say A is an event for this experiment if it is a possible boolean (i.e. yes-or-no) outcome
of the experiment. In the above example, here are some events:
* X+Y = 6
* X=1
* Y=3
* X-Y = 4
• For any event of interest A, imagine a column for A in the notebook. The k-th line in the
notebook, k = 1,2,3,..., will say Yes or No, depending on whether A occurred or not during
the k-th repetition of the experiment. For instance, we have such a column in our table above,
for the event {blue + yellow = 6}.
• For any event of interest A, we define P(A) to be the long-run fraction of lines with Yes
entries.
• For any events A, B, imagine a new column in our notebook, labeled “A and B.” In each line,
this column will say Yes if and only if there are Yes entries for both A and B. P(A and B) is
then the long-run fraction of lines with Yes entries in the new column labeled “A and B.”[1]
• For any events A, B, imagine a new column in our notebook, labeled “A or B.” In each line,
this column will say Yes if and only if at least one of the entries for A and B says Yes.[2]
• For any events A, B, imagine a new column in our notebook, labeled “A | B” and pronounced
“A given B.” In each line:
[1] In most textbooks, what we call “A and B” here is written A ∩ B, indicating the intersection of two sets in the
sample space. But again, we do not take a sample space point of view here.
[2] In the sample space approach, this is written A ∪ B.
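The “A and B” and “A or B” columns can be emulated directly in code. In this sketch (Python, for illustration; the particular events chosen are hypothetical examples, not from the text), A is the event X + Y = 6 and B is the event X = 1, with X and Y the blue and yellow dice from the earlier example:

```python
import random

def and_or_fractions(n_reps, seed=1):
    """For each simulated 'notebook line', fill in the 'A and B' and
    'A or B' columns, where A = {X+Y = 6} and B = {X = 1}; return the
    long-run fractions of Yes entries in those two columns."""
    rng = random.Random(seed)
    n_and = n_or = 0
    for _ in range(n_reps):
        x = rng.randint(1, 6)   # blue die
        y = rng.randint(1, 6)   # yellow die
        a = (x + y == 6)
        b = (x == 1)
        if a and b:
            n_and += 1
        if a or b:
            n_or += 1
    return n_and / n_reps, n_or / n_reps

p_and, p_or = and_or_fractions(100000)
print(p_and, p_or)   # near 1/36 ≈ 0.028 and 10/36 ≈ 0.278
```

Here P(A and B) = P(X = 1, Y = 5) = 1/36, and P(A or B) = 5/36 + 6/36 − 1/36 = 10/36; the simulated fractions approach these values as the number of lines grows.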