Machine Learning Refined

With its intuitive yet rigorous approach to machine learning, this text provides students
with the fundamental knowledge and practical tools needed to conduct research and
build data-driven products. The authors prioritize geometric intuition and algorithmic
thinking, and include detail on all the essential mathematical prerequisites, to offer a
fresh and accessible way to learn. Practical applications are emphasized, with examples
from disciplines including computer vision, natural language processing, economics,
neuroscience, recommender systems, physics, and biology. Over 300 color illustra-
tions are included and have been meticulously designed to enable an intuitive grasp
of technical concepts, and over 100 in-depth coding exercises (in Python) provide a
real understanding of crucial machine learning algorithms. A suite of online resources
including sample code, data sets, interactive lecture slides, and a solutions manual are
provided online, making this an ideal text both for graduate courses on machine learning
and for individual reference and self-study.

Jeremy Watt received his PhD in Electrical Engineering from Northwestern University,
and is now a machine learning consultant and educator. He teaches machine learning,
deep learning, mathematical optimization, and reinforcement learning at Northwestern
University.

Reza Borhani received his PhD in Electrical Engineering from Northwestern University,

and is now a machine learning consultant and educator. He teaches a variety of courses
in machine learning and deep learning at Northwestern University.

Aggelos K. Katsaggelos is the Joseph Cummings Professor at Northwestern University,

where he heads the Image and Video Processing Laboratory. He is a Fellow of IEEE,
SPIE, EURASIP, and OSA and the recipient of the IEEE Third Millennium Medal
(2000).
Machine Learning Refined

Foundations, Algorithms, and Applications

J E R E M Y W AT T
Northwestern University, Illinois

REZA BORHANI
Northwestern University, Illinois

A G G E L O S K . K AT S A G G E L O S
Northwestern University, Illinois
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge.


It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title:
www.cambridge.org/9781108480727
DOI: 10.1017/9781108690935
© Cambridge University Press 2020
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2020
Printed and bound in Great Britain by Clays Ltd, Elcograf S.p.A.
A catalogue record for this publication is available from the British Library.
ISBN 978-1-108-48072-7 Hardback
Additional resources for this publication at www.cambridge.org/watt2
Cambridge University Press has no responsibility for the persistence or accuracy
of URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
To our families:

Deb, Robert, and Terri

Soheila, Ali, and Maryam

Ειρήνη, Ζωή, Σοφία, and Ειρήνη
Contents

Preface page xii
Acknowledgements xxii
1 Introduction to Machine Learning 1
1.1 Introduction 1
1.2 Distinguishing Cats from Dogs: a Machine Learning Approach 1
1.3 The Basic Taxonomy of Machine Learning Problems 6
1.4 Mathematical Optimization 16
1.5 Conclusion 18
Part I Mathematical Optimization 19
2 Zero-Order Optimization Techniques 21
2.1 Introduction 21
2.2 The Zero-Order Optimality Condition 23
2.3 Global Optimization Methods 24
2.4 Local Optimization Methods 27
2.5 Random Search 31
2.6 Coordinate Search and Descent 39
2.7 Conclusion 40
2.8 Exercises 42
3 First-Order Optimization Techniques 45
3.1 Introduction 45
3.2 The First-Order Optimality Condition 45
3.3 The Geometry of First-Order Taylor Series 52
3.4 Computing Gradients Efficiently 55
3.5 Gradient Descent 56
3.6 Two Natural Weaknesses of Gradient Descent 65
3.7 Conclusion 71
3.8 Exercises 71
4 Second-Order Optimization Techniques 75
4.1 The Second-Order Optimality Condition 75

4.2 The Geometry of Second-Order Taylor Series 78


4.3 Newton’s Method 81
4.4 Two Natural Weaknesses of Newton’s Method 90
4.5 Conclusion 91
4.6 Exercises 92
Part II Linear Learning 97
5 Linear Regression 99
5.1 Introduction 99
5.2 Least Squares Linear Regression 99
5.3 Least Absolute Deviations 108
5.4 Regression Quality Metrics 111
5.5 Weighted Regression 113
5.6 Multi-Output Regression 116
5.7 Conclusion 120
5.8 Exercises 121
5.9 Endnotes 124
6 Linear Two-Class Classification 125
6.1 Introduction 125
6.2 Logistic Regression and the Cross Entropy Cost 125
6.3 Logistic Regression and the Softmax Cost 135
6.4 The Perceptron 140
6.5 Support Vector Machines 150
6.6 Which Approach Produces the Best Results? 157
6.7 The Categorical Cross Entropy Cost 158
6.8 Classification Quality Metrics 160
6.9 Weighted Two-Class Classification 167
6.10 Conclusion 170
6.11 Exercises 171
7 Linear Multi-Class Classification 174
7.1 Introduction 174
7.2 One-versus-All Multi-Class Classification 174
7.3 Multi-Class Classification and the Perceptron 184
7.4 Which Approach Produces the Best Results? 192
7.5 The Categorical Cross Entropy Cost Function 193
7.6 Classification Quality Metrics 198
7.7 Weighted Multi-Class Classification 202
7.8 Stochastic and Mini-Batch Learning 203
7.9 Conclusion 205
7.10 Exercises 205

8 Linear Unsupervised Learning 208

8.1 Introduction 208

8.2 Fixed Spanning Sets, Orthonormality, and Projections 208

8.3 The Linear Autoencoder and Principal Component Analysis 213

8.4 Recommender Systems 219

8.5 K-Means Clustering 221

8.6 General Matrix Factorization Techniques 227

8.7 Conclusion 230

8.8 Exercises 231

8.9 Endnotes 233

9 Feature Engineering and Selection 237

9.1 Introduction 237

9.2 Histogram Features 238

9.3 Feature Scaling via Standard Normalization 249

9.4 Imputing Missing Values in a Dataset 254

9.5 Feature Scaling via PCA-Sphering 255

9.6 Feature Selection via Boosting 258

9.7 Feature Selection via Regularization 264

9.8 Conclusion 269

9.9 Exercises 269

Part III Nonlinear Learning 273

10 Principles of Nonlinear Feature Engineering 275

10.1 Introduction 275

10.2 Nonlinear Regression 275

10.3 Nonlinear Multi-Output Regression 282

10.4 Nonlinear Two-Class Classification 286

10.5 Nonlinear Multi-Class Classification 290

10.6 Nonlinear Unsupervised Learning 294

10.7 Conclusion 298

10.8 Exercises 298

11 Principles of Feature Learning 304

11.1 Introduction 304

11.2 Universal Approximators 307

11.3 Universal Approximation of Real Data 323


11.4 Naive Cross-Validation 335

11.5 Efficient Cross-Validation via Boosting 340

11.6 Efficient Cross-Validation via Regularization 350

11.7 Testing Data 361

11.8 Which Universal Approximator Works Best in Practice? 365

11.9 Bagging Cross-Validated Models 366



11.10 K-Fold Cross-Validation 373


11.11 When Feature Learning Fails 378
11.12 Conclusion 379
11.13 Exercises 380
12 Kernel Methods 383
12.1 Introduction 383
12.2 Fixed-Shape Universal Approximators 383
12.3 The Kernel Trick 386
12.4 Kernels as Measures of Similarity 396
12.5 Optimization of Kernelized Models 397
12.6 Cross-Validating Kernelized Learners 398
12.7 Conclusion 399
12.8 Exercises 399
13 Fully Connected Neural Networks 403
13.1 Introduction 403
13.2 Fully Connected Neural Networks 403
13.3 Activation Functions 424
13.4 The Backpropagation Algorithm 427
13.5 Optimization of Neural Network Models 428
13.6 Batch Normalization 430
13.7 Cross-Validation via Early Stopping 438
13.8 Conclusion 440
13.9 Exercises 441
14 Tree-Based Learners 443
14.1 Introduction 443
14.2 From Stumps to Deep Trees 443
14.3 Regression Trees 446
14.4 Classification Trees 452
14.5 Gradient Boosting 458
14.6 Random Forests 462
14.7 Cross-Validation Techniques for Recursively Defined Trees 464
14.8 Conclusion 467
14.9 Exercises 467
Part IV Appendices 471
Appendix A Advanced First- and Second-Order Optimization Methods 473
A.1 Introduction 473
A.2 Momentum-Accelerated Gradient Descent 473
A.3 Normalized Gradient Descent 478
A.4 Advanced Gradient-Based Methods 485

A.5 Mini-Batch Optimization 487

A.6 Conservative Steplength Rules 490

A.7 Newton’s Method, Regularization, and Nonconvex Functions 499

A.8 Hessian-Free Methods 502

Appendix B Derivatives and Automatic Differentiation 511

B.1 Introduction 511

B.2 The Derivative 511

B.3 Derivative Rules for Elementary Functions and Operations 514

B.4 The Gradient 516

B.5 The Computation Graph 517

B.6 The Forward Mode of Automatic Differentiation 520

B.7 The Reverse Mode of Automatic Differentiation 526

B.8 Higher-Order Derivatives 529

B.9 Taylor Series 531

B.10 Using the autograd Library 536

Appendix C Linear Algebra 546

C.1 Introduction 546

C.2 Vectors and Vector Operations 546

C.3 Matrices and Matrix Operations 553

C.4 Eigenvalues and Eigenvectors 556

C.5 Vector and Matrix Norms 559

References 564

Index 569
Preface

For eons we humans have sought out rules or patterns that accurately describe

how important systems in the world around us work, whether these systems

be agricultural, biological, physical, financial, etc. We do this because such rules

allow us to understand a system better, accurately predict its future behavior

and ultimately, control it. However, the process of finding the ”right” rule that

seems to govern a given system has historically been no easy task. For most of

our history data (glimpses of a given system at work) has been an extremely

scarce commodity. Moreover, our ability to compute, to try out various rules

to see which most accurately represents a phenomenon, has been limited to

what we could accomplish by hand. Both of these factors naturally limited

the range of phenomena scientific pioneers of the past could investigate and

inevitably forced them to use philosophical and/or visual approaches to rule-

finding. Today, however, we live in a world awash in data, and have colossal

computing power at our fingertips. Because of this, we lucky descendants of the

great pioneers can tackle a much wider array of problems and take a much more

empirical approach to rule-finding than our forbears could. Machine learning,

the topic of this textbook, is a term used to describe a broad (and growing)

collection of pattern-finding algorithms designed to properly identify system

rules empirically and by leveraging our access to potentially enormous amounts

of data and computing power.

In the past decade the user base of machine learning has grown dramatically.

From a relatively small circle in computer science, engineering, and mathe-

matics departments the users of machine learning now include students and

researchers from every corner of the academic universe, as well as members of

industry, data scientists, entrepreneurs, and machine learning enthusiasts. This

textbook is the result of a complete tearing down of the standard curriculum

of machine learning into its most fundamental components, and a curated re-

assembly of those pieces (painstakingly polished and organized) that we feel

will most benefit this broadening audience of learners. It contains fresh and

intuitive yet rigorous descriptions of the most fundamental concepts necessary

to conduct research, build products, and tinker.



Book Overview
The second edition of this text is a complete revision of our first endeavor, with

virtually every chapter of the original rewritten from the ground up and eight

new chapters of material added, doubling the size of the first edition. Topics from

the first edition, from expositions on gradient descent to those on One-versus-

All classification and Principal Component Analysis have been reworked and

polished. A swath of new topics have been added throughout the text, from

derivative-free optimization to weighted supervised learning, feature selection,

nonlinear feature engineering, boosting-based cross-validation, and more.

While heftier in size, the intent of our original attempt has remained un-

changed: to explain machine learning, from first principles to practical imple-

mentation, in the simplest possible terms. A big-picture breakdown of the second

edition text follows below.

Part I: Mathematical Optimization (Chapters 2–4)


Mathematical optimization is the workhorse of machine learning, powering not

only the tuning of individual machine learning models (introduced in Part II)

but also the framework by which we determine appropriate models themselves

via cross-validation (discussed in Part III of the text).

In this first part of the text we provide a complete introduction to mathemat-

ical optimization, from basic zero-order (derivative-free) methods detailed in

Chapter 2 to fundamental and advanced first-order and second-order methods

in Chapters 3 and 4, respectively. More specifically this part of the text con-

tains complete descriptions of local optimization, random search methodologies,

gradient descent, and Newton’s method.

Part II: Linear Learning (Chapters 5–9)


In this part of the text we describe the fundamental components of cost function

based machine learning, with an emphasis on linear models.

This includes a complete description of supervised learning in Chapters 5–7

including linear regression, two-class, and multi-class classification. In each of

these chapters we describe a range of perspectives and popular design choices

made when building supervised learners.

In Chapter 8 we similarly describe unsupervised learning, and Chapter 9 con-

tains an introduction to fundamental feature engineering practices including pop-

ular histogram features as well as various input normalization schemes, and

feature selection paradigms.



Part III: Nonlinear Learning (Chapters 10–14)


In the final part of the text we extend the fundamental paradigms introduced in

Part II to the general nonlinear setting.

We do this carefully beginning with a basic introduction to nonlinear super-

vised and unsupervised learning in Chapter 10, where we introduce the motiva-

tion, common terminology, and notation of nonlinear learning used throughout

the remainder of the text.

In Chapter 11 we discuss how to automate the selection of appropriate non-

linear models, beginning with an introduction to universal approximation. This

naturally leads to detailed descriptions of cross-validation, as well as boosting,

regularization, ensembling, and K-folds cross-validation.

With these fundamental ideas in-hand, in Chapters 12–14 we then dedicate an

individual chapter to each of the three popular universal approximators used in

machine learning: fixed-shape kernels, neural networks, and trees, where we discuss

the strengths, weaknesses, technical eccentricities, and usages of each popular

universal approximator.

To get the most out of this part of the book we strongly recommend that

Chapter 11 and the fundamental ideas therein are studied and understood before

moving on to Chapters 12–14.

Part IV: Appendices


This shorter set of appendix chapters provides a complete treatment on ad-

vanced optimization techniques, as well as a thorough introduction to a range

of subjects that the readers will need to understand in order to make full use of

the text.

Appendix A continues our discussion from Chapters 3 and 4, and describes

advanced first- and second-order optimization techniques. This includes a discussion

of popular extensions of gradient descent, including mini-batch optimization,

momentum acceleration, gradient normalization, and the result of combining these

enhancements in various ways (producing e.g., the RMSProp and Adam first

order algorithms) – and Newton’s method – including regularization schemes

and Hessian-free methods.

Appendix B contains a tour of computational calculus including an introduction to the derivative/gradient, higher-order derivatives, the Hessian matrix, numerical differentiation, forward and backward (backpropagation) automatic differentiation, and Taylor series approximations.

Appendix C provides a suitable background in linear and matrix algebra, including vector/matrix arithmetic, the notions of spanning sets and orthogonality, as well as eigenvalues and eigenvectors.



Readers: How To Use This Book


This textbook was written with first-time learners of the subject in mind, as

well as for more knowledgeable readers who yearn for a more intuitive and

serviceable treatment than what is currently available today. To make full use of

the text one needs only a basic understanding of vector algebra (mathematical

functions, vector arithmetic, etc.) and computer programming (for example,

basic proficiency with a dynamically typed language like Python). We provide


complete introductory treatments of other prerequisite topics including linear

algebra, vector calculus, and automatic differentiation in the appendices of the


text. Example ”roadmaps,” shown in Figures 0.1–0.4, provide suggested paths

for navigating the text based on a variety of learning outcomes and university

courses (ranging from a course on the essentials of machine learning to special

topics – as described further under ”Instructors: How to use this Book” below).

We believe that intuitive leaps precede intellectual ones, and to this end defer

the use of probabilistic and statistical views of machine learning in favor of a

fresh and consistent geometric perspective throughout the text. We believe that

this perspective not only permits a more intuitive understanding of individ-

ual concepts in the text, but also that it helps establish revealing connections

between ideas often regarded as fundamentally distinct (e.g., the logistic re-

gression and Support Vector Machine classifiers, kernels and fully connected

neural networks, etc.). We also highly emphasize the importance of mathemati-

cal optimization in our treatment of machine learning. As detailed in the ”Book

Overview” section above, optimization is the workhorse of machine learning

and is fundamental at many levels – from the tuning of individual models to

the general selection of appropriate nonlinearities via cross-validation. Because

of this a strong understanding of mathematical optimization is requisite if one

wishes to deeply understand machine learning, and if one wishes to be able to

implement fundamental algorithms.

To this end, we place significant emphasis on the design and implementa-

tion of algorithms throughout the text with implementations of fundamental

algorithms given in Python. These fundamental examples can then be used as


building blocks for the reader to help complete the text’s programming exer-

cises, allowing them to ”get their hands dirty” and ”learn by doing,” practicing

the concepts introduced in the body of the text. While in principle any program-

ming language can be used to complete the text’s coding exercises, we highly

recommend using Python for its ease of use and large support community. We
also recommend using the open-source Python libraries NumPy, autograd, and
matplotlib, as well as the Jupyter notebook editor to make implementing and
testing code easier. A complete set of installation instructions, datasets, as well

as starter notebooks for many exercises can be found at

https://github.com/jermwatt/machine_learning_refined
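
As a minimal sketch of the kind of workflow these libraries enable (the function g and the evaluation point below are arbitrary illustrative choices, not taken from the text), autograd can automatically produce the gradient of an ordinary Python function:

```python
# A minimal sketch of the autograd workflow recommended above; the function g
# and the point w are illustrative choices only.
import autograd.numpy as np   # thinly wrapped NumPy that autograd can differentiate through
from autograd import grad

def g(w):
    # a simple smooth function of a two-dimensional input
    return np.sum(w**2) + np.cos(w[0])

nabla_g = grad(g)              # grad returns a new function that evaluates the gradient of g
w = np.array([1.0, -2.0])
print(g(w), nabla_g(w))        # the value of g and its gradient at the point w
```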

Instructors: How To Use This Book


Chapter slides associated with this textbook, datasets, along with a large array of

instructional interactive Python widgets illustrating various concepts through-


out the text, can be found on the github repository accompanying this textbook

at

https://github.com/jermwatt/machine_learning_refined
This site also contains instructions for installing Python as well as a number

of other free packages that students will find useful in completing the text’s

exercises.

This book has been used as a basis for a number of machine learning courses

at Northwestern University, ranging from introductory courses suitable for un-

dergraduate students to more advanced courses on special topics focusing on

optimization and deep learning for graduate students. With its treatment of

foundations, applications, and algorithms this text can be used as a primary

resource for, or a fundamental component of, courses such as the following.

Machine learning essentials treatment: an introduction to the essentials

of machine learning is ideal for undergraduate students, especially those in

quarter-based programs and universities where a deep dive into the entirety

of the book is not feasible due to time constraints. Topics for such a course

can include: gradient descent, logistic regression, Support Vector Machines,

One-versus-All and multi-class logistic regression, Principal Component Anal-

ysis, K-means clustering, the essentials of feature engineering and selection,

cross-validation, regularization, ensembling, bagging, kernel methods, fully

connected neural networks, and trees. A recommended roadmap for such a

course – including recommended chapters, sections, and corresponding topics

– is shown in Figure 0.1.

Machine learning full treatment: a standard machine learning course based

on this text expands on the essentials course outlined above both in terms

of breadth and depth. In addition to the topics mentioned in the essentials

course, instructors may choose to cover Newton’s method, Least Absolute

Deviations, multi-output regression, weighted regression, the Perceptron, the

Categorical Cross Entropy cost, weighted two-class and multi-class classifica-

tion, online learning, recommender systems, matrix factorization techniques,

boosting-based feature selection, universal approximation, gradient boosting,

random forests, as well as a more in-depth treatment of fully connected neu-

ral networks involving topics such as batch normalization and early-stopping-

based regularization. A recommended roadmap for such a course – including

recommended chapters, sections, and corresponding topics – is illustrated in

Figure 0.2.

Mathematical optimization for machine learning and deep learning: such

a course entails a comprehensive description of zero-, first-, and second-order

optimization techniques from Part I of the text (as well as Appendix A) in-

cluding: coordinate descent, gradient descent, Newton’s method, quasi-Newton

methods, stochastic optimization, momentum acceleration, fixed and adaptive

steplength rules, as well as advanced normalized gradient descent schemes

(e.g., Adam and RMSProp). These can be followed by an in-depth description

of the feature engineering processes (especially standard normalization and

PCA-sphering) that speed up (particularly first-order) optimization algorithms.

All students in general, and those taking an optimization for machine learning

course in particular, should appreciate the fundamental role optimization plays

in identifying the ”right” nonlinearity via the processes of boosting- and regular-

ization-based cross-validation, the principles of which are covered in Chapter

11. Select topics from Chapter 13 and Appendix B – including backpropagation,

batch normalization, and forward/backward mode of automatic differentiation
– can also be covered. A recommended roadmap for such a course – including

recommended chapters, sections, and corresponding topics – is given in Figure

0.3.

Introductory portion of a course on deep learning : such a course is best suit-

able for students who have had prior exposure to fundamental machine learning

concepts, and can begin with a discussion of appropriate first order optimiza-

tion techniques, with an emphasis on stochastic and mini-batch optimization,

momentum acceleration, and normalized gradient schemes such as Adam and

RMSProp. Depending on the audience, a brief review of fundamental elements

of machine learning may be needed using selected portions of Part II of the text.

A complete discussion of fully connected networks, including a discussion of

backpropagation and forward/backward mode of automatic differentiation, as
well as special topics like batch normalization and early-stopping-based cross-

validation, can then be made using Chapters 11, 13 , and Appendices A and B of

the text. A recommended roadmap for such a course – including recommended

chapters, sections, and corresponding topics – is shown in Figure 0.4. Additional

recommended resources on topics to complete a standard course on deep learn-

ing – like convolutional and recurrent networks – can be found by visiting the

text’s github repository.



CHAPTER   SECTIONS              TOPICS
1         1–5                   Machine Learning Taxonomy
2         1–5                   Global/Local Optimization, Curse of Dimensionality
3         1–5                   Gradient Descent
5         1–2                   Least Squares Linear Regression
6         1–3, 5, 6, 8          Logistic Regression, Cross Entropy/Softmax Cost, SVMs
7         1–4, 6                One-versus-All Multi-Class Logistic Regression
8         1–3, 5                Principal Component Analysis, K-means
9         2, 7                  Feature Engineering, Feature Selection
10        1, 2, 4               Nonlinear Regression, Nonlinear Classification
11        1–4, 6, 7, 9          Universal Approximation, Cross-Validation, Regularization, Ensembling, Bagging
12        1–3                   Kernel Methods, The Kernel Trick
13        1, 2, 4               Fully Connected Networks, Backpropagation
14        1–4                   Regression Trees, Classification Trees

Figure 0.1 Recommended study roadmap for a course on the essentials of machine

learning, including requisite chapters (left column), sections (middle column), and

corresponding topics (right column). This essentials plan is suitable for

time-constrained courses (in quarter-based programs and universities) or self-study, or

where machine learning is not the sole focus but a key component of some broader

course of study. Note that chapters are grouped together visually based on text layout

detailed under ”Book Overview” in the Preface. See the section titled ”Instructors: How

To Use This Book” in the Preface for further details.



CHAPTER   SECTIONS              TOPICS
1         1–5                   Machine Learning Taxonomy
2         1–5                   Global/Local Optimization, Curse of Dimensionality
3         1–5                   Gradient Descent
4         1–3                   Newton’s Method
5         1–6                   Least Squares Linear Regression, Least Absolute Deviations, Multi-Output Regression, Weighted Regression
6         1–10                  Logistic Regression, Cross Entropy/Softmax Cost, The Perceptron, SVMs, Categorical Cross Entropy, Weighted Two-Class Classification
7         1–9                   One-versus-All Multi-Class Logistic Regression, Weighted Multi-Class Classification, Online Learning
8         1–7                   PCA, K-means, Recommender Systems, Matrix Factorization
9         1–3, 6, 7             Feature Engineering, Feature Selection, Boosting, Regularization
10        1–7                   Nonlinear Supervised Learning, Nonlinear Unsupervised Learning
11        1–12                  Universal Approximation, Cross-Validation, Regularization, Ensembling, Bagging, K-Fold Cross-Validation
12        1–7                   Kernel Methods, The Kernel Trick
13        1–8                   Fully Connected Networks, Backpropagation, Activation Functions, Batch Normalization, Early Stopping
14        1–8                   Regression/Classification Trees, Gradient Boosting, Random Forests

Figure 0.2 Recommended study roadmap for a full treatment of standard machine

learning subjects, including chapters, sections, as well as corresponding topics to cover.

This plan entails a more in-depth coverage of machine learning topics compared to the

essentials roadmap given in Figure 0.1, and is best suited for senior undergraduate/early

graduate students in semester-based programs and passionate independent readers. See

the section titled ”Instructors: How To Use This Book” in the Preface for further details.

CHAPTER   SECTIONS              TOPICS
1         1–5                   Machine Learning Taxonomy
2         1–7                   Global/Local Optimization, Curse of Dimensionality, Random Search, Coordinate Descent
3         1–7                   Gradient Descent
4         1–5                   Newton’s Method
7         8                     Online Learning
9         3–5                   Feature Scaling, PCA-Sphering, Missing Data Imputation
11        5, 6                  Boosting, Regularization
13        6                     Batch Normalization
A         1–8                   Momentum Acceleration, Normalized Schemes (Adam, RMSProp), Fixed Lipschitz Steplength Rules, Backtracking Line Search, Stochastic/Mini-Batch Optimization, Hessian-Free Optimization
B         1–10                  Forward/Backward Mode of Automatic Differentiation

Figure 0.3 Recommended study roadmap for a course on mathematical optimization

for machine learning and deep learning, including chapters, sections, as well as topics

to cover. See the section titled ”Instructors: How To Use This Book” in the Preface for

further details.

CHAPTER   SECTIONS              TOPICS
3         1–7                   Gradient Descent
10        1–5                   Nonlinear Regression, Nonlinear Classification, Nonlinear Autoencoder
11        1–4, 6                Universal Approximation, Cross-Validation, Regularization
13        1–8                   Fully Connected Networks, Backpropagation, Activation Functions, Batch Normalization, Early Stopping
A         1–6                   Momentum Acceleration, Normalized Schemes (Adam, RMSProp), Fixed Lipschitz Steplength Rules, Backtracking Line Search, Stochastic/Mini-Batch Optimization
B         1–10                  Forward/Backward Mode of Automatic Differentiation

Figure 0.4 Recommended study roadmap for an introductory portion of a course on

deep learning, including chapters, sections, as well as topics to cover. See the section

titled ”Instructors: How To Use This Book” in the Preface for further details.
Acknowledgements

This text could not have been written in anything close to its current form

without the enormous work of countless genius-angels in the Python open-


source community, particularly authors and contributors of NumPy, Jupyter,
and matplotlib. We are especially grateful to the authors and contributors of

autograd including Dougal Maclaurin, David Duvenaud, Matt Johnson, and


Jamie Townsend, as autograd allowed us to experiment and iterate on a host of

new ideas included in the second edition of this text that greatly improved it as

well as, we hope, the learning experience for its readers.

We are also very grateful for the many students over the years that provided

insightful feedback on the content of this text, with special thanks to Bowen

Tian who provided copious amounts of insightful feedback on early drafts of

the work.

Finally, a big thanks to Mark McNess Rosengren and the entire Standing

Passengers crew for helping us stay caffeinated during the writing of this text.
1 Introduction to Machine
Learning

1.1 Introduction
Machine learning is a unified algorithmic framework designed to identify com-

putational models that accurately describe empirical data and the phenomena

underlying it, with little or no human involvement. While still a young dis-

cipline with much more awaiting discovery than is currently known, today

machine learning can be used to teach computers to perform a wide array

of useful tasks including automatic detection of objects in images (a crucial

component of driver-assisted and self-driving cars), speech recognition (which

powers voice command technology), knowledge discovery in the medical sci-

ences (used to improve our understanding of complex diseases), and predictive

analytics (leveraged for sales and economic forecasting), to just name a few.

In this chapter we give a high-level introduction to the field of machine

learning as well as the contents of this textbook.

1.2 Distinguishing Cats from Dogs: a Machine Learning Approach
To get a big-picture sense of how machine learning works, we begin by dis-

cussing a toy problem: teaching a computer how to distinguish pictures of cats

from those of dogs. This will allow us to informally describe the

terminology and procedures involved in solving the typical machine learning

problem.

Do you recall how you first learned about the difference between cats and
dogs, and how they are different animals? The answer is probably no, as most
humans learn to perform simple cognitive tasks like this very early on in the

course of their lives. One thing is certain, however: young children do not need

some kind of formal scientific training, or a zoological lecture on felis catus and

canis familiaris species, in order to be able to tell cats and dogs apart. Instead,

they learn by example. They are naturally presented with many images of

what they are told by a supervisor (a parent, a caregiver, etc.) are either cats

or dogs, until they fully grasp the two concepts. How do we know when a

child can successfully distinguish between cats and dogs? Intuitively, when

they encounter new (images of) cats and dogs, and can correctly identify each

new example or, in other words, when they can generalize what they have learned

to new, previously unseen, examples.

Like human beings, computers can be taught how to perform this sort of task

in a similar manner. This kind of task where we aim to teach a computer to

distinguish between different types or classes of things (here cats and dogs) is

referred to as a classification problem in the jargon of machine learning, and is

done through a series of steps which we detail below.

1. Data collection. Like human beings, a computer must be trained to recognize

the difference between these two types of animals by learning from a batch of
examples, typically referred to as a training set of data. Figure 1.1 shows such a

training set consisting of a few images of different cats and dogs. Intuitively, the
larger and more diverse the training set the better a computer (or human) can

perform a learning task, since exposure to a wider breadth of examples gives

the learner more experience.

Figure 1.1 A training set consisting of six images of cats (highlighted in blue) and six

images of dogs (highlighted in red). This set is used to train a machine learning model

that can distinguish between future images of cats and dogs. The images in this figure

were taken from [1].

2. Feature design. Think for a moment about how we (humans) tell the difference
between images containing cats from those containing dogs. We use color, size,

the shape of the ears or nose, and/or some combination of these features in order

to distinguish between the two. In other words, we do not just look at an image

as simply a collection of many small square pixels. We pick out grosser details,

or features, from images like these in order to identify what it is that we are

looking at. This is true for computers as well. In order to successfully train a

computer to perform this task (and any machine learning task more generally)

we need to provide it with properly designed features or, ideally, have it find or

learn such features itself.

Designing quality features is typically not a trivial task as it can be very ap-

plication dependent. For instance, a feature like color would be less helpful in

discriminating between cats and dogs (since many cats and dogs share similar

hair colors) than it would be in telling grizzly bears and polar bears apart! More-

over, extracting the features from a training dataset can also be challenging. For

example, if some of our training images were blurry or taken from a perspective

where we could not see the animal properly, the features we designed might

not be properly extracted.

However, for the sake of simplicity with our toy problem here, suppose we

can easily extract the following two features from each image in the training set:

size of nose relative to the size of the head, ranging from small to large, and shape

of ears, ranging from round to pointy.



Figure 1.2 Feature space representation of the training set shown in Figure 1.1 where

the horizontal and vertical axes represent the features nose size and ear shape,

respectively. The fact that the cats and dogs from our training set lie in distinct regions

of the feature space reflects a good choice of features.

Examining the training images shown in Figure 1.1 , we can see that all cats

have small noses and pointy ears, while dogs generally have large noses and

round ears. Notice that with the current choice of features each image can now

be represented by just two numbers: a number expressing the relative nose size,

and another number capturing the pointiness or roundness of the ears. In other

words, we can represent each image in our training set in a two-dimensional



feature space where the features nose size and ear shape are the horizontal and

vertical coordinate axes, respectively, as illustrated in Figure 1.2.
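
To make this representation concrete, the sketch below stores such a training set as a pair of NumPy arrays, one row of two feature values per image together with a label for each image. The numerical values are invented purely for illustration, since the text reports no actual measurements.

```python
import numpy as np

# Hypothetical feature values for the twelve training images of Figure 1.1:
# each row holds (relative nose size, ear pointiness), both scaled to [0, 1].
features = np.array([
    [0.15, 0.90], [0.20, 0.85], [0.10, 0.95], [0.25, 0.80], [0.15, 0.75], [0.30, 0.90],  # cats
    [0.80, 0.20], [0.70, 0.30], [0.90, 0.15], [0.75, 0.25], [0.85, 0.10], [0.65, 0.35],  # dogs
])
labels = np.array([+1] * 6 + [-1] * 6)   # +1 for cat, -1 for dog

print(features.shape)   # (12, 2): twelve images, each represented by just two numbers
```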

3. Model training. With our feature representation of the training data the

machine learning problem of distinguishing between cats and dogs is now a

simple geometric one: have the machine find a line or a curve that separates

the cats from the dogs in our carefully designed feature space. Supposing for

simplicity that we use a line, we must find the right values for its two parameters

– a slope and vertical intercept – that define the line’s orientation in the feature

space. The process of determining proper parameters relies on a set of tools

known as mathematical optimization detailed in Chapters 2 through 4 of this text,

and the tuning of such a set of parameters to a training set is referred to as the

training of a model.

Figure 1.3 shows a trained linear model (in black) which divides the feature

space into cat and dog regions. This linear model provides a simple compu-

tational rule for distinguishing between cats and dogs: when the feature rep-

resentation of a future image lies above the line (in the blue region) it will be

considered a cat by the machine, and likewise any representation that falls below

the line (in the red region) will be considered a dog.
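
As a rough code-level illustration of what such training amounts to (using invented feature values, and a standard two-class cost as a stand-in for the optimization tools of Chapters 2 through 4 and the classification costs of Chapter 6), the sketch below fits a line to a small toy dataset with plain gradient descent and then applies the resulting decision rule:

```python
import numpy as np

# Toy training data: (nose size, ear pointiness) per image; +1 = cat, -1 = dog.
# All values are invented for illustration.
X = np.array([[0.15, 0.90], [0.20, 0.85], [0.10, 0.95],   # cats
              [0.80, 0.20], [0.70, 0.30], [0.90, 0.15]])  # dogs
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

def model(w, X):
    # linear model: a bias plus a weighted sum of the two features
    return w[0] + X @ w[1:]

def gradient(w):
    # gradient of the cost sum_p log(1 + exp(-y_p * model_p)) with respect to w
    s = -y / (1.0 + np.exp(y * model(w, X)))
    return np.array([np.sum(s), *(X.T @ s)])

w = np.zeros(3)
for _ in range(2000):            # plain gradient descent with a fixed steplength
    w -= 0.1 * gradient(w)

# the trained line gives a computational rule: above the line, cat; below it, dog
new_image = np.array([0.30, 0.80])       # small nose, fairly pointy ears
print("cat" if model(w, new_image) > 0 else "dog")
```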



Figure 1.3 A trained linear model (shown in black) provides a computational rule for

distinguishing between cats and dogs. Any new image received in the future will be

classified as a cat if its feature representation lies above this line (in the blue region), and

a dog if the feature representation lies below this line (in the red region).

Figure 1.4 A validation set of cat and dog images (also taken from [1]). Notice that the

images in this set are not highlighted in red or blue (as was the case with the training set

shown in Figure 1.1) indicating that the true identity of each image is not revealed to the

learner. Notice that one of the dogs, the Boston terrier in the bottom right corner, has

both a small nose and pointy ears. Because of our chosen feature representation the

computer will think this is a cat!

4. Model validation. To validate the efficacy of our trained learner we now show
the computer a batch of previously unseen images of cats and dogs, referred to

generally as a validation set of data, and see how well it can identify the animal

in each image. In Figure 1.4 we show a sample validation set for the problem at

hand, consisting of three new cat and dog images. To do this, we take each new

image, extract our designed features (i.e., nose size and ear shape), and simply

check which side of our line (or classifier) the feature representation falls on. In

this instance, as can be seen in Figure 1.5, all of the new cats and all but one dog

from the validation set have been identified correctly by our trained model.
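
Continuing the training sketch above (it reuses the model function and trained weights w defined there, with feature values again invented for illustration), validation reduces to checking which side of the learned line each new feature vector falls on:

```python
# A continuation of the earlier training sketch: it assumes the model function
# and the trained weights w from that sketch are still in scope.
import numpy as np

X_val = np.array([[0.20, 0.85], [0.10, 0.90], [0.25, 0.95],   # three new cats
                  [0.85, 0.25], [0.75, 0.15], [0.30, 0.80]])  # three new dogs; the last mimics the Boston terrier
y_val = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

predictions = np.sign(model(w, X_val))               # +1: above the line (cat), -1: below the line (dog)
print(predictions, np.mean(predictions == y_val))    # the terrier-like point is misclassified as a cat
```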

The misidentification of the single dog (a Boston terrier) is largely the result

of our choice of features, which we designed based on the training set in Figure

1.1, and to some extent our decision to use a linear model (instead of a nonlinear

one). This dog has been misidentified simply because its features, a small nose

and pointy ears, match those of the cats from our training set. Therefore, while

it first appeared that a combination of nose size and ear shape could indeed

distinguish cats from dogs, we now see through validation that our training set

was perhaps too small and not diverse enough for this choice of features to be

completely effective in general.


We can take a number of steps to improve our learner. First and foremost we

should collect more data, forming a larger and more diverse training set. Second,

we can consider designing/including more discriminating features (perhaps eye

color, tail shape, etc.) that further help distinguish cats from dogs using a linear

model. Finally, we can also try out (i.e., train and validate) an array of nonlinear

models with the hopes that a more complex rule might better distinguish be-

tween cats and dogs. Figure 1.6 compactly summarizes the four steps involved

in solving our toy cat-versus-dog classification problem.




Figure 1.5 Identification of (the feature representation of) validation images using our

trained linear model. The Boston terrier (pointed to by an arrow) is misclassified as a cat

since it has pointy ears and a small nose, just like the cats in our training set.

Data collection → Feature design → Model training → Model validation

Figure 1.6 The schematic pipeline of our toy cat-versus-dog classification problem. The

same general pipeline is used for essentially all machine learning problems.

1.3 The Basic Taxonomy of Machine Learning Problems


The sort of computational rules we can learn using machine learning generally

fall into two main categories called supervised and unsupervised learning, which
we discuss next.

1.3.1 Supervised learning


Supervised learning problems (like the prototypical problem outlined in Section

1.2) refer to the automatic learning of computational rules involving input/out-

put relationships. Applicable to a wide array of situations and data types, this

type of problem comes in two forms, called regression and classification, depend-

ing on the general numerical form of the output.

Regression
Suppose we wanted to predict the share price of a company that is about to

go public. Following the pipeline discussed in Section 1.2, we first gather a

training set of data consisting of a number of corporations (preferably active in

the same domain) with known share prices. Next, we need to design feature(s)

that are thought to be relevant to the task at hand. The company’s revenue is one

such potential feature, as we can expect that the higher the revenue the more

expensive a share of stock should be. To connect the share price (output) to the

revenue (input) we can train a simple linear model or regression line using our

training data.

Figure 1.7 (top-left panel) A toy training dataset consisting of ten corporations’ share

price and revenue values. (top-right panel) A linear model is fit to the data. This trend

line models the overall trajectory of the points and can be used for prediction in the

future as shown in the bottom-left and bottom-right panels.

The top panels of Figure 1.7 show a toy dataset comprising share price versus

revenue information for ten companies, as well as a linear model fit to this data.

Once the model is trained, the share price of a new company can be predicted

based on its revenue, as depicted in the bottom panels of this figure. Finally,

comparing the predicted price to the actual price for a validation set of data

we can test the performance of our linear regression model and apply changes

as needed, for example, designing new features (e.g., total assets, total equity,

number of employees, years active, etc.) and/or trying more complex nonlinear

models.

This sort of task, i.e., fitting a model to a set of training data so that predictions

about a continuous-valued output (here, share price) can be made, is referred to as


regression. We begin our detailed discussion of regression in Chapter 5 with the

linear case, and move to nonlinear models starting in Chapter 10 and throughout

Chapters 11–14. Below we describe several additional examples of regression to

help solidify this concept.
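
For a minimal code-level view of this idea (the revenue and share-price numbers below are invented, since the toy dataset's values are not given in the text), a regression line can be fit by least squares and used to predict the output for a new input:

```python
import numpy as np

# Invented toy data in the spirit of Figure 1.7: revenue (input) and
# share price (output) for ten hypothetical companies.
revenue     = np.array([0.5, 1.0, 1.8, 2.4, 3.1, 3.7, 4.2, 5.0, 5.6, 6.3])
share_price = np.array([10., 13., 18., 22., 27., 30., 35., 40., 44., 50.])

# fit the regression line  share_price ~ w0 + w1 * revenue  by least squares
w1, w0 = np.polyfit(revenue, share_price, deg=1)

new_company_revenue = 4.5
print(w0 + w1 * new_company_revenue)   # predicted share price for the new company
```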

Example 1.1 The rise of student loan debt in the United States

Figure 1.8 (data taken from [2]) shows the total student loan debt (that is money

borrowed by students to pay for college tuition, room and board, etc.) held

by citizens of the United States from 2006 to 2014, measured quarterly. Over

the eight-year period reflected in this plot the student debt has nearly tripled,

totaling over one trillion dollars by the end of 2014. The regression line (in

black) fits this dataset quite well and, with its sharp positive slope, emphasizes

the point that student debt is rising dangerously fast. Moreover, if this trend

continues, we can use the regression line to predict that total student debt will

surpass two trillion dollars by the year 2026 (we revisit this problem later in

Exercise 5.1).

Figure 1.8 Figure associated with Example 1.1, illustrating total student loan debt in the

United States measured quarterly from 2006 to 2014. The rapid increase rate of the debt,

measured by the slope of the trend line fit to the data, confirms that student debt is

growing very fast. See text for further details.


Another random document with
no related content on Scribd:
decentration of the eye as if a prism were prescribed, nature
supplying its own decentration.

Treatment for Correcting Esophoria


in Children
In case of esophoria, regardless of amount, slightly increased
spherical power is frequently prescribed for children. This will
naturally blur or fog the patient’s vision, but in their effort to
overcome the blur, accommodation is relaxed, usually tending to
correct the muscular defect.
In such cases, as a rule, a quarter diopter increased spherical
strength may frequently be added for each degree of esophoria as
determined before the optical correction was made. In a case of 6
degrees of esophoria, the refractionist may prescribe +1.50 diopter
spherical added to the optical correction, which, let us assume, is
+1.00 sph. = -1.00 cyl. ax. 180°, so that the treatment glasses would
be +2.50 sph. = -1.00 ax. 180° (See Procedure on Page 74).
At the end of each three months’ period, the patient should be
requested to return, when the binocular and the duction test should
again be made, comparing results with the work previously
accomplished. An improvement tending to build up the left weak
externus will possibly permit of a decrease of the excessive spherical
power, so that excessive spherical power is reduced until completely
removed, in all probability overcoming the muscular defect.
Esophoria is almost invariably a false condition and frequently is
outgrown under this treatment as the child advances in years. On the
other hand, esophoria uncared for in the child may tend to produce
exophoria in the adult.

How Optical Correction Tends to


Decrease 6° Esophoria in a Child
Assume binocular muscle test made
before optical correction shows
6° Esophoria.
+1. Sph. = -1. Cyl. Ax. 180.

Next, locate faulty muscle by making a duction


test, which shows how abduction of left eye is
made to equal that of right eye, change being
made quarterly with treatment lenses in
accordance with following rule. Note as
abduction is increased, esophoria is reduced.
Rule—prescribe a quarter diopter increased
sphere for each degree of imbalance or 0.25
× 6 equals:
+1.50 added to optical correction.
1/1/19 (assumed date) prescribed treatment
lenses equal:
+2.50 = -1. × 180°.

4/1/19 (3 months later) assuming abduction has


increased from 2° to 3° showing difference of
5 Es. or 0.25 × 5. equals +1.25 added to
optical correction, prescribed treatment lenses
equal:
+2.25 = -1. × 180.

7/1/19 (3 months later), assuming abduction has


increased from 3° to 4° showing difference of
4° Es. or 0.25 × 4 equals +1.00 which added
to optical correction would make prescribed
treatment lenses equal:
+2.00 = -1. × 180.

And so on, every three months treatment lenses


are prescribed until both right and left eye
show 8° of abduction. In this way the
treatment lenses are reduced to original
correction of +1.00 = -100 × 180. This would
have required six changes of lenses, three
months apart—thus consuming 18 months
time.
Chapter X
SECOND METHOD OF TREATMENT—
MUSCULAR EXERCISE

Made With Two Rotary Prisms


and Red Maddox Rod

Exophoria

I f a case is one of exophoria of six degrees, where the second


method of treatment or muscular exercise is in line of routine, it is
essential to first determine through a duction test and the
preparation of the diagram exactly which one of the four muscles are
faulty (Fig. 24).
Having determined, with the aid of the diagram, first, the
existence of 6 degrees of exophoria; second, 18 degrees of
adduction; third, a weak left internus—the next procedure is to
determine what degree of prism will enable the patient to obtain
single binocular vision, with both eyes looking “straight.”
To determine this, place both of the Ski-optometer’s rotary prisms
in position with the handle of each pointing outward horizontally. The
red line or indicator of each prism should then be placed at 30° of the
outer scale (Fig. 26).
The red Maddox rod should be horizontally positioned before the
eye, the white line on indicator pointing to 180° of the scale (Fig. 27).
The strength of the rotary prism before the right eye should
thereupon be reduced by rotating the prism indicator or red line
toward the upper zero (0) to a point where the patient first sees the
red streak—assuming that the red line appears at 42 degrees, that is
30 degrees before the left eye and 12 degrees before the right.

Fig. 26 (A and B)—First position of rotary


prisms to determine amount of prism
exercise to be employed for building up
the weak muscle.
The prism should then be still further reduced until the vertical
streak produced by the Maddox rod directly bisects the muscle
testing spot of light. Assuming that this point be thirty-eight degrees,
which is four degrees less, single binocular vision is produced.
Fig. 27—Position of red Maddox rod used
in conjunction with Fig. 26 for prism
exercising.
For example, sixty degrees of prism power (the combined power
of the two rotary prisms) will usually cause complete distortion.
Therefore, as outlined in Figure 28, the patient, seeing only out of
the right eye, will detect nothing but a white light. By gradually
reducing the strength of the prism before the right, which is the good
eye, the patient will eventually see a red streak off to the left. A
continued and gradual reduction to a point where the red streak
bisects the white light, will determine how much prism power is
required for the patient to obtain single binocular vision, thus
establishing the same image at the same time on each fovea or
retina (Fig. 20).
This has taught the patient to do that which he has never before
accomplished. Therefore, after having been taught how to make the
two eyes work in relation to each other, the natural tendency
thereafter will be to strive for the same relationship of vision with
both eyes. The refractionist should then aim to reduce the excessive
amount of prism required to give binocular vision, which can be
accomplished by muscular exercise.
It must always be remembered before the refractionist is ready to
employ the muscular exercise or second method, that the degree of
prism required to give the patient single binocular vision must be
determined with the optical correction in place. The exercise must be
practised daily in routine, a daily record being essential.

An Assumed Case
We will assume a case where 42 degrees is required to enable
the patient to first see the red streak as produced by the Maddox rod
to the extreme left. Through a continued gradual reduction of 4
degrees (or to 38 degrees), we next learn that the streak was carried
over until it bisected the white spot of light, giving single binocular
vision and producing a position of rest.
Fig. 28—Simplified chart showing the
prism action employed in developing a
weak ocular muscle through alternating
prism exercise. Either side of 38° in
excess of 4° causing diplopia.
The patient has now established the limitation of the exercise,
which is four degrees, this limitation being determined by the
difference between the point where the streak was first seen to the
extreme side and where it bisected the spot. The same amount of
four degrees should then be used for the opposite side, thus
reducing the prism strength to 34 degrees.
This again produces diplopia, because of the lesser amount of
prism power employed to give single binocular vision. The
refractionist should then return to 38 degrees, where single binocular
vision had originally been determined (Fig. 28), alternating back to
42, returning to 38, over to 34, back to 38, and so on. This procedure
should be employed once a day just after meals for about five
minutes, and repeated ten times, constantly striving for a slight
reduction of prism power from day to day.

Effect of Muscular Exercise


This muscular treatment, or constructive exercising, should
enable the patient to overcome his amount of four degrees in either
direction in about a week. Hence in the case showing 38 degrees for
single binocular vision, results may be looked for in about nine
weeks—four degrees divided into 38 degrees. While the patient is
undergoing the treatment, which is nothing more than the
strengthening of the interni muscles or developing adduction, it is
natural to believe that the amount of imbalance is likewise being
conquered. This, however, is readily determined from time to time by
making the binocular muscle test with the phorometer and Maddox
rod, as well as the duction chart test (Fig. 24), as previously outlined.
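The length of treatment quoted above follows from a simple division, which may be set down as follows (a minimal sketch; the function name is the writer's own and not part of any instrument's routine):

# Rough estimate of the length of treatment: the prism required for single
# binocular vision divided by the amount overcome per week (here the
# 4-degree exercise limit).
def estimated_weeks(prism_for_single_vision_deg, gain_per_week_deg):
    return prism_for_single_vision_deg / gain_per_week_deg

print(estimated_weeks(38, 4))   # 9.5, which the chapter rounds to about nine weeks
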
To fully appreciate the effect of this muscular treatment, the
reader need only hold his head in a stationary position, casting his
eyes several times from the extreme right to the extreme left, not
failing to note the apparent muscular strain. On the other hand, with
the aid of the Ski-optometer’s rotating prisms, the refractionist not
only has complete control of the patient’s muscles at all times, but
scientifically accomplishes muscular exercise without any tiresome
strain, overcoming all possible exertion.
After the case in question has been reduced to 30 degrees, the
rotary prism before the right eye is no longer needed and may be
removed; the same exercising procedure is then continued with the
remaining left-side rotary prism, reducing its power until it is likewise
down to zero.
When both prisms have been reduced to zero, each should again be
placed in position with the zero graduations vertical and the prism
indicator on the upper zero. Both prisms should then be turned
simultaneously about four degrees toward the nasal side of the
patient, thus tending to jointly force corresponding muscles of both
eyes.

Home Treatment for Muscular Exercise—Square Prism Set Used in Conjunction With the Ski-Optometer
Where a patient is unable to call each day for this muscular
treatment or exercise, the work will be greatly facilitated by
employing a specially designed set of square prisms ranging in
strength from ½ to 20 degrees for home treatment. As in the case
previously cited, it is necessary to carefully instruct the patient that
the interni muscles must be developed, hence prism base out with
apex in must be employed. Attention should then be directed to a
candle light, serving as a muscle testing spot of light and stationed in
a semi-dark room at an approximate distance of twenty feet.
Having determined through the Ski-optometer the strength of the
prism required after each office treatment, its equivalent should then
be placed in a special square prism trial-frame which permits rotation
of the prism, although the patient is frequently taught to twirl the lens
before the eye. This exercise may be continued for about five
minutes each day.
The patient should also be instructed to call at the end of each
week, when the work may be checked by means of the Ski-optometer's
rotary prisms, making the duction test as previously explained and
outlined in Fig. 24. It is then possible to determine whether or not
satisfactory results are being obtained; if they are not, the exercise
should be abandoned.
Should the second method employed in the work of muscular
imbalance not prove effective, the third method requiring the use of
prisms would be next in routine.
Chapter XI
THIRD METHOD OF TREATMENT—PRISM
LENSES

When and How Employed

As stated in the preceding chapter, on ascertaining the failure of
the second muscular treatment or method, prisms are employed
for constant wear. When prism lenses are used, whether the
case is exophoria or esophoria, or right or left hyperphoria, it is
always safe to prescribe one-quarter degree prism for each degree
of prism imbalance for each eye. For example, in a case of 6
degrees of esophoria, a prism of 1½ degree base out should be
prescribed for each eye; or in 6 degrees of exophoria, employ the
same amount of prism, but base in. In right hyperphoria, place the
prism base down before the right eye and up before the left, and vice
versa for left hyperphoria.
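The rule just stated lends itself to a simple computation. The following Python sketch is merely an illustration of that quarter-degree rule and of the base directions given above; the function name and the wording of its output are the writer's own.

# One-quarter degree of prism for each degree of imbalance, for each eye,
# with the base direction determined by the kind of imbalance.
def prism_prescription(kind, imbalance_deg):
    amount = 0.25 * imbalance_deg      # e.g. 6 degrees -> 1 1/2 degrees per eye
    if kind == "esophoria":
        return {"right eye": (amount, "base out"), "left eye": (amount, "base out")}
    if kind == "exophoria":
        return {"right eye": (amount, "base in"), "left eye": (amount, "base in")}
    if kind == "right hyperphoria":
        return {"right eye": (amount, "base down"), "left eye": (amount, "base up")}
    if kind == "left hyperphoria":
        return {"right eye": (amount, "base up"), "left eye": (amount, "base down")}
    raise ValueError("unknown kind of imbalance")

print(prism_prescription("esophoria", 6))
# {'right eye': (1.5, 'base out'), 'left eye': (1.5, 'base out')}
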
It is not always advisable, however, to allow the patient to wear
the same degree of prism for any length of time. Many authorities
suggest a constant change with the idea that a prism is nothing more
than a crutch. Should the same degree be constantly worn, even
though it afforded temporary relief, the eye would become
accustomed to it and the purpose of the prism entirely lost. Prisms
should be prescribed with extreme care, their use being identical
with that of dumb-bells, where weight is first increased to maximum
and subsequently reduced, viz.:

Prism Reduction Method


Where prisms are prescribed, it is considered good practice to
make a binocular muscle test and the duction test (Fig. 24) at the
end of each three months’ period, employing the phorometer,
Maddox rod, and rotary prisms, as already explained.
If the condition shows any decrease, the prism degree should be
proportionately decreased. For example, in the case originally
showing 6 degrees of exophoria, one-quarter degree prism for each
degree of imbalance was prescribed, or 1½ degree for each eye. If
the same case subsequently indicated 4 degrees, only one degree
for each eye should be prescribed—and so on, a gradual reduction
of prism value being constantly sought.
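Numerically, the reduction described above amounts to re-applying the same quarter-degree rule to each new, smaller measurement; a brief illustration follows (the figures are those of the example, the wording the writer's own):

# Re-apply the quarter-degree rule at each three-monthly test as the
# measured exophoria falls.
for measured_deg in (6, 4, 2):
    per_eye = 0.25 * measured_deg
    print(measured_deg, "degrees of exophoria ->", per_eye, "degree base in, each eye")
# 6 -> 1.5, 4 -> 1.0, 2 -> 0.5
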
Except in rare cases, prisms should not be prescribed with the
base or apex at oblique angles, as the eye is rarely at rest with such
a correction. An imbalance may be caused by a false condition in
one rectus and a true imbalance in the other, giving one the
impression that cyclophoria exists, as explained in a following
chapter.
Having now employed the three methods, the refractionist can
readily understand that a marked percentage of muscular imbalance
cases may be directly benefited through the aid of the Ski-optometer.
If these three methods of procedure fail, there is nothing left but the
fourth and last method—that of operative procedure.
Chapter XII
A CONDENSATION OF PREVIOUS CHAPTERS
ON THE PROCEDURE FOR MUSCLE TESTING
WITH THE SKI-OPTOMETER

The present chapter, intended for those desiring a synopsis or
condensed summary of muscular imbalance work, should prove
of the utmost assistance to the busy refractionist. Muscular
imbalance work can be successfully conducted if the following
routine is studied and memorized, with the Ski-optometer constantly
before the reader. The chapters containing the corresponding figures
and diagrams or illustrations will then be readily comprehended. It is
also important to carefully note the captions under each diagram.
1. Without any testing lenses before patient’s eyes, direct
attention to a 20-foot distant muscle testing spot of light (Fig. 9).
2. Place phorometer handle vertically (Fig. 16).
Place red Maddox rod vertically (Fig. 15). Patient should see a
white spot of light, and a red horizontal streak (Fig. 17).
Simply turn phorometer handle until horizontal streak bisects
white spot of light. Pointer then indicates amount of deviation on red
scale. Ignore cases of less than 1° of hyperphoria, whether right or
left (designated R. H. or L. H.).
3. Place phorometer handle horizontally (Fig. 19).
Place red Maddox rod horizontally (Fig. 18). Patient should see a
white spot of light and a vertical red streak (Fig. 20).
Simply turn phorometer handle until red streak bisects spot of
light. Pointer indicates amount of deviation on white scale, whether
esophoria or exophoria (designated Es or Ex).
4. Ignore all exophoria cases, less than 3°.
Ignore all esophoria cases, less than 5°—except in children,
ignore less than 3° of esophoria.
5. Always make the above binocular muscle test—with phorometer
and red Maddox rod—before the optical correction (the test for
spheres and cylinders) and again after the optical correction where
the case shows more than the 1-3-5 rule, to determine whether the
muscles are aggravated or benefited.
6. In cases showing more than the 1-3-5 rule, shown in No. 4
above, make monocular duction test first with rotary prism before
patient's right eye, then with rotary prism before left eye, to find the
faulty muscle and determine which eye is affected.
7. To test adduction, prism base out is required. Rotary prism’s
red line or indicator should be rotated from zero outwardly. To test
abduction, base in is required. Indicator should be rotated inwardly
from zero (Fig. 22). Power of adduction as compared with abduction
is normally 3 to 1—usually rated 24 to 8.
8. To test superduction, base down is required. Rotary prism’s
line or indicator should be rotated downward from zero. To test
subduction, base up is required. Indicator should be rotated upward
from zero. Power of superduction as compared with subduction is
normally equal—usually rated 2 for each (Fig. 23).
9. Direct patient’s attention to largest letter on distant chart,
usually letter “E,” rotating red line indicator of rotary prism outlined in
above No. 7 and No. 8, until diplopia is first procured.
10. The use of a duction chart on a record card quickly
designates the pull for each of the four muscles (Fig. 24), illustrating an
assumed case of—

1st—6D of Exophoria.
2nd—18° adduction (which must be developed to 24°).
3rd—Patient has a left weak internus.
11. Employ First Method—Optical Correction—to effect
treatment.
12. Assuming a case of a child with 6° of esophoria—8° of right
abduction and 2° left abduction indicating a left weak externus,
prescribe a quarter diopter increased plus spherical power for each
degree of imbalance, thus adding +1.50D spherical to optical
correction. This is the first method of treatment. This requires a
thorough reading of Chapter IX on Treatment for Correcting
Esophoria in Children and a careful study of the formula. For
synopsis see Page 74.
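The formula of No. 12 may likewise be set down in a line or two of Python; this is only an illustration of the quarter-diopter rule stated there, with names of the writer's own choosing.

# One-quarter of a diopter of added plus sphere for each degree of esophoria
# found in a child (the assumed case of No. 12: 6 degrees -> +1.50D).
def added_plus_sphere(esophoria_deg):
    return 0.25 * esophoria_deg

print("+%.2fD" % added_plus_sphere(6))   # +1.50D added to the optical correction
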

Four Methods of Treating an Imbalance Case


When the Preceding One Fails
1st—Optical correction;
2nd—Muscular exercise or treatment;
75% are Curable with First and Second Methods.
3rd—Prisms;
5% are Curable with Third Method.
4th—Operation;
20% are Curable with Fourth Method.
13. When first method of treatment fails, Employ Second
Method—Muscular Exercise—to effect treatment.
1st—Find degree of prism patient will accept to produce single
binocular vision with optical correction on, placing both rotary prisms
in position, handles horizontal, red line on 30° of temporal scale of
each, giving a total value of 60° (Fig. 26a and b).
2nd—Also place red Maddox rod before patient’s eye (rods
horizontal) (Fig. 18), calling patient’s attention to usual muscle
testing spot of light.
3rd—Reduce prism before good eye until red streak appears,
noting degree (which we assume shows 42°, the combined total
value of both prisms); slowly continue to decrease prism until streak
bisects spot. Assume this shows a total of 38°. Either side of 38° in
excess of 4° (38 to 42) produces diplopia. Prisms must only be
rotated from 38° to 42° back to 38° over to 34°—back to 38° over to
42°—back again to 38° and so on—exercise to be continued daily
ten times for five minutes (Fig. 28).
4th—At end of each week, duction test should again be made.
Duction chart should show a tendency to reduce exophoria by a
gradual building up of adduction; approximately one week is usually
sufficient to teach patient to hold streak within the spot (between 38°
and 42°). Exercise to be continued until both prisms are worked
down to zero. Exercise tends to teach patient how to establish same
image on each fovea or retina at same time.
5th—If patient is unable to call daily for treatment, employ home
treatment. (Read “Home Treatment for Muscular Exercising,” Page
82).
Employ Third Method—Use of Prisms for Constant Wear to
effect treatment.

Prisms
1st. Where a case cannot be reduced through use of first two
methods, as for example in a case of 6° of exophoria, prescribe ¼ of
amount of imbalance (¼ × 6 = 1½°) for each eye—base in; for
esophoria, base out; for hyperphoria, base down before the affected
eye and base up before the other (as stated in Chapter XI).
2nd. Advise patient to call every three months and make duction
test (Fig. 24). If no improvement in condition, after wearing prisms
six months, operative means is suggested.
Assuming a case is benefited, reduce prism power according to
rule: ¼° of prism for each degree of imbalance.

Cyclophoria
This work being of a technical nature, it is deemed best for the
reader to study Chapters XIII and XIV.
Chapter XIII
CYCLOPHORIA

Made with Maddox Rods and Rotary Prisms

Cyclophoria, a condition affecting the oblique muscles of the
eye, is caused by its rotation. It is detected in the following
manner by the combined use of the red and white Maddox rods
and the rotary prism.
Fig. 29—Position of rotary prism for
producing diplopia in testing cyclophoria
with prism placed at 8° base up.
Darken the room and direct the patient’s attention to the usual
muscle-testing spot of light, located approximately twenty feet away
and on a direct plane with the patient’s eye. The optical correction, if
one is required, should always be left in place—just as in making
other previously described muscle tests.
The rotary prism should then be brought before the patient’s right
eye with the handle pointing upward and with zero graduations
horizontal. The indicator or red line should then be rotated upward
from zero to eight upon the prism scale, creating the equivalent of a
prism of 8 diopters with base up (Fig. 29). This normally caused
