Texts in Computer Science

Series Editors
David Gries, Department of Computer Science, Cornell University, Ithaca, NY,
USA
Orit Hazzan, Faculty of Education in Technology and Science, Technion—Israel Institute of Technology, Haifa, Israel
Titles in this series now included in the Thomson Reuters Book Citation Index!
‘Texts in Computer Science’ (TCS) delivers high-quality instructional content for
undergraduates and graduates in all areas of computing and information science,
with a strong emphasis on core foundational and theoretical material but inclusive
of some prominent applications-related content. TCS books should be reasonably
self-contained and aim to provide students with modern and clear accounts of topics
ranging across the computing curriculum. As a result, the books are ideal for
semester courses or for individual self-study in cases where people need to expand
their knowledge. All texts are authored by established experts in their fields,
reviewed internally and by the series editors, and provide numerous examples,
problems, and other pedagogical tools; many contain fully worked solutions.
The TCS series is comprised of high-quality, self-contained books that have
broad and comprehensive coverage and are generally in hardback format and
sometimes contain color. For undergraduate textbooks that are likely to be more
brief and modular in their approach, require only black and white, and are under
275 pages, Springer offers the flexibly designed Undergraduate Topics in Computer
Science series, to which we refer potential authors.
Tomas Hrycej • Bernhard Bermeitinger •
Matthias Cetto • Siegfried Handschuh

Mathematical Foundations of Data Science

Tomas Hrycej
Institute of Computer Science
University of St. Gallen
St. Gallen, Switzerland

Bernhard Bermeitinger
Institute of Computer Science
University of St. Gallen
St. Gallen, Switzerland

Matthias Cetto
Institute of Computer Science
University of St. Gallen
St. Gallen, Switzerland

Siegfried Handschuh
Institute of Computer Science
University of St. Gallen
St. Gallen, Switzerland
ISSN 1868-0941 ISSN 1868-095X (electronic)


Texts in Computer Science
ISBN 978-3-031-19073-5 ISBN 978-3-031-19074-2 (eBook)
https://doi.org/10.1007/978-3-031-19074-2
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Data Science is a rapidly expanding field of increasing relevance, and there are correspondingly numerous textbooks on the topic. They usually focus on the various Data Science methods. In a growing field, there is a danger that the number of methods grows at a pace that makes it difficult to compare their specific merits and areas of application.
Faced with this avalanche of methods, the user is left alone with the judgment about which method to select. He or she can be helped only if some basic principles, such as fitting a model to data, generalization, and the capabilities of numerical algorithms, are thoroughly explained, independently of the particular methodical approach. Unfortunately, these principles are hardly covered in the available textbooks. This book aims to close this gap.

For Whom Is This Book Written?

This book is appropriate for advanced undergraduate or master’s students in computer science, Artificial Intelligence, statistics or related quantitative subjects, as well as people from other disciplines who want to solve Data Science tasks. Elements of this book can be used earlier, e.g., in introductory courses for Data Science, engineering, and science students who have the required mathematical background.
We developed this book to support a semester course in Data Science, which is
the first course in our Data Science specialization in computer science. To give you
an example of how we use this book in our own lectures, our Data Science course
consists of two parts:
• In the first part, a general framework for solving Data Science tasks is described,
with a focus on facts that can be supported by mathematical and statistical
arguments. This part is covered by this book.
• In the second part of the course, concrete methods from multivariate statistics
and machine learning are introduced. For this part, many well-known Springer
textbooks are available (e.g., those by Hastie and Tibshirani or Bishop), which
are used to accompany this part of the course. We did not intend to duplicate this
voluminous work in our book.


Besides students as the intended audience, we also see a benefit for researchers in the field who want to gain a proper understanding of the mathematical foundations rather than relying on computing experience alone, as well as for practitioners, who will receive mathematical exposure directed at making the causal relationships clear.

What Makes This Book Different?

This book encompasses the formulation of typical tasks as input/output mappings, conditions for successful determination of model parameters with good generalization properties, as well as convergence properties of basic classes of numerical algorithms used for parameter fitting.
In detail, this book focuses on topics such as
• generic type of Data Science task and the conditions for its solvability;
• trade-off between model size and volume of data available for its identification
and its consequences for model parametrization (frequently referred to as
learning);
• conditions to be satisfied for good performance of the model on novel data, i.e.,
generalization; and
• conditions under which numerical algorithms used in Data Science operate and
what performance can be expected from them.
These are fundamental and omnipresent problems of Data Science. They are
decisive for the success of the application, more than a detailed selection of a
computing method. These questions are scarcely, or not at all, treated in other Data
Science and Machine Learning textbooks. Students and many data engineers and
researchers are frequently not aware of these conditions and, neglecting them,
produce suboptimal solutions.
In this book, we did not focus on Data Science technology and methodology
except where it is necessary to explain general principles, because we felt that this
was mostly covered in existing books.
In summary, this textbook is an important addition to all existing Data Science
courses.

Comprehension Checks

In all chapters, important theses are summarized in their own paragraphs. All
chapters have comprehension checks for the students.

Acknowledgments

During the writing of this book, we have greatly benefited from students taking our
course and providing feedback on earlier drafts of the book. We would like to
explicitly mention the help of Jonas Herrmann for thorough reading of the manu-
script. He gave us many helpful hints for making the explanations comprehensible,
in particular from a student’s viewpoint. Further, we want to thank Wayne Wheeler
and Sriram Srinivas from Springer for their support and their patience with us in
finishing the book.
Finally, we would like to thank our families for their love and support.

St. Gallen, Switzerland
September 2022

Tomas Hrycej
Bernhard Bermeitinger
Matthias Cetto
Siegfried Handschuh
Contents

1 Data Science and Its Tasks

Part I Mathematical Foundations

2 Application-Specific Mappings and Measuring the Fit to Data
  2.1 Continuous Mappings
    2.1.1 Nonlinear Continuous Mappings
    2.1.2 Mappings of Probability Distributions
  2.2 Classification
    2.2.1 Special Case: Two Linearly Separable Classes
    2.2.2 Minimum Misclassification Rate for Two Classes
    2.2.3 Probabilistic Classification
    2.2.4 Generalization to Multiple Classes
  2.3 Dynamical Systems
  2.4 Spatial Systems
  2.5 Mappings Received by “Unsupervised Learning”
    2.5.1 Representations with Reduced Dimensionality
    2.5.2 Optimal Encoding
    2.5.3 Clusters as Unsupervised Classes
  References
3 Data Processing by Neural Networks
  3.1 Feedforward and Feedback Networks
  3.2 Data Processing by Feedforward Networks
  3.3 Data Processing by Feedback Networks
  3.4 Feedforward Networks with External Feedback
  3.5 Interpretation of Network Weights
  3.6 Connectivity of Layered Networks
  3.7 Shallow Networks Versus Deep Networks
  References
4 Learning and Generalization
  4.1 Algebraic Conditions for Fitting Error Minimization
  4.2 Linear and Nonlinear Mappings
  4.3 Overdetermined Case with Noise
  4.4 Noise and Generalization
  4.5 Generalization in the Underdetermined Case
  4.6 Statistical Conditions for Generalization
  4.7 Idea of Regularization and Its Limits
    4.7.1 Special Case: Ridge Regression
  4.8 Cross-Validation
  4.9 Parameter Reduction Versus Regularization
5 Numerical Algorithms for Data Science
  5.1 Classes of Minimization Problems
    5.1.1 Quadratic Optimization
    5.1.2 Convex Optimization
    5.1.3 Non-convex Local Optimization
    5.1.4 Global Optimization
  5.2 Gradient Computation in Neural Networks
  5.3 Algorithms for Convex Optimization
  5.4 Non-convex Problems with a Single Attractor
    5.4.1 Methods with Adaptive Step Size
    5.4.2 Stochastic Gradient Methods
  5.5 Addressing the Problem of Multiple Minima
    5.5.1 Momentum Term
    5.5.2 Simulated Annealing
  References

Part II Applications

6 Specific Problems of Natural Language Processing
  6.1 Word Embeddings
  6.2 Semantic Similarity
  6.3 Recurrent Versus Sequence Processing Approaches
  6.4 Recurrent Neural Networks
  6.5 Attention Mechanism
  6.6 Autocoding and Its Modification
  6.7 Transformer Encoder
    6.7.1 Self-attention
    6.7.2 Position-Wise Feedforward Networks
    6.7.3 Residual Connection and Layer Normalization
  References
7 Specific Problems of Computer Vision
  7.1 Sequence of Convolutional Operators
    7.1.1 Convolutional Layer
    7.1.2 Pooling Layers
    7.1.3 Implementations of Convolutional Neural Networks
  7.2 Handling Invariances
  7.3 Application of Transformer Architecture to Computer Vision
    7.3.1 Attention Mechanism for Computer Vision
    7.3.2 Division into Patches
  References

Index
Acronyms

AI Artificial Intelligence
ARMA Autoregressive Moving Average
BERT Bidirectional Encoder Representations from Transformers
CNN Convolutional Neural Network
CV Computer Vision
DL Deep Learning
DS Data Science
FIR Finite Impulse Response
GRU Gated Recurrent Unit
IIR Infinite Impulse Response
ILSVRC ImageNet Large Scale Visual Recognition Challenge
LSTM Long Short-Term Memory Neural Network
MIMO Multiple Input/Multiple Output
MSE Mean Square Error
NLP Natural Language Processing
OOV Out-of-Vocabulary
PCA Principal Component Analysis
ReLU Rectified linear units
ResNet Residual Neural Network
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SISO Single Input/Single Output
SVD Singular value decomposition
SVM Support vector machine

1 Data Science and Its Tasks

As the name Data Science (DS) suggests, it is a scientific field concerned with data.
However, this definition would encompass the whole of information technology.
This is not the intention behind delimiting Data Science as a field. Rather, the focus is on extracting useful information from data.
In the last decades, the volume of processed and digitally stored data has reached
huge dimensions. This has led to a search for innovative methods capable of coping
with large data volumes. A natural analogy is the intelligent information processing performed by higher living organisms. They are supplied with a continuous stream of voluminous sensor data (delivered by senses such as vision, hearing, or the tactile sense) and use this stream for immediate or delayed action favorable to the organism. This fact makes the field of Artificial Intelligence (AI) a natural source of
potential ideas for Data Science. These technologies complement the findings and
methods developed by classical disciplines concerned with data analysis, the most
prominent of which is statistics.
The research subject of Artificial Intelligence (AI) is all aspects of sensing, recog-
nition, and acting necessary for intelligent or autonomous behavior. The scope of
Data Science is similar but focused on the aspects of recognition. Given the data,
collected by sensing or by other data accumulation processes, the Data Science tasks
consist in recognizing patterns that are interesting or important in some defined sense. More concretely, these tasks can take the form of the following variants (but are not limited to them):

• recognizing one of the predefined classes of patterns (a classification task). An example is recognition of an object in a visual scene characterized by image pixel data or determining the semantic meaning of an ambiguous phrase;
• finding a quantitative relationship between some data (a continuous mapping). Such relationships are frequently found in technical and economic data, for example, the dependence of interest rate on the growth rate of domestic product or money supply;
• finding characteristics of data that are substantially more compact than the original data (data compression). A trivial example is characterizing the data about a
population by an arithmetic mean or standard deviation of the weight or height of individual persons. A more complex example is describing the image data by a set of contour edges.

Depending on the character of the task, the data processing may be static or
dynamic. The static variant is characterized by a fixed data set in which a pattern is
to be recognized. This corresponds to the mathematical concept of a mapping: Data
patterns are mapped to their pattern labels. Static recognition is a widespread setting
for image processing, text search, fraud detection, and many others.
With dynamic processing, the recognition takes place on a stream of data provided
continuously in time. The pattern searched can be found only by observing this stream
and its dynamics. A typical example is speech recognition.
Historically, the first approaches to solving these tasks date back several centuries and have been continually developed since. The traditional disciplines have been statistics as well as systems theory, which investigates dynamic system behavior.
These disciplines provide a large pool of scientifically founded findings and meth-
ods. Their natural focus on linear systems results from the fact that these systems are
substantially easier to treat analytically. Although some powerful theory extensions
to nonlinear systems are available, a widespread approach is to treat the nonlinear
systems as locally linear and use linear theory tools.
AI has passed through several phases. Its origins in the 1950s focused on simple learning principles, mimicking basic aspects of the behavior of biological neuron cells. The information to be processed has been represented by real-valued vectors. The corresponding computing procedures belong to the domain of numerical mathematics. The complexity of algorithms has been limited by the computing power of the information processing devices available at that time. The typical tasks solved have been simple classification problems encompassing the separation of two classes.
Limitations of this approach with the given information processing technology have led to an alternative view: logic-based AI. Instead of focusing on sensor information, logical statements and, correspondingly, logically sound conclusions have been investigated. Such data represented some body of knowledge, motivating the label knowledge-based for this approach. The software systems for such processing have been labeled “expert systems” because of the necessity of encoding expert knowledge in an appropriate logical form.
This field has reached a considerable degree of maturity in machine processing of
logic statements. However, the next obstacle had to be surmounted. The possibility of
describing a real world in logic terms showed its limits. Many relationships important
for intelligent information processing and behavior turned out to be too diffuse for
the unambiguous language of logic. Although some attempts to extend the logic
by probabilistic or pseudo-probabilistic attributes (fuzzy logic) delivered applicable
results, the next change of paradigm has taken place.
With the fast increase of computing power, also using interconnected computer
networks, the interest in the approach based on numerical processing of real-valued
data revived. The computing architectures are, once more, inspired by neural systems
of living organisms. In addition to the huge growth of computing resources, this phase
1 Data Science and Its Tasks 3

is characterized by more complex processing structures. Frequently, they consist of


a stack of multiple subsequent processing layers. Such computing structures are
associated with the recently popular notion of Deep Learning (DL).
The development of the corresponding methods has mostly been spontaneous and
application driven. It has also taken place in several separate scientific communities,
depending on their respective theoretical and application focus: computer scientists,
statisticians, biologists, linguists as well as engineers of systems for image process-
ing, speech processing, and autonomous driving. For some important applications
such as Natural Language Processing (NLP) or Computer Vision (CV), many trials
for solutions have been undertaken followed by equally numerous failures.
It would be exaggerated to characterize the usual approach as a “trial-and-error”
approach. However, so far, no unified theory of the domain has been developed.
Also, some popular algorithms and widely accepted recommendations for their use
have not reached maturity in implementation and theoretical foundations. This
motivates the need for a review of mathematical principles behind the typical Data
Science solutions, for the user to be able to make appropriate choices and to avoid
failures caused by typical pitfalls.
Such a basic review is done in the following chapters of this work. Rather than
attempting to provide a theory of Data Science (DS) (which would be a very ambitious
project), it compiles mathematical concepts useful in looking for DS solutions. These
mathematical concepts are also helpful in understanding which configurations of data
and algorithms have the best chance of success. Rather than presenting a long list of alternative methods, the focus is on choices common to many algorithms.
What is adopted is the view of someone facing a new DS application. The questions
that immediately arise are as follows:

• What type of generic task is this (forecast or classification, static or dynamic system, etc.)?
• What are the requirements on appropriate data concerning their choice and quantity?
• What are the conditions for generalization to unseen cases, and what are their consequences for dimensioning the task?
• Which algorithms have the largest potential for good solutions?

The authors hope to present concise and transparent answers to these questions
wherever allowed by the state of the art.
Part I
Mathematical Foundations
2 Application-Specific Mappings and Measuring the Fit to Data

Information processing algorithms consist of receiving input data and computing output data from them. On a certain abstraction level, they can be generally described by some kind of mapping of input data to output data. Depending on the software type and application, this can be more or less explicit. A dialogue software receives its input data successively and sometimes in dependence on previous input and output. By contrast, the input of a weather forecast software is a set of measurements from which the forecast (the output data) is computed by a mathematical algorithm. The latter case is closer to the common idea of a mapping in the sense of a mathematical function that delivers a function value (possibly a vector) from a vector of arguments. Depending on the application, it may be appropriate to call the input and output vectors patterns.
In this sense, most DS applications amount to determining some mapping of an
input pattern to an output pattern. In particular, the DS approach consists in gaining this mapping in an inductive manner from large data sets.
Both input and output patterns are typically described by vectors. What is sought
is a vector mapping
y = f (x) (2.1)
assigning an output vector y to a vector x.
This mapping may be arbitrarily complex but some types are easily tractable while
others are more difficult. The simplest type of mapping is linear:
y = Bx (2.2)
The linear mapping is the most thoroughly investigated type, providing a voluminous theory concerning its properties. Nevertheless, its limitations induced the interest in nonlinear alternatives, the most recent one being neural networks, with growing popularity and application scope.
The approach typical for DS is looking for a mapping that fits to the data from
some data collection. This fitting is done with the help of a set of variable parameters
whose values are determined so that the fit is the best or even exact. Every mapping
of type (2.1) can be written as
y = f (x, w) (2.3)

with a parameter vector w. For linear mappings of type (2.2), the parameter vector
w consists of the elements of matrix B.
There are several basic application types with their own interpretation of the
mapping sought. The task of fitting a mapping of a certain type to the data requires a
measure of how good this fit is. An appropriate definition of this measure is important
for several reasons:

• In most cases, a perfect fit with no deviation is not possible. To select from alter-
native solutions, comparing the values of fit measure is necessary.
• For optimum mappings of a simple type such as linear ones, analytical solutions
are known. Others can only be found by numerical search methods. To control the
search, repeated evaluation of the fit measure is required.
• The most efficient search methods require smooth fit measures with existing or
even continuous gradients, to determine the search direction where the chance for
improvement is high.

For some mapping types, these two groups of requirements are difficult to meet
in a single fit measure.
There are also requirements concerning the correspondence of the fit measure
appropriate from the viewpoint of the task on one hand and of that used for (mostly
numerical) optimization on the other hand:

• The very basic requirement is that both fit measures should be the same. This
seemingly trivial requirement may be difficult to satisfy for some tasks such as
classification.
• It is desirable that a perfect fit leads to a zero minimum of the fit measure. This is
also not always satisfied, for example, with likelihood-based measures. Difficulties in satisfying these requirements frequently lead to using different measures for the
search on one hand and for the evaluation of the fit on the other hand. In such
cases, it is preferable if both measures have at least a common optimum.

These are the topics of the following sections.

2.1 Continuous Mappings

The most straightforward application type is using the mapping as what it mathe-
matically is: a mapping of real-valued input vectors to equally real-valued output
vectors. This type encompasses many physical, technical, and econometric applica-
tions. Examples of this may be:

Fig. 2.1 Error functions

• Failure rates (y) determined from operation time and conditions of a component
(x).
• Credit scoring, mapping the descriptive features (x) of the credit recipient to a
number denoting the creditworthiness (y).
• Macroeconomic magnitudes such as inflation rate (y) estimated from others such
as unemployment rate and economic growth (x).

If a parameterized continuous mapping is to be fitted to data, the goal of fitting is to minimize the deviation between the true values y and the estimated values f (x, w). So, the basic version of the fitting error is

e = y - f(x, w)    (2.4)

It is desirable for this error to be a small positive or negative number. In other words, it is its absolute value |e| that is to be minimized for all training examples. This error function is depicted in Fig. 2.1 and labeled absolute error.
An obvious property of this error function is its lack of smoothness. Figure 2.2 shows its derivative: it does not exist at e = 0 and makes a discontinuous step at that position.
This is no problem from the application’s view. However, a discontinuous first derivative is strongly adverse for the best-converging numerical algorithms that have the potential to be used as fitting or training algorithms. It is also disadvantageous for analytical treatment. The error minimum can sometimes be determined analytically by seeking solutions with a zero derivative, but equations containing discontinuous functions are difficult to solve. Nevertheless, the absolute value is used as an error measure in applications with special requirements such as enhanced robustness against data outliers.
Numerical tractability is the reason why a preferred form of error function is the square error e^2, also shown in Fig. 2.1. Its derivative (Fig. 2.2) is not only continuous but even linear, which makes its analytical treatment particularly easy.
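As a minimal numerical sketch of this difference (assuming NumPy; the values and names are illustrative only):

import numpy as np

e = np.linspace(-2.0, 2.0, 9)      # fitting errors around zero

abs_error = np.abs(e)              # absolute error |e|
sq_error = e ** 2                  # square error e^2

# Derivatives with respect to e: d|e|/de is the sign function (it jumps from -1
# to +1 and does not exist at e = 0), while d(e^2)/de = 2e is continuous and linear.
abs_grad = np.sign(e)
sq_grad = 2.0 * e

for row in zip(e, abs_error, abs_grad, sq_error, sq_grad):
    print("e=%5.2f  |e|=%4.2f  d|e|/de=%5.2f  e^2=%4.2f  d(e^2)/de=%5.2f" % row)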

Fig. 2.2 Derivatives of error functions

For a vector mapping f (x, w), the error (2.4) is a column vector. The vector product e'e is the sum of the squares of the errors of the individual output vector elements. Summing these errors over K training examples results in the error measure

E = \sum_{k=1}^{K} e_k' e_k = \sum_{k=1}^{K} \sum_{m=1}^{M} e_{mk}^2    (2.5)

Different scaling of individual elements of the vector patterns can make a diagonal matrix of scaling weights S = diag(s_1, ..., s_M) appropriate. Also, some training examples may be more important than others, which can be expressed by additional weights r_k. The error measure (2.5) then has the generalized form

E = \sum_{k=1}^{K} e_k' S e_k r_k = \sum_{k=1}^{K} \sum_{m=1}^{M} e_{mk}^2 s_m r_k    (2.6)
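A minimal NumPy sketch of the error measures (2.5) and (2.6), with the scaling weights s and example weights r as assumed inputs:

import numpy as np

def weighted_mse(Y, Y_hat, s=None, r=None):
    """Error measures (2.5)/(2.6): rows of Y and Y_hat are the K training examples."""
    E = Y - Y_hat                          # errors e_k as rows, shape (K, M)
    K, M = E.shape
    s = np.ones(M) if s is None else s     # per-output scaling weights s_m
    r = np.ones(K) if r is None else r     # per-example weights r_k
    # sum_k r_k * sum_m s_m * e_mk^2
    return float(np.sum(r[:, None] * (E ** 2) * s[None, :]))

# toy usage
Y     = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.5]])
Y_hat = np.array([[0.8, 2.1], [0.2, 0.9], [2.7, 0.7]])
print(weighted_mse(Y, Y_hat))                           # plain sum of squares (2.5)
print(weighted_mse(Y, Y_hat, s=np.array([1.0, 2.0]),
                   r=np.array([1.0, 1.0, 0.5])))        # generalized form (2.6)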

2.1.1 Nonlinear Continuous Mappings

For linear mappings (2.2), explicit solutions for reaching zero in the error measure
(2.5) and (2.6) are known. Their properties have been thoroughly investigated and
some important aspects are discussed in Chap. 4. Unfortunately, most practical ap-
plications deviate to a greater or lesser extent from the linearity assumption. Good
analytical tractability may be a good motivation to accept a linear approximation if
the expected deviations from the linearity assumption are not excessive. However, a
lot of applications will not allow such approximation. Then, some nonlinear approach
is to be used.
Modeling nonlinearities in the mappings can be done in two ways that strongly differ in their application.

The first approach preserves linearity in parameters. The mapping (2.3) is ex-
pressed as
y = Bh (x) (2.7)
with a nonparametric function h (x) which plays the role of the input vector x itself.
In other words, h (x) can be substituted for x in all algebraic relationships valid for
linear systems. This includes also explicit solutions for Mean Square Errors (MSEs)
(2.5) and (2.6).
The function h (x) can be an arbitrary function but a typical choice is a polynomial
in vector x. This is motivated by the well-known Taylor expansion of an arbitrary
multivariate function [7]. This expansion enables an approximation of a multivariate
function by a polynomial of a given order on an argument interval, with known error
bounds.
For a vector x with two elements x_1 and x_2, a quadratic polynomial is

h([x_1 \; x_2]) = [1 \;\; x_1 \;\; x_2 \;\; x_1^2 \;\; x_2^2 \;\; x_1 x_2]    (2.8)

For a vector x with three elements x_1, x_2, and x_3, it is already as complex as follows:

h([x_1 \; x_2 \; x_3]) = [1 \;\; x_1 \;\; x_2 \;\; x_3 \;\; x_1^2 \;\; x_2^2 \;\; x_3^2 \;\; x_1 x_2 \;\; x_1 x_3 \;\; x_2 x_3]    (2.9)

For a vector x of length N, the length of the vector h(x) is

1 + N + N + \frac{(N-1)N}{2} = 1 + \frac{N^2 + 3N}{2}    (2.10)
For a polynomial of order p, the size of vector h (x) grows with the pth power of
N . This is the major shortcoming of the polynomial approach for typical applications
of DS where input variable numbers of many thousands are common. Already with
quadratic polynomials, the input width would increase to millions and more.
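The growth can be checked by counting the monomials directly; a short sketch using only the Python standard library (the quadratic case reproduces the count 1 + (N^2 + 3N)/2 from (2.10)):

from math import comb

def poly_feature_count(N, p):
    """Number of monomials of degree at most p in N variables, including the constant term."""
    return comb(N + p, p)

# For p = 2 this equals 1 + (N^2 + 3N)/2 as in (2.10); p = 3 grows even faster.
for N in (2, 3, 10, 1000, 10000):
    print(N, poly_feature_count(N, 2), poly_feature_count(N, 3))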
Another disadvantage is the growth of higher polynomial powers outside of the
interval covered by the training set—a minor extrapolation may lead to excessively
high output values.
So, modeling the multivariate nonlinearities represented by polynomials is practi-
cal only for low-dimensional problems or problems in which it is justified to refrain
from taking full polynomials (e.g., only powers of individual scalar variables). With
such problems, it is possible to benefit from the existence of analytical optima and
statistically well-founded statements about the properties of the results.
These properties of parameterized mappings linear in parameters have led to
the high interest in more general approximation functions. They form the second
approach: mappings nonlinear in parameters. A prominent example are neural net-
works, discussed in detail in Chap. 3. In spite of intensive research, practical state-
ments about their representational capacity are scarce and overly general, although
there are some interesting concepts such as Vapnik–Chervonenkis dimension [21].
Neural networks with bounded activation functions such as sigmoid do not exhibit
the danger of unbounded extrapolation. They frequently lead to good results if the
number of parameters scales linearly with the input dimension, although the optimal-
ity or appropriateness of their size is difficult to show. Determining their optimum
size is frequently a result of lengthy experiments.

Fitting neural networks to data is done numerically because of missing analytical solutions. This makes the use of well-behaved error functions such as the MSE particularly important.

2.1.2 Mappings of Probability Distributions

Minimizing the MSE (2.5) or (2.6) leads to a mapping making a good (or even
perfect, in the case of a zero error) forecast of the output vector y. This corresponds
to the statistical concept of point estimation of the expected value of y.
In the presence of an effect unexplained by the input variables or of some type of noise,
the true values of the output will usually not be exactly equal to their expected values.
Rather, they will fluctuate around these expected values according to some probability
distribution. If the scope of these fluctuations is different for different input patterns
x, the knowledge of the probability distribution may be of crucial interest for the
application. In this case, it would be necessary to determine a conditional probability
distribution of the output pattern y conditioned on the input pattern x
g (y | x) (2.11)
If the expected probability distribution type is parameterized by parameter vector
p, then (2.11) extends to
g (y | x, p) (2.12)
From the statistical viewpoint, the input/output mapping (2.3) maps the input
pattern x directly to the point estimator of the output pattern y. However, we are
free to adopt a different definition: input pattern x can be mapped to the conditional
parameter vector p of the distribution of output pattern y. This parameter vector
has nothing in common with the fitted parameters of the mapping—it consists of
parameters that determine the shape of a particular probability distribution of the
output patterns y, given an input pattern x. After the fitting process, the conditional
probability distribution (2.12) becomes
g (y, f (x, w)) (2.13)
It is an unconditional distribution of output pattern y with distribution parameters
determined by the function f (x, w). The vector w represents the parameters of the
mapping “input pattern x ⇒ conditional probability distribution parameters p” and
should not be confused with the distribution parameters p themselves. For example,
in the case of mapping f () being represented by a neural network, w would corre-
spond to the network weights. Distribution parameters p would then correspond to
the activation of the output layer of the network for a particular input pattern x.
This can be illustrated on the example of a multivariate normal distribution with a mean vector m and a covariance matrix C. The distribution (2.12) becomes

g(y \mid x, p) = N(m(x), C(x)) = \frac{1}{\sqrt{(2\pi)^N |C(x)|}} \, e^{-\frac{1}{2} (y - m(x))' C(x)^{-1} (y - m(x))}    (2.14)

The vector y can, for example, represent the forecast of temperature and humidity
for the next day, depending on today’s meteorological measurements x. Since the
point forecast would scarcely hit tomorrow’s state and thus be of limited use, it
will be substituted by the forecast that the temperature/humidity vector is expected
to have the mean m (x) and the covariance matrix C (x), both depending on today’s
measurement vector x. Both the mean vector and the elements of the covariance ma-
trix together constitute the distribution parameter vector p in (2.12). This parameter
vector depends on the vector of meteorological measurements x as in (2.13).
What remains is to choose an appropriate method to find the optimal mappings
m (x) and C (x) which depend on the input pattern x. In other words, we need
some optimality measure for the fit, which is not as simple as in the case of point
estimation with its square error. The principle widely used in statistics is that of
maximum likelihood. It consists of selecting distribution parameters (here: m and C)
such that the probability density value for the given data is maximum.
For a training set pattern pair (x_k, y_k), the probability density value is

\frac{1}{\sqrt{(2\pi)^N |C(x_k)|}} \, e^{-\frac{1}{2} (y_k - m(x_k))' C(x_k)^{-1} (y_k - m(x_k))}    (2.15)

For independent samples (x_k, y_k), the likelihood of the entire training set is the product

\prod_{k=1}^{K} \frac{1}{\sqrt{(2\pi)^N |C(x_k)|}} \, e^{-\frac{1}{2} (y_k - m(x_k))' C(x_k)^{-1} (y_k - m(x_k))}    (2.16)

The maximum likelihood solution consists in determining the mappings m(x) and C(x) such that the whole product term (2.16) is maximum.
The exponential term in (2.16) suggests that taking the logarithm of the expression
may be advantageous, converting the product to a sum over training patterns. A
negative sign would additionally lead to a minimizing operation, consistently with
the convention for other error measures which are usually minimized. The resulting
term

\sum_{k=1}^{K} \left[ \frac{N}{2} \ln 2\pi + \frac{1}{2} \ln |C(x_k)| + \frac{1}{2} (y_k - m(x_k))' C(x_k)^{-1} (y_k - m(x_k)) \right]    (2.17)
can be simplified by rescaling and omitting constants to

\sum_{k=1}^{K} \left[ \ln |C(x_k)| + (y_k - m(x_k))' C(x_k)^{-1} (y_k - m(x_k)) \right]    (2.18)
In the special case of known covariance matrices C (xk ) independent from the
input pattern xk , the left term is a sum of constants and (2.18) reduces to a generalized
MSE, an example of which is the measure (2.6). The means m (xk ) are then equal to
the point estimates received by minimizing the MSE.
More interesting is the case of unknown conditional covariance matrices C(x_k). A form advantageous for computations is based on the following algebraic laws:

• Every symmetric positive definite matrix such as C(x_k)^{-1} can be expressed as a product of a lower triangular matrix L and its transpose L', that is, C(x_k)^{-1} = L(x_k) L(x_k)'.
• The determinant of a lower triangular matrix L is the product of its diagonal elements.
• The determinant of L L' is the square of the determinant of L.
• The inverse L^{-1} of a lower triangular matrix L is a lower triangular matrix, and its determinant is the reciprocal value of the determinant of L.

The expression (2.18) to be minimized becomes

\sum_{k=1}^{K} \left[ -2 \sum_{m=1}^{M} \ln(l_{mm}(x_k)) + (y_k - m(x_k))' L(x_k) L(x_k)' (y_k - m(x_k)) \right]    (2.19)

The mapping f(x, w) of (2.13) delivers for every input pattern x_k the mean vector m(x_k) and the lower triangular matrix L(x_k) of structure

\begin{bmatrix} l_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ l_{M1} & \cdots & l_{MM} \end{bmatrix}    (2.20)

So, for input pattern x_k, the output vector y_k is forecast to have the distribution

\frac{1}{\sqrt{(2\pi)^N \prod_{m=1}^{M} l_{mm}^{-2}(x_k)}} \, e^{-\frac{1}{2} (y_k - m(x_k))' L(x_k) L(x_k)' (y_k - m(x_k))}    (2.21)

If the mapping f(x, w) is represented, for example, by a neural network, the output layer of the network is trained to minimize the negative log-likelihood (2.19), with the tuple (m(x_k), L(x_k)) extracted from the corresponding elements of the output layer activation vector.
For higher dimensions M of the output pattern vector y, the triangular matrix L has a number of entries growing with the square of M. Only if mutual independence of the individual output variables can be assumed does L become diagonal, with a number of nonzero elements equal to M.
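A minimal sketch of one summand of the criterion (2.19), assuming NumPy and that the model output for a pattern x_k has already been arranged into the mean vector m and the lower triangular factor L (the values below are hypothetical, for illustration only):

import numpy as np

def gaussian_nll_term(y, m, L):
    """One summand of (2.19): -2 * sum(log(diag(L))) + (y - m)' L L' (y - m).

    L is the lower triangular factor of the inverse covariance, C^{-1} = L L',
    so its diagonal entries must be positive.
    """
    resid = y - m
    z = L.T @ resid                              # then z'z = resid' L L' resid
    return -2.0 * np.sum(np.log(np.diag(L))) + float(z @ z)

# hypothetical 2-dimensional output
y = np.array([1.0, 0.5])
m = np.array([0.8, 0.7])
L = np.array([[1.2, 0.0],
              [0.3, 0.9]])
print(gaussian_nll_term(y, m, L))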
There are few concepts alternative to the multivariate normal distribution if mutual dependencies are essential (in our case, mutual dependencies within the output pattern vector y). A general approach has been presented by Stützle and Hrycej [19]. However, if the independence assumption is justified, arbitrary univariate distributions can be used with specific parameters for each output variable y_m. For example, for modeling the time to failure of an engineering component, the Weibull distribution [22] is frequently used, with the density function

g(y) = \frac{\beta}{\eta} \left( \frac{y}{\eta} \right)^{\beta - 1} e^{-\left( \frac{y}{\eta} \right)^{\beta}}    (2.22)

We are then seeking the parameter pair (β(x), η(x)), depending on the input pattern x, such that the log-likelihood over the training set

\sum_{k=1}^{K} \left[ \ln \frac{\beta(x_k)}{\eta(x_k)} + (\beta(x_k) - 1) \ln \frac{y_k}{\eta(x_k)} - \left( \frac{y_k}{\eta(x_k)} \right)^{\beta(x_k)} \right]
= \sum_{k=1}^{K} \left[ \ln \beta(x_k) - \beta(x_k) \ln \eta(x_k) + (\beta(x_k) - 1) \ln y_k - \left( \frac{y_k}{\eta(x_k)} \right)^{\beta(x_k)} \right]    (2.23)

is maximum. The parameter pair can, for example, be the output layer (of size 2) activation vector

\begin{bmatrix} \beta & \eta \end{bmatrix} = f(x, w)    (2.24)
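A small sketch of the summed log-likelihood terms of (2.23), assuming NumPy arrays beta and eta holding the per-pattern distribution parameters delivered by the model, and y the observed times (toy numbers, for illustration only):

import numpy as np

def weibull_loglik(y, beta, eta):
    """Sum over the training set of the log-likelihood terms in (2.23)."""
    return float(np.sum(np.log(beta) - beta * np.log(eta)
                        + (beta - 1.0) * np.log(y)
                        - (y / eta) ** beta))

# three observed failure times with per-pattern parameters beta(x_k), eta(x_k)
y    = np.array([120.0,  80.0, 200.0])
beta = np.array([  1.5,   1.2,   2.0])
eta  = np.array([150.0, 100.0, 180.0])
print(weibull_loglik(y, beta, eta))   # to be maximized over the mapping parameters w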

2.2 Classification

A classification problem is characterized by assigning every pattern a class out of a predefined class set. Such problems are frequently encountered whenever the result
of a mapping is to be assigned some verbal category. Typical examples are

• images in which the object type is sought (e.g., a face, a door, etc.);
• radar signature assigned to flying objects;
• object categories on the road or in its environment during autonomous driving.

Sometimes, the classes are only discrete substitutes for a continuous scale. Dis-
crete credit scores such as “fully creditworthy” or “conditionally creditworthy” are
only distinct values of a continuous variable “creditworthiness score”. Also, many
social science surveys classify the answers to “I fully agree”, “I partially agree”, “I
am indifferent”, “I partially disagree”, and “I fully disagree”, which can be mapped
to a continuous scale, for example [−1, 1]. Generally, this is the case whenever the
classes can be ordered in an unambiguous way.
Apart from this case with inherent continuity, the classes may be an order-free
set of exclusive alternatives. (Nonexclusive classifications can be viewed as separate
tasks—each nonexclusive class corresponding to a dichotomy task “member” vs.
“nonmember”.) For such class sets, a basic measure of the fit to a given training or test
set is the misclassification error. The misclassification error for a given pattern may
be defined as a variable equal to zero if the classification by the model corresponds to
the correct class and equal to one if it does not. More generally, assigning the object
with the correct class i erroneously to the class j is evaluated by a nonnegative real
number called loss L_ij. The loss of a correct class assignment is L_ii = 0.
The so-defined misclassification loss is a transparent measure, frequently directly
reflecting application domain priorities. By contrast, it is less easy to make it opera-
tional for fitting or learning algorithms. This is due to its discontinuous character—a
class assignment can only be correct or wrong. So far, solutions have been found
only for special cases.

Fig. 2.3 Two classes with linear separation

This discontinuity represents a difficulty when searching for optimal classification mappings. For continuous mappings, there is the comfortable situation that a fit
measure such as MSE is also one that can be directly used in numerical optimization.
Solving the optimization problem is identical with solving the fitting problem. For
classification tasks, this comfort cannot be enjoyed. What we want to reach is not
always identical with what we can efficiently optimize. To bridge this gap, various
approaches have been proposed. Some of them are sketched in the following sections.

2.2.1 Special Case: Two Linearly Separable Classes

Let us consider a simple problem with two classes and two-dimensional patterns
[x1 , x2 ] as shown in Fig. 2.3. The points corresponding to Class 1 and Class 2
can be completely separated by a straight line, without any misclassification. This
is why such classes are called linearly separable. The attainable misclassification
error is zero.
The existence of a separating line guarantees the possibility to define regions in
the pattern vector space corresponding to individual classes. What is further needed
is a function whose value would indicate the membership of a pattern in a particular
class. Such a function for the classes of Fig. 2.3 is shown in Fig. 2.4. Its value is unity
for patterns from Class 1 and zero for those from Class 2.
Unfortunately, this function has properties disadvantageous for treatment by nu-
merical algorithms. It is discontinuous along the separating line and has zero gradient
elsewhere. This is why it is usual to use an indicator function of type shown in Fig. 2.5.
It is a linear function of the pattern variables. The patterns are assigned to Class
1 if this function is positive and to Class 2 otherwise.
Many or even most class pairs cannot be separated by a linear hyperplane. It
is not easy to determine whether they can be separated by an arbitrary function if the

Fig. 2.4 Separating function

Fig. 2.5 Continuous separating function

family of these functions is not fixed. However, some classes can be separated by
simple surfaces such as quadratic ones. An example of this is given in Fig. 2.6. The
separating curve corresponds to the points where the separating function of Fig. 2.7
intersects the plane with y = 0.
The discrete separating function such as that of Fig. 2.4 can be viewed as a nonlinear step function of the linear function of Fig. 2.5, that is,

s(b'x) = \begin{cases} 1 & \text{for } b'x \ge 0 \\ 0 & \text{for } b'x < 0 \end{cases}    (2.25)

Fig. 2.6 Quadratic separation

Fig. 2.7 Quadratic separation function

To avoid explicitly mentioning the absolute term, it will be assumed that the last element of the input pattern vector x is equal to unity, so that

b'x = \begin{bmatrix} b_1 & \cdots & b_{N-1} & b_N \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_{N-1} \\ 1 \end{bmatrix} = \begin{bmatrix} b_1 & \cdots & b_{N-1} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_{N-1} \end{bmatrix} + b_N
The misclassification sum for a training set with input/output pairs (x_k, y_k) is equal to

E = \sum_{k=1}^{K} \left( s(b' x_k) - y_k \right)^2    (2.26)

Here, yk is the class indicator of the kth training pattern with values 0 or 1. For
most numerical minimization methods for error functions E, the gradient of E with
regard to parameters b is required to determine the direction of descent towards low
values of E. The gradient is

\frac{\partial E}{\partial b} = 2 \sum_{k=1}^{K} \left( s(b' x_k) - y_k \right) \frac{ds}{dz} \, x_k    (2.27)
with z being the argument of function s (z).
However, the derivative of the nonlinear step function (2.25) is zero everywhere
except for the discontinuity at z = 0 where it does not exist. To receive a useful
descent direction, the famous perceptron rule [16] has used a gradient modification.
This pioneering algorithm iteratively updates the weight vector b in the direction of
the (negatively taken) modified gradient

\frac{\partial E}{\partial b} = \sum_{k=1}^{K} \left( s(b' x_k) - y_k \right) x_k    (2.28)

This modified gradient can be viewed as (2.27) with ds/dz substituted by unity (the derivative of the linear function s(z) = z). Taking a continuous gradient approxima-
tion is an idea used by optimization algorithms for non-smooth functions, called
subgradient algorithms [17].
The algorithm using the perceptron rule converges to zero misclassification rate
if the classes, as defined by the training set, are separable. Otherwise, convergence
is not guaranteed.
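A compact sketch of the perceptron rule in this notation (assuming NumPy; X is a pattern matrix whose last column is the constant 1, and y holds the class indicators 0 or 1; the data are toy values):

import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Iterate the perceptron rule: move b against the modified gradient (2.28)."""
    b = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        s = (X @ b >= 0).astype(float)        # step function s(b'x_k) from (2.25)
        errors = s - y                        # s(b'x_k) - y_k
        if not errors.any():                  # zero misclassifications: separated
            break
        b -= eta * errors @ X                 # b <- b - eta * sum_k (s(b'x_k) - y_k) x_k
    return b

# two separable classes in the plane, last column is the constant 1
X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
b = perceptron_train(X, y)
print(b, (X @ b >= 0).astype(int))            # learned weights and predicted classes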
An error measure focusing on critical patterns in the proximity of the separating
line is used by the approach called the support vector machine (SVM) [2]. This
approach is looking for a separating line with the largest orthogonal distance to the
nearest patterns of both classes. In Fig. 2.8, the separating line is surrounded by
the corridor defined by two boundaries against both classes, touching the respective
nearest points. The goal is to find a separating line for which the width of this corridor
is the largest. In contrast to the class indicator of Fig. 2.4 (with unity for Class 1
and zero for Class 2), the support vector machine rule is easier to represent with a
symmetric class indicator y equal to 1 for one class and to −1 for another one. With
this class indicator and input pattern vector containing the element 1 to provide for the
absolute bias term, the classification task is formulated as a constrained optimization
task with constraints

y_k \, b' x_k \ge 1    (2.29)
If these constraints are satisfied, the product b'x_k is always at least 1 for Class 1 and at most −1 for Class 2.
The separating function b'x of (2.29) is a hyperplane crossing the x_1/x_2-coordinate plane at the separating line (red line in Fig. 2.8). At the boundary lines, b'x is equal to constants larger than 1 (boundary of Class 1) and smaller than −1 (boundary of Class 2). However, there are infinitely many such separating functions. In the

Fig. 2.8 Separating principle of a SVM

Fig. 2.9 Alternative separating functions—cross-sectional view

cross section perpendicular to the separating line (i.e., viewing the x1 /x2 -coordinates
plane “from aside”), they may appear as in Fig. 2.9.
There are infinitely many such hyperplanes (appearing as dotted lines in the cross section of Fig. 2.9), some of which become very “steep”. The most desirable variant would be the one exactly touching the critical points of both classes at a unity “height” (solid line). This is why the optimal solution of the SVM is the one with the minimum norm of vector b:

\min_b \|b\|    (2.30)


The vector norm is a quadratic function of vector elements. So, the constraints
(2.29) together with the objective function (2.30) constitute a quadratic minimization
problem with constraints, solvable with modern numerical methods. Usually, the dual
form having the same optimum as the problem in (2.29) and (2.30) is solved.
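The primal problem (2.29)–(2.30) can be sketched directly as a small constrained minimization (assuming SciPy and toy data; practical SVM solvers work on the dual form instead):

import numpy as np
from scipy.optimize import minimize

# toy data: last column of X is the constant 1, class indicators y in {+1, -1}
X = np.array([[ 2.0,  2.0, 1.0],
              [ 3.0,  1.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

objective = lambda b: 0.5 * b @ b                                   # minimum norm of b, (2.30)
constraints = {"type": "ineq", "fun": lambda b: y * (X @ b) - 1.0}  # y_k b'x_k >= 1, (2.29)

res = minimize(objective, x0=np.zeros(X.shape[1]),
               constraints=[constraints], method="SLSQP")
b = res.x
print(b)               # separating hyperplane parameters
print(y * (X @ b))     # each value should be >= 1 (equal to 1 for the support vectors)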
Both the perceptron rule and the SVM are originally designed for linearly sepa-
rable classes. In this case, the optimum corresponds to the perfect separation and no
misclassification occurs. With linearly separable classes, the measure of success is
simple: “separated” (successful fit) and “non-separated” (failing to fit). The absence
of intermediary results makes the problem of discontinuous misclassification error
or loss irrelevant—every separation is a full success.

2.2.2 Minimum Misclassification Rate for Two Classes

Unfortunately, separability or even linear separability is rather scarce in real-world classification problems. Then, the minimization of the inevitable misclassification
loss is the genuine objective. The perceptron rule and the SVM have extensions for
non-separable classes but they do not perform genuine misclassification minimiza-
tion although the results may be acceptable.
The group of methods explicitly committed to this goal is found in statistical discriminant analysis. The principle behind the typically applied approach is to determine the probability of a pattern vector being a member of a certain class. If these probabilities are known (or estimated in a justified way) for all classes in question, it is possible to choose the class with the highest probability. If the probability that a pattern is a member of the ith class is P_i, the probability of being assigned a false class j ≠ i is 1 − P_i. If every pattern is assigned to the class with the highest probability, the probability of misclassification (which is proportional to the misclassification error) is at its minimum.
With the knowledge of the probability distribution of the patterns of each class, this assessment can be made. In other words, the classification problem is "reduced" to the task of assessing these probability distributions for all classes. The quotation marks around "reduced" suggest that this task is not easy. On the contrary, it is a formidable challenge, since most real-world pattern classes follow no analytical distribution known from probability theory.
Let us consider the case of two classes, the patterns of each of which are normally distributed (Gaussian distribution), with mean vector $m_i$, covariance matrix $C_i$, $i = 1, 2$, and pattern vector length $N$:
$$N(m_i, C_i) = \frac{1}{\sqrt{(2\pi)^N |C_i|}} \, e^{-\frac{1}{2}(x - m_i)^\top C_i^{-1} (x - m_i)} \quad (2.31)$$
The density (2.31) can be viewed as a conditional density $f(x \mid i)$ given the class $i$. The classes may have different prior probabilities $p_i$ (i.e., they do not occur equally frequently in reality). The Bayesian posterior probability of pattern $x$ belonging to the ith class is then
$$P_i = \frac{f(x \mid i) \, p_i}{f(x \mid 1) \, p_1 + f(x \mid 2) \, p_2} \quad (2.32)$$
Which class has the higher probability can be tested by comparing the ratio
$$\frac{P_1}{P_2} = \frac{f(x \mid 1) \, p_1}{f(x \mid 2) \, p_2} \quad (2.33)$$
with unity, or, alternatively, by comparing its logarithm
$$\ln(P_1) - \ln(P_2) = \ln(f(x \mid 1)) - \ln(f(x \mid 2)) + \ln(p_1) - \ln(p_2) \quad (2.34)$$
with zero.
Substituting (2.31) into (2.34) results in
$$\begin{aligned}
&\ln\frac{1}{\sqrt{(2\pi)^N |C_1|}} - \frac{1}{2}(x - m_1)^\top C_1^{-1}(x - m_1) \\
&\quad - \ln\frac{1}{\sqrt{(2\pi)^N |C_2|}} + \frac{1}{2}(x - m_2)^\top C_2^{-1}(x - m_2) + \ln(p_1) - \ln(p_2) \\
&= \frac{1}{2}\ln(|C_2|) - \frac{1}{2}\ln(|C_1|) + \frac{1}{2} x^\top \left(C_2^{-1} - C_1^{-1}\right) x \\
&\quad + \left(m_1^\top C_1^{-1} - m_2^\top C_2^{-1}\right) x - \frac{1}{2}\left(m_1^\top C_1^{-1} m_1 - m_2^\top C_2^{-1} m_2\right) + \ln(p_1) - \ln(p_2)
\end{aligned} \quad (2.35)$$
which can be made more transparent as
$$x^\top A x + b^\top x + d \quad (2.36)$$
with
$$A = \frac{1}{2}\left(C_2^{-1} - C_1^{-1}\right)$$
$$b^\top = m_1^\top C_1^{-1} - m_2^\top C_2^{-1}$$
$$d = \frac{1}{2}\ln(|C_2|) - \frac{1}{2}\ln(|C_1|) - \frac{1}{2}\left(m_1^\top C_1^{-1} m_1 - m_2^\top C_2^{-1} m_2\right) + \ln(p_1) - \ln(p_2)$$
A Bayesian optimum decision consists in assigning the pattern to Class 1 if
the expression (2.36) is positive and to Class 2 if it is negative.
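As an illustration, the following sketch (not from the book) evaluates the quadratic discriminant (2.36) for two hypothetical Gaussian classes; all means, covariances, and priors are assumed values chosen only for demonstration.

```python
# Minimal sketch of the quadratic discriminant (2.36); the class means,
# covariances, and priors below are illustrative assumptions.
import numpy as np

m1, m2 = np.array([1.0, 1.0]), np.array([-1.0, -0.5])
C1 = np.array([[1.0, 0.2], [0.2, 0.8]])
C2 = np.array([[1.5, -0.3], [-0.3, 1.0]])
p1, p2 = 0.6, 0.4

C1inv, C2inv = np.linalg.inv(C1), np.linalg.inv(C2)
A = 0.5 * (C2inv - C1inv)
b = C1inv @ m1 - C2inv @ m2
d = (0.5 * np.log(np.linalg.det(C2) / np.linalg.det(C1))
     - 0.5 * (m1 @ C1inv @ m1 - m2 @ C2inv @ m2)
     + np.log(p1) - np.log(p2))

def discriminant(x):
    # Positive value -> assign to Class 1, negative -> Class 2, cf. (2.36).
    return x @ A @ x + b @ x + d

print(discriminant(np.array([0.5, 0.5])))    # near m1: comes out positive
print(discriminant(np.array([-1.0, -1.0])))  # near m2: comes out negative
```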
Without prior probabilities $p_i$, the ratio (2.33) is the so-called likelihood ratio, which is a popular and well-elaborated statistical decision criterion. The decision function (2.36) is then the same, omitting the logarithms of $p_i$.
The criterion (2.36) is a quadratic function of the pattern vector $x$. The separating function is of the type depicted in Fig. 2.7. This concept can theoretically be applied to some other distributions beyond the Gaussian [12].
The discriminant function (2.36) is dedicated to normally distributed classes. If the mean vectors and the covariance matrices are not known, they can easily be estimated from the training set, as sample mean vectors and sample covariance matrices, with well-investigated statistical properties. However, the key problem is the assumption of normal distribution itself. It is easy to imagine that this assumption is rarely strictly satisfied. Sometimes, it is even clearly wrong.
Practical experience has shown that the discriminant function (2.36) is very sensitive to deviations from distribution normality. Paradoxically, better results are usually reached with a further assumption that is even less frequently satisfied: that of a common covariance matrix C identical for both classes. This roughly corresponds to both classes having the same "extent" in the input vector space.
For Gaussian classes with column vector means $m_1$ and $m_2$ and common covariance matrix $C$, the matrix $A$ and some parts of the constant $d$ become zero. The discriminant function becomes linear:
$$b^\top x + d > 0$$
with
$$b^\top = (m_1 - m_2)^\top C^{-1}$$
$$d = -\frac{1}{2} b^\top (m_1 + m_2) + \ln\frac{p_1}{p_2} = -\frac{1}{2} (m_1 - m_2)^\top C^{-1} (m_1 + m_2) + \ln\frac{p_1}{p_2} \quad (2.37)$$
This linear function is widely used in the linear discriminant analysis.
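A corresponding minimal sketch of the linear discriminant (2.37), again with assumed means, a common covariance matrix, and priors:

```python
# Minimal sketch of the linear discriminant (2.37) with a common covariance
# matrix; all numerical values are illustrative assumptions.
import numpy as np

m1, m2 = np.array([1.0, 1.0]), np.array([-1.0, -0.5])
C = np.array([[1.0, 0.2], [0.2, 0.8]])
p1, p2 = 0.5, 0.5

Cinv = np.linalg.inv(C)
b = Cinv @ (m1 - m2)                        # b^T = (m1 - m2)^T C^{-1}
d = -0.5 * b @ (m1 + m2) + np.log(p1 / p2)

def classify(x):
    # Class 1 if b^T x + d > 0, else Class 2, cf. (2.37).
    return 1 if b @ x + d > 0 else 2

print(classify(np.array([0.8, 0.9])), classify(np.array([-0.7, -0.6])))
```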
Interestingly, the separating function (2.37) can, under some assumptions, also be obtained with a least squares approach. For simplicity, it will be assumed that the mean over both classes, $m_1 p_1 + m_2 p_2$, is zero. Class 1 and Class 2 are coded by 1 and −1, and the pattern vector $x$ contains 1 at the last position.
The zero gradient of the least squares criterion is reached at
$$b^\top X X^\top = y^\top X^\top \quad (2.38)$$
Dividing both sides by the number of samples $K$ turns the matrices $X X^\top$ and $y^\top X^\top$ into sample moments (means and covariances). The expected values are
$$E\left[\frac{1}{K} b^\top X X^\top\right] = E\left[\frac{1}{K} y^\top X^\top\right] \quad (2.39)$$
The expression $\frac{1}{K} X X^\top$ corresponds to the sample second moment matrix. With the zero mean, as assumed above, it is equal to the sample covariance matrix. Every covariance matrix over a population divided into classes can be decomposed into the intraclass covariance $C$ (in this case, identical for both classes) and the interclass covariance
$$M = \begin{pmatrix} m_1 & m_2 \end{pmatrix}, \quad P = \begin{pmatrix} p_1 & 0 \\ 0 & p_2 \end{pmatrix}, \quad C_{cl} = M P M^\top \quad (2.40)$$
This can then be rewritten as
$$b^\top \begin{pmatrix} C + M P M^\top & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} p_1 m_1^\top - p_2 m_2^\top & p_1 - p_2 \end{pmatrix} \quad (2.41)$$
resulting in
$$b^\top = \begin{pmatrix} p_1 m_1^\top - p_2 m_2^\top & p_1 - p_2 \end{pmatrix} \begin{pmatrix} C + C_{cl} & 0 \\ 0 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} \left(p_1 m_1^\top - p_2 m_2^\top\right) \left(C + C_{cl}\right)^{-1} & p_1 - p_2 \end{pmatrix} \quad (2.42)$$
It is interesting to compare the linear discriminant (2.37) with the least squares solution (2.41) and (2.42). With the additional assumption of both classes having identical prior probabilities $p_1 = p_2$ (and identical counts in the training set), the absolute term of both (2.37) and (2.42) becomes zero. The matrix $C_{cl}$ contains covariances of only two classes and is thus of maximum rank two. The additional condition of the overall mean being equal to zero reduces the rank to one. As a result, the least squares-based separating vector $b$ is only rescaled in comparison with that of the separating function (2.37).
This statement can be inferred in the following way.
In the case of identical prior probabilities of both classes, the condition of zero mean of the distribution of all patterns is $m_1 + m_2 = 0$, or $m_2 = -m_1$. It can be rewritten as $m_1 = m$ and $m_2 = -m$ with the help of a single column vector of class means $m$. The difference of both means is $m_1 - m_2 = 2m$. The matrix $C_{cl}$ is
$$C_{cl} = \frac{1}{2}\begin{pmatrix} m_1 & m_2 \end{pmatrix}\begin{pmatrix} m_1^\top \\ m_2^\top \end{pmatrix} = \frac{1}{2}\left(m_1 m_1^\top + m_2 m_2^\top\right) = m m^\top \quad (2.43)$$
with rank equal to one—it is an outer product of only one vector m with itself.
The equation for the separating function $b$ of the linear discriminant is
$$b^\top C = 2 m^\top \quad (2.44)$$
while for the separating function $b_{LS}$ of least squares, it is
$$b_{LS}^\top (C + C_{cl}) = 2 m^\top \quad (2.45)$$
Let us assume the proportionality of both solutions by a factor $d$:
$$b_{LS} = d \, b \quad (2.46)$$
Then
$$d \, b^\top (C + C_{cl}) = 2 d \, m^\top + 2 d \, m^\top C^{-1} C_{cl} = 2 m^\top \quad (2.47)$$
or
$$m^\top C^{-1} C_{cl} = m^\top C^{-1} m \, m^\top = \frac{1 - d}{d} \, m^\top = e \, m^\top \quad (2.48)$$
with
$$e = \frac{1 - d}{d} \quad (2.49)$$
and
$$d = \frac{1}{1 + e} \quad (2.50)$$
The scalar proportionality factor $e$ in (2.48) can always be found since $C_{cl} = m m^\top$ is a projection operator onto a one-dimensional space. It projects every vector, i.e., also the vector $m^\top C^{-1}$, onto the space spanned by the vector $m$. In other words, these two vectors are always proportional. Consequently, a scalar proportionality factor $d$ for the separating functions can always be determined via (2.50). This means that proportional separating functions are equivalent since they separate identical regions.
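The proportionality can also be checked numerically. The following sketch (an illustration, not from the book) assumes class means $+m$ and $-m$, equal priors, and a common covariance matrix $C$, and compares the separating vectors obtained from (2.44) and (2.45):

```python
# Numerical check (illustrative) that the least squares separating vector is a
# rescaled version of the linear discriminant vector; m and C are assumed values.
import numpy as np

m = np.array([1.0, 0.5])
C = np.array([[1.0, 0.3], [0.3, 0.7]])
Ccl = np.outer(m, m)                      # interclass covariance, cf. (2.43)

b_lda = 2.0 * np.linalg.solve(C, m)       # from b^T C = 2 m^T, cf. (2.44)
b_ls = 2.0 * np.linalg.solve(C + Ccl, m)  # from b_LS^T (C + Ccl) = 2 m^T, cf. (2.45)

# The component-wise ratio should be (approximately) a constant scalar.
print(b_ls / b_lda)
```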
The result of this admittedly tedious argument is that the least squares solution fitting the training set to the class indicators 1 and −1 is equivalent to the optimum linear discriminant, under the assumption of
• normally distributed classes;
• identical covariance matrix of both classes;
• and classes with identical prior probabilities.

Fig. 2.10 Lateral projection of a linear separating function (class indicators 1 and −1)

This makes the least squares solution interesting, since it can be applied without assumptions about the distribution—of course with the caveat that it is not Bayesian optimal for other distributions. This seems to be the foundation of the popularity of this approach beyond the statistical community, for example, in neural network-based classification.
Its weakness is that the MSE reached cannot be interpreted in terms of misclassification error—we only know that at the MSE minimum, we are close to the optimum separating function. The reason for this lack of interpretability is that the values of the separating function grow with the distance from the hyperplane separating both classes, while the class indicators (1 and −1) do not—they remain constant at any distance. Consequently, the MSE attained by optimization may be large even if the classes are perfectly separated. This can be seen by imagining a "lateral view" of the vector space, given in Fig. 2.10. It is a cross section in the direction of the class separating line. The class indicators are constant: 1 (Class 1 to the left) and −1 (Class 2 to the right).
More formally, the separating function (for the case of separable classes) assigns the patterns, according to the test $b^\top x + d > 0$ for Class 1 membership, to the respective correct class. However, the value of $b^\top x + d$ is not equal to the class indicator $y$ (1 or −1). Consequently, the MSE $\left(b^\top x + d - y\right)^2$ is far from zero at the optimum. Although alternative separating functions with identical separating lines can have different slopes, none of them can reach zero MSE. So, the MSE does not reflect the misclassification rate.
This shortcoming can be alleviated by using a particular nonlinear function of the term $b^\top x + d$. Since this function is usually used in the form producing class indicators 1 for Class 1 and zero for Class 2, it will reflect the rescaled linear situation of Fig. 2.11.

Fig. 2.11 Lateral projection of a linear separating function (class indicators 1 and 0)

Fig. 2.12 Logistic (sigmoid) function
The nonlinear function is called the logistic or logit function in statistics and econometrics. With neural networks, it is usually referred to as the sigmoid function, related via rescaling to the hyperbolic tangent (tanh). It is a function of a scalar argument $z$:
$$y = s(z) = \frac{1}{1 + e^{-z}} \quad (2.51)$$
This function maps the argument $z \in (-\infty, \infty)$ to the interval $[0, 1]$, as shown in Fig. 2.12.
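As a minimal illustration, the logistic function (2.51) can be evaluated in a few lines of code; the sample arguments below are arbitrary.

```python
# Minimal sketch of the logistic (sigmoid) function (2.51).
import numpy as np

def sigmoid(z):
    # Maps any real argument into (0, 1), saturating towards 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))   # approximately [0.00005, 0.27, 0.5, 0.73, 0.99995]
```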
Applying (2.51) to the linear separating function $b^\top x + d$, that is, using the nonlinear separating function
$$y = s\left(b^\top x + d\right) = \frac{1}{1 + e^{-(b^\top x + d)}} \quad (2.52)$$
will change the picture of Fig. 2.11 to that of Fig. 2.13. The forecast class indicators
(red crosses) are now close to the original ones (blue and green circles).
The MSE is
$$\left(s\left(b^\top x + d\right) - y\right)^2 \quad (2.53)$$
For separable classes, MSE can be made arbitrarily close to zero, as depicted in
Fig. 2.14. The proximity of the forecast and true class indicators can be increased
by increasing the norm of the weight vector $b$. This unbounded norm is a shortcoming of this approach if used with perfectly separable classes.

Fig. 2.13 Lateral projection of a logistic separating function (class indicators 1 and 0)

Fig. 2.14 A sequence of logistic separating functions
For non-separable classes, this danger disappears. There is an optimum in which the value of the logistic class indicator has the character of a probability. For patterns in the region where the classes overlap, the larger its value, the more probable the membership in Class 1. This is illustrated as the lateral projection in Fig. 2.15.
Unfortunately, in contrast to a linear separating function with Gaussian classes, minimizing the MSE with a logistic separating function has no guaranteed optimality properties with regard to either the misclassification loss or the class membership probability.
How to follow the probabilistic cue more consistently is discussed in the following Sect. 2.2.3.
Fig. 2.15 Logistic separating function with two non-separable classes
2.2.3 Probabilistic Classification

First, it has to be mentioned that the misclassification rate itself is a probabilistic concept. It can be viewed as the probability that a pattern is erroneously assigned to a wrong class.
The approach discussed in this section adopts another view. With two classes, the probability $p$ can be assessed that a pattern belongs to Class 1, while the probability of belonging to Class 2 is complementary, that is, $1 - p$. For a given pattern, $p$ is a conditional probability conditioned on this pattern:
$$p(x) = P(y = 1 \mid x) \quad (2.54)$$
From the probabilistic point of view, the class membership of pattern $x$ is a random process, governed by the Bernoulli distribution—a distribution with exactly the properties formulated above: probability $p$ of membership in Class 1 and $1 - p$ for the opposite. The probability is a function of the input pattern vector $x$.
The classification problem consists in finding a function $f(x, w)$, parameterized by a vector $w$, which is a good estimate of the true probability $p$ of membership of pattern $x$ in Class 1. This approach is a straightforward application of the principle explained in Sect. 2.1.2. The distribution concerned here is the Bernoulli distribution with a single distribution parameter $p$.
For a pattern vector $x_k$ and a scalar class indicator $y_k$, the likelihood of the probability $p$ resulting as a function $f(x, w)$ is
$$\begin{cases} f(x_k, w), & y_k = 1 \\ 1 - f(x_k, w), & y_k = 0 \end{cases} \quad (2.55)$$
This can be written more compactly as
$$f(x_k, w)^{y_k} \left(1 - f(x_k, w)\right)^{1 - y_k} \quad (2.56)$$
where the exponents yk and 1 − yk acquire values 0 or 1 and thus “select” the correct
alternative from (2.55).
For a sample (or training set) of mutually independent samples, the likelihood over this sample is the product
$$\prod_{k=1}^{K} f(x_k, w)^{y_k} \left(1 - f(x_k, w)\right)^{1 - y_k} = \prod_{k=1, y_k=1}^{K} f(x_k, w) \prod_{k=1, y_k=0}^{K} \left(1 - f(x_k, w)\right) \quad (2.57)$$
Maximizing (2.57) is the same as minimizing its negative logarithm
$$L = -\sum_{k=1}^{K} \left[ y_k \ln f(x_k, w) + (1 - y_k) \ln\left(1 - f(x_k, w)\right) \right] = -\sum_{k=1, y_k=1}^{K} \ln f(x_k, w) - \sum_{k=1, y_k=0}^{K} \ln\left(1 - f(x_k, w)\right) \quad (2.58)$$

If the training set is a representative sample from the statistical population associated with pattern $x_k$, the expected value of the likelihood per pattern, $L/K$, can be evaluated. The only random variable in (2.58) is the class indicator $y$, with probability $p$ of being equal to one and $1 - p$ of being zero:
$$E[L/K] = -E\left[y_k \ln(f(x_k, w)) + (1 - y_k) \ln(1 - f(x_k, w))\right] = -\left[p(x_k) \ln(f(x_k, w)) + (1 - p(x_k)) \ln(1 - f(x_k, w))\right] \quad (2.59)$$
The minimum of this expectation is where its derivative with regard to the output of the mapping $f()$ is zero:
$$\frac{\partial E[L/K]}{\partial f} = -\left[\frac{\partial}{\partial f} p(x_k) \ln(f(x_k, w)) + \frac{\partial}{\partial f} (1 - p(x_k)) \ln(1 - f(x_k, w))\right] = -\left[\frac{p(x_k)}{f(x_k, w)} - \frac{1 - p(x_k)}{1 - f(x_k, w)}\right] = 0 \quad (2.60)$$
$$p(x_k)\left(1 - f(x_k, w)\right) - \left(1 - p(x_k)\right) f(x_k, w) = 0$$
$$f(x_k, w) = p(x_k)$$

This means that if the mapping $f()$ is expressive enough to be parameterized to hit the conditional Class 1 probability for all input patterns $x$, this can be reached by minimizing the negative log-likelihood (2.58). In practice, a perfect fit will not be possible. In particular, with a mapping $f(x) = Bx$, it is clearly nearly impossible because the outputs $Bx$ would probably fail to remain in the interval $(0, 1)$. Also, with a logistic regression (2.52), it will only be an approximation for which no analytical solution is known. However, iterative numerical methods frequently lead to good results.
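One possible iterative approach is sketched below (an illustration, not from the book): the negative log-likelihood (2.58) of a logistic model (2.52) is minimized numerically on a hypothetical toy data set with a general-purpose optimizer.

```python
# Minimal sketch: fit a logistic model (2.52) by numerically minimizing the
# negative log-likelihood (2.58); data and optimizer choice are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(50, 2))    # Class 1 (y = 1)
X0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(50, 2))  # Class 2 (y = 0)
X = np.vstack([X1, X0])
X = np.hstack([X, np.ones((X.shape[0], 1))])                # append 1 for the bias
y = np.concatenate([np.ones(50), np.zeros(50)])

def neg_log_likelihood(w):
    f = 1.0 / (1.0 + np.exp(-X @ w))     # logistic output, cf. (2.52)
    eps = 1e-12                          # numerical safeguard against log(0)
    return -np.sum(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))

result = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), method="BFGS")
print("fitted parameters:", result.x)
```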
As an alternative to the maximum likelihood principle, a least squares solution minimizing the square deviation between the forecast and the true class indicator can be considered.
For a sample $(x_k, y_k)$, the error is
$$e_k = \left(f(x_k, w) - y_k\right)^2 \quad (2.61)$$
The mean value of the error over the whole sample population, that is, the MSE, is
$$E = E\left[\left(f(x_k, w) - y_k\right)^2\right] = p(x_k)\left(f(x_k, w) - 1\right)^2 + \left(1 - p(x_k)\right)\left(f(x_k, w)\right)^2 \quad (2.62)$$
This is minimized for values of $f()$ satisfying
$$\frac{\partial E}{\partial f} = \frac{\partial}{\partial f}\left[p(x_k)\left(f(x_k, w) - 1\right)^2 + \left(1 - p(x_k)\right)\left(f(x_k, w)\right)^2\right] = 2 p(x_k)\left(f(x_k, w) - 1\right) + 2\left(1 - p(x_k)\right) f(x_k, w) = 2\left(f(x_k, w) - p(x_k)\right) = 0 \quad (2.63)$$
or
$$f(x_k, w) = p(x_k) \quad (2.64)$$
Obviously, minimizing the MSE is equivalent to the maximum likelihood approach, provided the parameterized approximator $f(x, w)$ is powerful enough to capture the dependence of the Class 1 probability on the input pattern vector.
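This equivalence can be illustrated numerically; the following sketch (not from the book) assumes a fixed pattern with true Class 1 probability p = 0.3 and evaluates both criteria over a grid of candidate model outputs f.

```python
# Illustrative check: for a fixed pattern with assumed true Class 1 probability
# p, both the expected squared error (2.62) and the expected negative
# log-likelihood (2.59) are minimized at f = p.
import numpy as np

p = 0.3                                   # assumed true Class 1 probability
f = np.linspace(0.01, 0.99, 981)          # candidate model outputs

mse = p * (f - 1.0) ** 2 + (1.0 - p) * f ** 2
nll = -(p * np.log(f) + (1.0 - p) * np.log(1.0 - f))

print("MSE minimized at f =", f[np.argmin(mse)])   # approximately 0.3
print("NLL minimized at f =", f[np.argmin(nll)])   # approximately 0.3
```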
Although the least squares measure is not strictly identical to the misclassification loss, they reach their minimum for the same parameter set (assuming sufficient representation power of the approximator, as stated above). In asymptotic terms, the least squares are close to zero if the misclassification loss is close to zero, that is, if the classes are separable. However, for strictly separable classes, there is a singularity—the optimum parameter set is not unambiguous, and the parameter vector may grow without bounds.

With a parameterized approximator $f(x, w)$ that can exactly compute the class probability for a given pattern $x$ and some parameter vector $w$, the exact fit is at both the maximum of the likelihood and the minimum of the MSE (i.e., least squares). Of course, to reach this exact fit, an optimization algorithm that is capable of finding the optimum numerically has to be available. This may be difficult for strongly nonlinear approximators.
Least squares with a logistic activation function seems to be the approach to classification that satisfies relatively well the requirements formulated at the beginning of Sect. 2.2.