Gar CIA 2004 Thesis
SECTION D’ÉLECTRICITÉ
PAR
Lausanne, EPFL
2004
Acknowledgments
This thesis is the result of three years’ work and it would never have been possible for me
to accomplish this without the help, support and encouragement from many people. First,
I wish to express my gratitude to Prof. Touradj Ebrahimi for giving me the possibility to
do this thesis in his group. His engagement in my research and the time he spent with
me discussing different problems ranging from philosophical issues to technical details have
been essential for the results presented here.
I am especially grateful to Dr. Jean-Marc Vesin, with whom I have been working very
closely in most of the research presented here. His never-ending stream of ideas and his
passion for research were a source of inspiration for me.
My gratitude goes to Prof. Juan Mosig, Prof. Ferran Marqués, Prof. Yves Biollay, and
Dr. Thomas Koenig for agreeing to serve on the committee, and for their valuable
comments on my work. I would like to express special thanks to Prof. Yves Biollay for his
interest in my work and the fruitful discussions we had.
For his valuable help in shaping many ideas for this work, I would like to acknowledge
Ulrich Hoffmann with whom I had the privilege to work in close collaboration.
I also wish to thank Lam Dang, who provided me with valuable input to improve the
comprehensibility of this document, and foremost because during these years we shared our
interest in signal processing and machine learning.
Several friends, including Jonathan Nieto, Emir Vela, and Abel Villca participated in
the experiments, and made useful suggestions to improve the usability of the system.
For the help and support in the technical and administrative matters, I would like to
thank Gilles Auric, Marianne Marion, and Fabienne Vionnet.
Finally, my sincere gratitude goes to my family for their continuous support and en-
couragement.
Contents
1 Introduction
1.1 Context
1.2 Objectives and Approaches
1.3 Main contributions
1.4 Outline
4 Feature extraction
4.1 Introduction
4.2 An overview of time-frequency analysis for stochastic signals
4.2.1 Time-frequency analysis of univariate stochastic signals
4.2.2 Time-frequency analysis of multivariate stochastic signals
4.2.3 Stationarity
4.2.4 Ergodicity
4.2.5 Absence of coupling between the univariate components
4.2.6 Existence of a linear prediction model
4.2.7 Weak coupling
4.3 Stationary PSD mapping
4.4 Coherence mapping
4.5 Autoregressive mapping
4.6 Non-stationary autoregressive mapping
4.7 Multivariate autoregressive mapping
4.8 Synchronization mapping
4.9 Summary
5 Pattern recognition
5.1 Introduction
5.2 Membership functions
5.3 Estimation of the membership parameters
5.3.1 Loss function
5.3.2 Risk functional
5.3.3 Nature of the functional space H
5.3.4 Geometrical interpretation
5.3.5 Regularized risk minimization
5.4 Kernel function
5.4.1 Polynomial kernel
5.4.2 Gaussian kernel
5.5 Choice of the parameters ν and σ
5.5.1 Cross-validation approach
5.5.2 Theoretical bound based approach
5.6 Dynamic updating of the membership parameters
5.6.1 Dynamic updating of the membership function
5.6.2 Dynamic updating of the membership threshold and offset
5.7 Summary
7 Conclusions
7.1 Summary of achievements
7.2 Future directions
A Appendix
A.1 Membership boundary induced by the Gaussian kernel
A.2 Computing the radius of the smallest sphere containing the training data
A.3 Computing the derivative of the theoretical bound B with respect to σ
A.3.1 Computing $\left.\partial R^2/\partial\sigma\right|_{\vec{\alpha}\ \text{fixed}}$
A.3.2 Computing $\left.\partial \|w\|_H^2/\partial\sigma\right|_{\vec{\alpha}\ \text{fixed}}$
A.3.3 Computing $\left.\partial\rho/\partial\sigma\right|_{\vec{\alpha}\ \text{fixed}}$
A.3.4 Computing $\partial\vec{\alpha}/\partial\sigma$
A.3.5 Computing $\nabla_{\vec{\alpha}_L} \|w\|_H^2$
A.3.6 Computing $\partial\rho/\partial\vec{\alpha}_L$
B Appendix
B.1 Training without feedback sessions
B.2 Training with feedback sessions
Abridged Version
Signals obtained by electroencephalography (EEG) provide clues about the combined synaptic
activity of groups of neurons. In addition to their clinical applications, EEG signals can be
used as a basis for developing direct communication interfaces between brain and computer
(brain-computer interface, BCI). When mental activities are performed, specific
characteristics appear in the EEG. If actions (produced by the BCI) are mapped to types of
characteristics associated with mental activities that involve no physical effort,
communication by thought alone becomes possible. The user operates the BCI by performing
mental activities that the BCI recognizes by means of recognition models established during
a training phase.
In this thesis, we consider the positioning of an object in a two-dimensional
computer-rendered environment (CRE). The object can be moved in four directions
corresponding to different mental activities. BCI operation is asynchronous, that is, the
system is permanently active and generates object movements only when it recognizes one of
the corresponding mental activities. The BCI analyzes EEG segments and generates movements
according to a set of rules (action rules) that are adapted to the user’s level of
experience in controlling the application.
EEG signals have low amplitude and are therefore particularly sensitive to external
perturbations. In addition, the abrupt changes that appear during muscular activity, ocular
activity in particular (artifacts), can hinder BCI operation and even lead to erroneous
conclusions about the user’s ability to control the BCI. It is therefore particularly
important to filter out external perturbations and to detect artifacts. External
perturbations are filtered with classical signal processing techniques, and artifacts are
detected with an event detection algorithm based on kernel methods. The detection parameters
are calibrated interactively at the beginning of each experiment. When an artifact is
detected in an EEG segment, the BCI notifies the user by means of a particular event
produced in the CRE.
The analysis of the properties of EEG signals in time, frequency, and phase provides
statistical measures (features) that are useful for recognizing mental activities from EEG
segments. However, an extensive analysis in the time, frequency, and phase domains would
produce a very large number of features. Under hypotheses on the nature of EEG signals, the
number of features needed can be reduced. The features are grouped into a feature vector
from which the recognition models are built using machine learning concepts. From a machine
learning point of view, low-dimensional feature vectors are preferable because they reduce
the risk of overfitting.
The recognition models are built on the basis of statistical learning theory, and more
particularly kernel methods. The advantage of this approach is that the resulting
recognition models achieve higher recognition rates than others while remaining very
flexible.
Adequate BCI operation requires continuous adaptation of the recognition models to possible
changes in the EEG signals resulting from different external conditions and from the user’s
habituation to the BCI. This adaptation is implemented through dynamic learning of the
parameters of the recognition models. These parameters can thus be updated continuously and
with considerable efficiency in terms of computation time.
At the end of a first series of training sessions, the methods for extracting the feature
vectors are chosen (according to an optimality criterion linked to the recognition error),
the recognition models for each mental activity are built, and the action rules are
established. During these sessions, the mental activities are presented according to a
schedule defined by a training protocol.
In the subsequent training sessions, the BCI gives feedback to the user indicating the
degree of recognition of the mental activity he was asked to perform. The user can thus
modulate his brain activity in order to obtain positive feedback. At the end of each
session, the recognition models are updated. This is easily accomplished owing to the
dynamic nature of the model parameters. Since the recognition models change dynamically,
the action rules must change accordingly. This happens automatically because the action
rules depend on the model parameters.
The BCI developed in this thesis was validated through experiments on six subjects who took
part in nine training sessions. The first three sessions served to choose the feature
vector extraction methods, build the initial recognition models, and establish the action
rules. In the last six sessions, in addition to the experiment with feedback, object
positioning experiments were carried out in order to evaluate the experience acquired in
each session. The evaluation followed two criteria, namely the theoretical computation of
the information transfer rate based on the mean recognition error over the mental
activities, and the experimental measurement of the information transfer rate associated
with the positioning test. The latter has the advantage of reflecting the subject’s real
capabilities more closely. Both measures of information transfer rate increased over the
last six sessions and reached an average (over subjects) of 126 and 25 bits per minute,
respectively.
Abstract
Scalp recorded electroencephalogram (EEG) signals reflect the combined synaptic and
axonal activity of groups of neurons. In addition to their clinical applications, EEG signals
can be used as support for direct brain-computer communication devices (brain-computer
interfaces, BCIs). Indeed, during the performance of mental activities, EEG patterns that
characterize them emerge. If actions executed by the BCI are associated with classes of
patterns resulting from mental activities that do not involve any physical effort,
communication by means of thoughts is achieved. The subject operates the BCI by performing
mental activities which are recognized by the BCI through comparison with recognition
models that are set up during a training phase.
In this thesis we consider a 2D object positioning application in a computer-rendered
environment (CRE) that is operated with four mental activities (controlling MAs). BCI
operation is asynchronous, namely the system is always active and reacts only when it
recognizes any of the controlling MAs. The BCI analyzes segments of EEG (EEG-trials)
and executes actions on the CRE in accordance with a set of rules (action rules) adapted
to the subject’s controlling skills.
EEG signals have small amplitudes and are therefore sensitive to external electromag-
netic perturbations. In addition, subject-generated artifacts (ocular and muscular) can
hinder BCI operation and even lead to misleading conclusions regarding the real control-
ling skills of a subject. Thus, it is especially important to remove external perturbations and
detect subject-generated artifacts. External perturbations are removed using established
signal processing techniques and artifacts are detected through a singular event detection
algorithm based on kernel methods. The detection parameters are calibrated at the begin-
ning of each experimental session through an interactive procedure. Whenever an artifact
is detected in an EEG-trial the BCI notifies the subject by executing a special action.
Features that are relevant for the recognition of the controlling MAs are extracted from
EEG-trials (free of artifacts) through the statistical analysis of their time, frequency, and
phase properties. Since a complete analysis covering all these aspects would result in a
very large number of features, various hypotheses on the nature of EEG are considered in
order to reduce the number of needed features.
Features are grouped into feature vectors that are used to build the recognition models
using machine learning concepts. From a machine learning point of view, low-dimensional
feature vectors are preferable because they reduce the risk of overfitting.
Notation and Terminology
Functions
$A_s(\cdot,\cdot)$ Ambiguity function of $s$
$A_{m_1,m_2}(\cdot,\cdot)$ Inter ambiguity function of signals $s_{m_1}$ and $s_{m_2}$
$\langle\cdot,\cdot\rangle_0$ Inner product in vector space 0
$\|\cdot\|_0$ Norm in vector space 0
$c_k(\cdot,\cdot,\cdot)$ k-th loss function associated with $\mathrm{MA}_k$
$C_{m_1,m_2}(\cdot)$ Coherence function of signals $s_{m_1}$ and $s_{m_2}$
Abbreviations
ADB Artifact detection block
AIC Akaike information criterion
BCI EEG based brain-computer interface
CRE Computer-rendered environment
EEG Scalp recorded electroencephalogram
ERD Event related desynchronization
ERS Event related synchronization
ERP Event related potentials
FPE Final prediction error
fMRI Functional magnetic resonance imaging
List of Figures
4.9 Estimation of the synchronization between two EEG channels in the alpha band
4.10 Summary of mappings from the EEG-trial set into feature vector spaces
A.1 Separating boundary for growing values of the Gaussian kernel parameter
A.2 Distribution of the expansion coefficients for growing values of the Gaussian kernel parameter
A.3 Evolution of the FSV, FTE and GE for growing values of the Gaussian kernel parameter
List of Tables
B.1 Subject S1: Recognition error associated with each MA and mapping
B.2 Subject S2: Recognition error associated with each MA and mapping
B.3 Subject S3: Recognition error associated with each MA and mapping
B.4 Subject S4: Recognition error associated with each MA and mapping
B.5 Subject S5: Recognition error associated with each MA and mapping
B.6 Subject S6: Recognition error associated with each MA and mapping
B.7 Subject S1: Number of EEG-trials after artifact-detection (training-with-feedback sessions)
B.8 Subject S1: Recognition error evolution over training-with-feedback sessions
B.9 Subject S1: Positioning tests results
B.10 Subject S2: Number of EEG-trials after artifact-detection (training-with-feedback sessions)
B.11 Subject S2: Recognition error evolution over training-with-feedback sessions
B.12 Subject S2: Positioning tests results
B.13 Subject S3: Number of EEG-trials after artifact-detection (training-with-feedback sessions)
B.14 Subject S3: Recognition error evolution over training-with-feedback sessions
B.15 Subject S3: Positioning tests results
B.16 Subject S4: Number of EEG-trials after artifact-detection (training-with-feedback sessions)
B.17 Subject S4: Recognition error evolution over training-with-feedback sessions
B.18 Subject S4: Positioning tests results
Chapter 1. Introduction
tems because of its noninvasiveness, relative simplicity and low cost. We therefore focus our
attention on the design and development of a scalp recorded electroencephalogram based
BCI.
The types of MAs used in current BCIs were chosen in accordance with brain hemi-
spheric specialization studies which suggest that the two brain hemispheres are specialized
for different cognitive functions. The left hemisphere appears to be predominantly involved
during verbal and other analytical functions and the right hemisphere in spatial and holistic
processing. Thus typical MAs include: evoked response to external stimuli, imagined limb
movement, and spatial, geometrical, arithmetical, and verbal operations.
According to the type of MAs they use, BCIs are categorized into evoked response based and
operant conditioning based ones. Evoked response based BCIs rely on the subject focusing his
attention on particular stimuli that are associated with actions. Operant conditioning based
BCIs react to particular MAs (controlling MAs) executed by the subject. These MAs are
recognized by the BCI through recognition models that are built during a training phase
and continuously updated.
BCI operation can be of two types: synchronous and asynchronous. In synchronous
BCIs, the system is active only during some periods defined by the operator and the subject.
Conversely, asynchronous BCIs are always active and react only when the subject performs
the controlling MAs.
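
The difference between the two operation modes can be illustrated with a minimal sketch. It assumes a hypothetical recognizer that returns the label of a controlling MA, or `None` when no controlling MA is recognized; the function names, the labels and the toy recognizer are illustrative assumptions and are not taken from the thesis.

```python
from typing import Callable, Iterable, List, Optional, Sequence

Recognizer = Callable[[Sequence[float]], Optional[str]]

def run_asynchronous(trials: Iterable[Sequence[float]],
                     recognize: Recognizer) -> List[str]:
    """Always active: every EEG-trial is analyzed, but the system reacts
    only when a controlling MA is recognized."""
    actions = []
    for trial in trials:
        label = recognize(trial)          # None means "no controlling MA"
        if label is not None:
            actions.append(label)         # trigger the associated action
    return actions

def run_synchronous(trials: Iterable[Sequence[float]],
                    recognize: Recognizer,
                    active_flags: Iterable[bool]) -> List[str]:
    """Active only during predefined periods: trials outside them are skipped."""
    actions = []
    for trial, active in zip(trials, active_flags):
        if not active:
            continue
        label = recognize(trial)
        if label is not None:
            actions.append(label)
    return actions

def toy_recognize(trial: Sequence[float]) -> Optional[str]:
    """Toy recognizer (assumption): the trial mean selects a label."""
    mean = sum(trial) / len(trial)
    if mean > 0.5:
        return "right"
    if mean < -0.5:
        return "left"
    return None

trials = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]
print(run_asynchronous(trials, toy_recognize))
print(run_synchronous(trials, toy_recognize, [False, True, True]))
```

In the asynchronous loop the idle trial produces no action, whereas in the synchronous loop a trial outside the active period is never examined at all.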
tifacts (e.g. ocular and muscular artifacts), and adequate design of the training protocols
and the evaluation scheme. Detection of artifacts is especially important as they can lead
to misleading conclusions about the subject’s ability to control the BCI. Indeed, the subject
might be (voluntarily or not) controlling the BCI by generating artifacts.
• Choice of the optimal feature extraction method for each mental activity.
• Definition of action rules that determine BCI operation as a function of the mental
activities and the subject’s controlling skills.
1.4 Outline
The dissertation is organized in seven chapters. In chapter 2, the BCI architecture, oper-
ation mode and main concepts are defined. Furthermore, state-of-the-art implementations
are presented and compared. The general architecture presented in this chapter serves as
a thread for subsequent chapters in which we detail our solution.
Chapter 3 presents the basic elements of electroencephalography, the EEG acquisition
procedure, and the preprocessing algorithms aiming at removing external noise and detect-
ing artifacts in EEG signals. In chapter 4, a general framework for the analysis of EEG
is developed and through the establishment of hypotheses on the nature of EEG signals,
we derive several feature extraction methods. Chapter 5 describes the algorithms used to
establish recognition models for the mental activities used to control the BCI as well as the
dynamic updating of these models’ parameters. Chapter 6 presents the application of our
BCI implementation in the framework of an asynchronous 2D positioning application.
Conclusions and an outline of some interesting future research directions are presented
in Chapter 7. Complementary details on the nature of the recognition models discussed in
Chapter 5 are given in Appendix A. Finally, complementary numerical results corresponding
to the experiments that we carried out are provided in Appendix B.
Chapter 2. Conceptual framework and state of the art
“An expert is a man who has made all the mistakes
which can be made, in a very narrow field.” Niels Bohr
2.1 Introduction
A BCI is a communication system which allows a subject to act on his environment only
by means of his thoughts, without using the brain’s normal output pathways of peripheral
nerves and muscles. Like any communication system, a BCI has inputs (electrophysiological
signals that result from brain activity monitoring), outputs (device actions), elements that
transform inputs into outputs, and a protocol that determines its operation [167, 175].
The subject controls the active device by performing mental activities (MAs) which are
associated with actions that are dependent on the BCI application (see Fig. 2.1). Typi-
cal BCI applications include control of the elements in a computer-rendered environment
(e.g. cursor positioning [63, 176], visit of a virtual apartment [11, 12]), spelling programs
(e.g. virtual keyboard [120]), and command of an external device (e.g. robot [107], prosthe-
sis [128]).
The association between MAs and actions requires the selection of a set (controlling
set) of MAs to which the BCI responds, and the identification of signatures in the brain
activity that characterize each MA in the controlling set. These signatures are identified
through the analysis of the electrophysiological signals recorded during the performance of
the MAs in the controlling set.
The basic BCI design is depicted in Fig. 2.1. The monitoring of the subject’s brain
activity results in electrophysiological signals that are analyzed by the signal processing
block. The latter computes measurements on these signals (features) that are grouped into
a feature vector which is sent to the translation-into-commands block. This block recognizes
the signatures characterizing the MAs in the controlling set and triggers the corresponding
action on the active device. As this action can be noticed by the subject, it constitutes
a feedback that he can use to modulate his mental activities so as to obtain the desired
result.
Figure 2.1. Basic BCI design. Like any communication system, a BCI has inputs (electrophysiological
signals that result from brain activity monitoring), outputs (device actions), elements that transform
inputs into outputs, and a protocol that determines its operation.
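
The chain from signals to actions described above can be sketched as two small functions. This is only an illustration under stated assumptions: the mean-power feature, the nearest-prototype rule, the `max_dist` rejection threshold and the action names are hypothetical placeholders and are not the methods developed in this thesis.

```python
from typing import Dict, List, Optional, Sequence

def extract_features(trial: Sequence[Sequence[float]]) -> List[float]:
    """Signal processing block (assumed feature): mean power of each channel."""
    return [sum(x * x for x in channel) / len(channel) for channel in trial]

def translate(features: Sequence[float],
              prototypes: Dict[str, Sequence[float]],
              max_dist: float) -> Optional[str]:
    """Translation-into-commands block (assumed rule): return the action whose
    prototype feature vector is nearest, or None if none is within max_dist."""
    best_action, best_dist = None, max_dist
    for action, proto in prototypes.items():
        dist = sum((f - p) ** 2 for f, p in zip(features, proto)) ** 0.5
        if dist <= best_dist:
            best_action, best_dist = action, dist
    return best_action

# Two-channel toy trial: activity on the first channel only.
trial = [[1.0, -1.0, 1.0, -1.0], [0.0, 0.0, 0.0, 0.0]]
prototypes = {"move_left": [1.0, 0.0], "move_right": [0.0, 1.0]}

features = extract_features(trial)
action = translate(features, prototypes, max_dist=0.5)
print(features, action)
```

Returning `None` when no prototype is close enough mirrors the idea that the device acts only when a signature of the controlling set is recognized.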
In this chapter, we review the approaches in terms of brain activity monitoring and
types of MAs used in current BCIs. In particular, we focus our attention on scalp recorded
electroencephalogram based BCIs, propose a detailed architecture for such BCIs, and
discuss different implementations in the framework of existing BCIs.
latter are recorded as electroencephalogram (EEG) measurements from the scalp (in which
case, they reflect the activity in large areas of brain cortex), from small electrodes within the
brain (in which case, they reflect the activity in small immediately adjacent areas of tissue),
or from epidural or subdural locations in between these two extremes. In general, the more
invasive the electrodes, the better the topographical resolution and the signal-to-noise
ratio [167].
The spatial scale of an intracortical electrode (10 µm to 1 mm) depends on the size of the
electrode tip, whereas the scale of unprocessed scalp EEG (6 to 10 cm) is largely independent
of electrode size. Scalp EEG scale may be reduced (to 2 to 3 cm) by using a combination
of multiple electrode arrays (64 or 128 electrodes) and high-resolution EEG algorithms
(spatial filtering). Intracortical (or invasive) electrodes achieve higher spatial resolution at
the expense of spatial coverage and significant increase in cost and risk [117, 118].
Invasive methods need neurosurgical implantation and were first used to record action
potentials in the cerebral cortices of awake animals during movements [45, 76]. With
operant conditioning methods, several studies showed that monkeys could learn to control
the discharge of single neurons in their motor cortex [47, 149, 179, 180]. From such work,
came the expectation that humans could develop similar control and use it to communicate
or to operate neuroprostheses. Evaluation of this possibility was delayed by the lack of
intracortical electrodes suitable for human use and capable of stable long-term recording
from single neurons. Presently, few research groups are active in invasive BCIs in humans.
In [88, 90, 91], a special type of cone electrodes is used to record stable action potentials
from neurons in the motor cortex; such potentials are used to move a cursor to select icons
or letters on a computer screen. The signal-to-noise ratio can be increased substantially
by invasive technologies. However, research in this field is rather limited as people may be
reluctant to agree to brain implants for research purposes especially because, at present,
successful control or communication with an invasive BCI cannot be guaranteed.
Because it combines high temporal resolution, relatively simple acquisition, and low cost,
scalp recorded EEG is predominantly used in current BCIs. In the following, we concentrate
on describing the architecture and functioning of scalp recorded EEG based BCIs
(henceforth simply called BCIs), and the mental activities used to operate such BCIs.
The MAs used in current BCIs were chosen in accordance with brain hemispheric special-
ization studies which suggest that the two hemispheres of the human brain are specialized
for different cognitive functions. In particular, the left hemisphere appears to be predom-
inantly involved in verbal and other analytical functions and the right one in spatial and
holistic processing [5, 55]. Thus, typical MAs include: evoked responses to external stimuli,
imagined limb movement, and spatial, geometrical, arithmetical and verbal operations.
BCIs can be categorized, by the type of MAs they use, into evoked response based and operant
conditioning based BCIs. Both types are presented hereafter.
Evoked responses are related to cognitive methods in psychology [163, 167] which consider
the mind as an information processing device whose output depends on the relationship
between stimuli and the activation of cognitive processes.
External visual or auditory events (e.g. blinking objects on a computer screen, flashing
elements on a grid or brief sounds) elicit transient signals in the EEG that are characterized
by voltage deviations known as event related potentials (ERPs). When the subject pays
attention to a particular stimulus, an ERP that is time locked with that stimulus appears
in his EEG. The changes in the EEG signals induced by the ERP can be detected by
using averaging or blind source separation methods [75, 99]. If actions are associated with
stimuli, the subject can gain control of the BCI by focusing his attention on the stimulus
corresponding to the desired action.
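
The averaging approach mentioned above can be illustrated on synthetic data. The sketch below assumes a hypothetical ERP shaped as a Gaussian bump and independent Gaussian background noise; the shape, the noise level and the number of epochs are arbitrary choices made for illustration and do not describe real EEG or the methods of [75, 99].

```python
import math
import random

def erp_template(length=100, peak=50, width=5.0):
    """Hypothetical ERP shape: a Gaussian bump of unit amplitude."""
    return [math.exp(-((i - peak) ** 2) / (2 * width ** 2)) for i in range(length)]

def simulate_epoch(rng, template, noise_sd=5.0):
    """One stimulus-locked epoch: the ERP buried in much larger background noise."""
    return [s + rng.gauss(0.0, noise_sd) for s in template]

def average_epochs(epochs):
    """Point-by-point average of equally long, time-locked epochs."""
    n = len(epochs)
    return [sum(e[i] for e in epochs) / n for i in range(len(epochs[0]))]

def rms_error(signal, reference):
    """Root-mean-square difference between a signal and the known template."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(signal, reference)) / len(signal))

rng = random.Random(0)                      # fixed seed for reproducibility
template = erp_template()
epochs = [simulate_epoch(rng, template) for _ in range(400)]

single_error = rms_error(epochs[0], template)
averaged_error = rms_error(average_epochs(epochs), template)
print(f"single epoch RMS error:      {single_error:.2f}")
print(f"400-epoch average RMS error: {averaged_error:.2f}")
```

Because the simulated noise is independent across epochs while the ERP is time locked, the noise in the average shrinks roughly as one over the square root of the number of epochs, which is also why averaging costs communication time.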
Examples of BCIs functioning under evoked conditions are those using the P300 and
the steady state visual evoked responses. We briefly present them hereafter.
Flicker stimuli of variable frequency (2-90 Hz) elicit a steady state visual evoked response
(SSVER) in the EEG, which is characterized by an oscillation at the same frequency as the
stimulus. Thus, an SSVER can be detected by examining the spectral content of the signals
recorded in the visual region, namely electrodes O1 and O2 of the 10-20 international system
(see Chapter 3, Section 3.2.3).
When actions are associated with targets flickering with different frequencies the subject
can control the BCI by gazing at the target corresponding to the desired action [25, 28, 56,
105]. BCIs based on this principle depend on the subject’s ability to control gaze direction.
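
Examining the spectral content at the flicker frequencies can be sketched as follows. The example assumes a single synthetic occipital signal, a 256 Hz sampling rate and three hypothetical target frequencies; it evaluates a single-bin discrete Fourier transform at each candidate frequency and picks the strongest one, which is one simple possibility and not necessarily the procedure used in [25, 28, 56, 105].

```python
import math

def power_at(signal, freq_hz, fs_hz):
    """Power of the signal at one frequency (single-bin discrete Fourier transform)."""
    re = sum(x * math.cos(2 * math.pi * freq_hz * n / fs_hz) for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * n / fs_hz) for n, x in enumerate(signal))
    return (re * re + im * im) / len(signal) ** 2

def detect_target(signal, candidate_freqs_hz, fs_hz):
    """Return the candidate flicker frequency with the largest power."""
    return max(candidate_freqs_hz, key=lambda f: power_at(signal, f, fs_hz))

fs = 256.0                                   # assumed sampling rate in Hz
n_samples = 512                              # two seconds of signal
candidates = [8.0, 13.0, 15.0]               # hypothetical target frequencies

# Synthetic signal: strong 13 Hz response plus a weaker 8 Hz component.
signal = [math.sin(2 * math.pi * 13.0 * n / fs) + 0.3 * math.sin(2 * math.pi * 8.0 * n / fs)
          for n in range(n_samples)]

print(detect_target(signal, candidates, fs))
```

In practice the decision would also need a rejection rule for the case where the subject is not gazing at any target; that is left out of this sketch.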
The advantage of BCIs based on evoked functional conditions resides in the fact that
little or no training is necessary for a new subject to gain control of the BCI. However, the
communication can be slow because of the averaging that is required to reliably detect an
event related potential [175] and, in the case of an ERP-based BCI, the waiting time for the
relevant stimulus presentation [11, 40]. Furthermore, the amplitude of the evoked response
can diminish over time as a result of the user’s habituation to the stimulus [138].
(training with feedback), the subject is asked to perform a given MA (in the controlling
set) and feedback is provided, indicating the degree to which the BCI could identify the
MA using the model computed in the previous training session. At the end of each training
session, the recognition models are updated. This process is usually repeated many times
in the course of BCI operation. Thus, the BCI is constantly being adapted to the subject
and he can evaluate and improve his performance.
Examples of BCIs functioning under operant conditioning use slow cortical potential
shifts, oscillatory sensorimotor activity, and other hemispheric specialized MAs. We briefly
present such BCIs hereafter.
Slow cortical potential shifts (SCPSs) last from a few hundred milliseconds up to several
seconds and indicate the overall preparatory excitation level of a cortical network. They are
universally present in the human brain. Negative SCPSs are typically associated with movement and other functions
involving cortical activation, while positive SCPSs are usually associated with reduced
cortical activation [18, 142].
In [73], several methods, ranging from low pass filtering to wavelet decomposition, for
the extraction of SCPSs from EEG are described. For online applications low-pass filtering
appears more suitable.
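
As an illustration of why low-pass filtering suits online use, the sketch below applies a causal first-order (single-pole) low-pass filter sample by sample. The filter type, the 1 Hz cutoff and the 100 Hz sampling rate are assumptions chosen for the example; they are not the settings described in [73].

```python
import math

def lowpass(signal, cutoff_hz, fs_hz):
    """First-order (single-pole) IIR low-pass filter.

    Each output depends only on the current input and the previous output,
    so the filter can run sample by sample in an online system.
    """
    dt = 1.0 / fs_hz
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)                   # smoothing factor in (0, 1)
    out, y = [], signal[0]
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out

fs = 100.0                                   # assumed sampling rate in Hz
t = [n / fs for n in range(500)]             # five seconds

slow_shift = [math.sin(2 * math.pi * 0.2 * ti) for ti in t]      # slow potential
fast_rhythm = [math.sin(2 * math.pi * 10.0 * ti) for ti in t]    # 10 Hz oscillation

filtered_fast = lowpass(fast_rhythm, cutoff_hz=1.0, fs_hz=fs)
filtered_slow = lowpass(slow_shift, cutoff_hz=1.0, fs_hz=fs)

print(max(abs(v) for v in filtered_fast[100:]))   # strongly attenuated
print(max(abs(v) for v in filtered_slow[100:]))   # largely preserved
```

A single-pole filter has a gentle roll-off and introduces some delay; a real system would likely use a sharper design, but the causal, per-sample structure is the point of the sketch.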
Subjects can learn through operant feedback to produce an SCPS in an electrically positive
or negative direction for binary control [18, 124]. This skill can be acquired if the
subjects are provided with feedback on the course of their SCPS production and if they
are positively reinforced for correct responses [17, 18].
In [18, 124], the binary control provided by the regulation of SCPSs and semantic
considerations were used to implement a spelling program through which locked-in patients
could communicate at a rate of one word per minute.
Populations of neurons can form complex networks which are at the origin of oscillatory
activity. In general, the frequency of such oscillations decreases with an increase in the
number of synchronized neuronal assemblies [152]. Two types of oscillations are especially
important: the Rolandic mu rhythm in the range from 7 to 13 Hz and the central beta
rhythm above 13 Hz, both originating in the sensorimotor cortex [80]. Sensory stimulation,
motor behavior, and mental imagery can change the functional connectivity within the
cortex and result in an amplitude suppression (event-related desynchronization, ERD) or
in an amplitude enhancement (event-related synchronization, ERS) of mu and central beta
rhythms [125].
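As a hedged illustration, ERD and ERS are commonly quantified as the relative change of band power with respect to a reference interval. The sketch below follows that convention, which is an assumption here and is not stated in this section; the function name `relative_band_power_change` and its arguments are hypothetical.

```python
def relative_band_power_change(reference_power, activity_power):
    """Relative band-power change in percent.

    reference_power: band power in a reference interval (must be > 0).
    activity_power:  band power in the interval of interest.
    A negative result indicates a power decrease (ERD), a positive
    result a power increase (ERS).
    """
    if reference_power <= 0:
        raise ValueError("reference_power must be positive")
    return 100.0 * (activity_power - reference_power) / reference_power
```

For instance, a mu-band power that drops from 4.0 to 3.0 (arbitrary units) corresponds to a change of -25 %, i.e. an ERD.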
Preparation and planning of self-paced hand movement result in a short-lasting
desynchronization (ERD) of Rolandic mu and central beta rhythms. In [32, 160],
electrocorticographic recordings exhibit ERD in the alpha band associated with hand and foot
2.4. BCI architecture and operation 11
movement. The general finding is that similarly to the mu rhythm (around 10 Hz), beta
oscillations desynchronize during the preparation and execution of a motor act [126].
Motor imagery may be seen as mental rehearsal of a motor act without any overt motor
output. It is broadly accepted that mental imagination of movements involves brain
regions/functions similar to those involved in programming and preparing such movements [81].
For example, during the imagination of a right-hand or left-hand movement, an ERD over
the contralateral hand area was found [126]. This ERD is characteristic of the planning or
preparation of a real movement [13].
Thus, the main difference between performance and imagery is that in the latter case
execution would be blocked at some corticospinal level [35]. This observation opens the
possibility of using motor imagery to provide a control option in BCI applications.
In [128], oscillatory sensorimotor activity produced by the imagination of left/right hand
movement and foot movement is used in a virtual keyboard application and to manipulate a
hand orthosis. In related work [176], the vertical movement of a 2D cursor is controlled
by changes in the mu and beta rhythms.
In addition to imagined motor tasks, other mental activities for which evidence of
hemispheric specialization was found are: geometrical MA [116] (e.g. imagination of a geometric
3D object and the rotation of such an object), verbal MA [54, 94] (e.g. mental composition
of a letter), and arithmetic MA [144] (e.g. mental counting, multiplication, etc.).
Few research groups have considered these mental activities for BCI applications. According
to the results reported in [6, 60, 63, 106, 107], the communication bit rates and classification
error percentages are comparable with those of other BCIs. Little attention was given to
these mental activities because they did not seem "natural" for controlling moving objects
(e.g. prostheses, cursor on a computer screen, etc.). However, HSMAs open the possibility
of implementing more control capabilities, and in certain cases they are easier to perform than
imagined motor mental activities.
The categorization that we presented so far is merely conceptual. Indeed, different com-
binations of mental activities and functional conditions can be used in a BCI. For instance,
in [89], an approach to decide on the type of recording (invasive or non-invasive), mental
activities, and operating modes that are best suited for locked-in patients is presented. In
this thesis, we use a combination of oscillatory activities and other hemispheric specialized
MAs in a 2D cursor positioning application (see Chapter 6).
Figure 2.2. Architecture of the BCI system. With respect to the basic design in Fig. 2.1, the brain
activity monitoring block is substituted by a scalp EEG acquisition system, the signal processing
block is subdivided into the preprocessing and feature extraction modules, and the translation block
into the pattern recognition and action generation modules. The actions are displayed on a computer
screen and constitute a feedback to the subject who can modulate his mental activity to make the
BCI accomplish his intents.
modules. The actions being displayed on a computer screen constitute a feedback to the
subject who can modulate his mental activity in order to make the BCI accomplish his
intents.
EEG is recorded by using an array of electrodes which are affixed on the subject’s scalp;
the acquired signals are amplified, digitized and sent to the computer. EEG signals are
analyzed in segments (EEG-trials) of a given duration that depends on the operation mode
(i.e. whether the BCI operates in a synchronous or asynchronous manner) and type of mental
activities. For instance, in an SCPS-based BCI (see Section 2.3.2), the EEG-trial duration
is on the order of eight seconds. Each EEG-trial is preprocessed so as to remove external
(e.g. power line noise) and subject generated perturbations (e.g. ocular and muscular
movement artifacts). EEG-trials (free of perturbations) are sent to the feature extraction
module which extracts statistical measurements that are relevant for the recognition of the
MAs in the controlling set, and groups them into a vector (feature vector) which is in turn
sent to the pattern recognition module. The latter computes scores that indicate the like-
lihoods that the feature vector was produced during the performance of each of the MAs
in the controlling set.
Since each MA defines, in the feature vector space, a subset of feature vectors produced
during the performance of such MA, the scores determined by the pattern recognition
module characterize the membership, with respect to each MA subset, of the current feature
vector. Such scores are grouped into a vector of memberships that is sent to the action
generation module which decides on the action that the BCI executes. Such an action
depends on the BCI application and on the action rules. For instance in [128, 175], the
action taken corresponds to the MA associated with the highest membership score while
in [63], the action depends on the vector of memberships and on past actions. The time
length between two consecutive actions is the action period (see Fig. 2.3).
Figure 2.3. BCI operation. EEG signals are analyzed in segments (EEG-trials) of a given duration
(EEG-trial duration) that depends on the operation mode and type of mental activities. Each
EEG-trial is preprocessed in order to remove the external noise and detect artifacts. Then, relevant
features are computed and grouped into a feature vector which is used to determine the likelihoods
that the EEG-trial was generated during the performance of each MA in the controlling set. Finally,
an action is executed that depends on the BCI action rules. Usually [128, 176], the unique action rule
consists in executing the most likely MA. Other approaches consider the past actions as well [63].
The action rules depend on the BCI operation mode. In synchronous or cue-based
BCIs [18, 128, 176], the system is active, i.e. generates actions, during some "windows
of opportunity" defined by the operator and the subject. Conversely, in asynchronous
BCIs [19, 63, 107], the system is always active and a neutral action is executed when the
current feature vector is considered as not belonging to any of the MAs in the controlling
14 Chapter 2. Conceptual framework and state of the art
set. Clearly, the latter approach is more suitable for real applications. Indeed, an action
should be executed only in response to one of the MAs in the controlling set. At present, by
adequately adjusting the recognition parameters, it is possible to make the BCI generate an
action only when a certain level of confidence in the recognition exists. However, it
remains difficult to ensure adequate functioning when the subject is simultaneously engaged
in other activities (e.g. speaking).
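The asynchronous action rule described above (most likely MA, or a neutral action when the confidence is too low) can be sketched as follows. This is a minimal illustration and not the rule used in this thesis or in the cited systems: the function name `select_action`, the label "neutral", the assumption that scores lie in [0, 1], and the threshold value are all hypothetical.

```python
def select_action(memberships, threshold=0.6, neutral="neutral"):
    """Pick the action for one feature vector.

    memberships: dict mapping each MA label to its membership score,
                 assumed here to lie in [0, 1].
    threshold:   hypothetical confidence level below which no MA
                 is accepted and the neutral action is executed.
    """
    # MA with the highest membership score.
    best_ma = max(memberships, key=memberships.get)
    if memberships[best_ma] < threshold:
        return neutral
    return best_ma
```

Raising `threshold` makes the BCI act less often but with higher confidence, which is the kind of parameter adjustment mentioned in the text.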
Current BCIs can be described by the general architecture in Fig. 2.2. In the follow-
ing, we discuss the implementations of each module, the BCI applications, and evaluation
criteria, in the framework of existing BCIs.
2.5 Preprocessing
The preprocessing module removes the external noise from EEG-trials and detects the pres-
ence of artifacts. In this thesis, the term noise refers to external perturbations, e.g. power
line noise, and artifact to subject generated perturbations, e.g. muscular and ocular artifacts
(see Chapter 3, Section 3.3). In general, the EEG-trials containing artifacts are discard-
ed [70, 87, 101, 176] because the relevant information contained in the trial is masked by the
artifact. Indeed, at frontal, temporal, and occipital locations particularly, ocular artifacts
can exceed EEG [31, 68, 71] in amplitude.
Furthermore, the presence of artifacts can lead to misleading conclusions about the
subject’s ability to control the BCI. Indeed, the subject might be (voluntarily or not)
controlling the BCI by generating artifacts [167].
In [107] it is suggested that artifacts do not need to be identified because the BCI
is trained to recognize the MAs in the controlling set and consequently it automatically
rejects artifacts. In this thesis, the artifacts are detected and special actions are generated to
indicate to the subject whether an ocular or muscular artifact was detected (see Chapter 3).
Thus, the subject can auto-regulate the artifacts he produces.
Electromagnetic and EEG equipment noise are narrow-band signals. Thus, removing
them through hardware or software filtering is straightforward. Typically [102, 106, 128],
EEG signals are filtered in the 0.5-40 Hz frequency band, i.e. the effective EEG frequency
support. As power line and other electromagnetic noise sources have frequency supports
beyond 40 Hz, such filtering removes most of this noise.
It is worth mentioning that while in the BCI framework they are treated as artifacts,
muscular and eye movements are used as information support in other human-machine
interaction systems [8, 158].
In this thesis, the power line noise is removed through notch filtering and the artifacts
are detected by adapting the outlier detection framework presented in [157].
relevant for the recognition of the MAs in the controlling set. Such measurements are
grouped into a feature vector.
Thus, the feature extraction module maps the EEG-trial set into a feature space. The
mapping properties are determined by the type of features (see Chapter 4). For instance,
a mapping can be defined by the coefficients of an autoregressive model [6, 123, 146] fitted
to the EEG-trial, the synchronization coefficients [16], or the powers in different frequency
bands [128, 176]. It appears that the mappings achieving the best recognition performance
in each MA in the controlling set are different (see Chapter 6). Thus, for optimal
operation, a BCI should support several mappings. In [41], low recognition errors are obtained
through the combination of mappings for the off-line classification of EEG-trials recorded
during the imagination of left and right hand finger movements.
Most of the current BCIs use features based on the parametric and nonparametric
spectral representations of EEG signals. In fact, such methods were extensively used to
analyze EEG signals recorded during sleep, cognitive functions, epilepsy, and other clinical
applications [78].
Nonparametric spectral representations are obtained through the discrete Fourier
transform [114]. In [126], the powers in the alpha and beta bands at electrodes located near the
motor cortex are used to discriminate between EEG-trials produced during the imagination
of left and right index finger movements. In [132, 175], the powers in the alpha band
at frontal, central and occipital electrodes are used in a 1D cursor positioning application.
In [107, 106], the powers in 2 Hz wide frequency bands from 8 to 30 Hz at every electrode
are used as features to discriminate between five MAs, namely relaxing, imagination of left
and right hand movements, rotation of a cube, and arithmetic.
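The band-power features mentioned above can be sketched as follows for a single channel. This is only an illustration under stated assumptions: the function name `band_powers`, the periodogram normalization, and the half-open band limits are hypothetical choices, and a direct DFT is used instead of an FFT to keep the example dependency-free.

```python
import cmath
import math

def band_powers(signal, fs, f_low=8.0, f_high=30.0, width=2.0):
    """Power of one EEG channel in consecutive bands of `width` Hz.

    signal: list of samples, fs: sampling frequency in Hz.
    Returns a list of (band_start_hz, power) pairs covering
    [f_low, f_high). Uses a direct O(N^2) DFT, which is adequate
    for short EEG-trials.
    """
    n = len(signal)
    # One-sided periodogram: |X(k)|^2 / n for k = 0 .. n/2.
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(signal))
        spectrum.append(abs(x_k) ** 2 / n)
    powers = []
    start = f_low
    while start < f_high:
        stop = start + width
        # DFT bin k corresponds to the frequency k * fs / n.
        power = sum(v for k, v in enumerate(spectrum)
                    if start <= k * fs / n < stop)
        powers.append((start, power))
        start = stop
    return powers
```

Applying the function to every electrode and concatenating the results gives a feature vector of the kind described in the text; with the default arguments each channel contributes eleven values.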
Parametric spectral representations include: autoregressive (AR) [123] and autoregressive-
moving-average (ARMA) [87] models for each EEG channel, and multivariate autoregressive
models that characterize all the channels simultaneously [6].
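As an illustration of a parametric representation, the coefficients of a single-channel AR model can be estimated from the sample autocorrelation. The sketch below uses the Yule-Walker equations solved with the Levinson-Durbin recursion; this estimation method is an assumption made for the example (the section does not state which estimator the cited works use), and the function name `ar_coefficients` is hypothetical.

```python
def ar_coefficients(signal, order):
    """Estimate a_1..a_p of the model x(n) = sum_j a_j x(n-j) + e(n).

    Uses the biased sample autocorrelation and the Levinson-Durbin
    recursion (Yule-Walker estimate). Returns [a_1, ..., a_p].
    """
    n = len(signal)
    mean = sum(signal) / n
    x = [v - mean for v in signal]
    # Biased autocorrelation estimates r(0) .. r(order).
    r = [sum(x[i] * x[i + k] for i in range(n - k)) / n
         for k in range(order + 1)]
    coeffs = []
    error = r[0]  # prediction error power of the order-0 model
    for k in range(1, order + 1):
        acc = r[k] - sum(coeffs[j] * r[k - 1 - j] for j in range(k - 1))
        reflection = acc / error
        # Update the lower-order coefficients, then append the new one.
        coeffs = [coeffs[j] - reflection * coeffs[k - 2 - j]
                  for j in range(k - 1)] + [reflection]
        error *= 1.0 - reflection * reflection
    return coeffs
```

The returned list can be used directly as the feature vector of the corresponding EEG channel.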
The above mentioned approaches require EEG to be stationary. Since stationarity is not
necessarily satisfied for EEG signals, alternative approaches which do not require
stationarity were considered. Thus, in [129], adaptive autoregressive models are reported to
yield lower recognition errors than the analysis of the powers in the alpha and beta
bands. In [57, 58, 59, 60, 61, 62], we used time-frequency analysis to obtain features for the
recognition of imagined left and right index finger movements, and mental multiplication
in a cursor positioning application.
In [110], a set of space filters, optimized to discriminate between EEG-trials generated
during the imagination of left and right index finger movements, is designed using the
eigenvalue feature extraction method [53]. This method basically consists in simultaneously
diagonalizing the mean autocorrelation matrices of EEG-trials recorded during the
performance of each MA. By using this method in different frequency bands, we derive
in [63] a set of space frequency filters to characterize each MA in the controlling set. Fur-
thermore, in [60] we combine the diagonalization of the mean autocorrelation matrices with
the analysis of time-frequency correlations to derive a set of features to discriminate be-
tween three mental activities, namely imagined left and right index finger movements and
mental counting. To handle the continuous adaptation to possible changes in the user’s
EEG dynamics, the joint diagonalization of the mean autocorrelation matrices needs to be
done periodically.
Nonlinear feature extraction methods have been used only sporadically in the BCI context.
In [162], the phase space reconstruction of the chaotic attractor associated with signals
recorded at electrodes O1 and O2 is used to discriminate between three cognitive mental
states, namely eyes open subject alert, eyes closed subject alert, and eyes closed subject
performing visualization tasks. Phase synchronization [16] measures appear to be adequate
for the recognition of certain mental states from EEG. Furthermore, in [161], measures of
the EEG complexity are used to provide 1D control.
In this thesis, we use several mappings from the EEG-trial set into different feature
spaces. Such mappings are defined, for example, by the powers in certain frequency bands,
autoregressive coefficients, coherence values, and synchronization. For each MA we select
the mapping that produces the lowest recognition error (see Chapters 4 and 6).
In [63], we use an action rule in which the transition from one action to another depends
on the probability of confusion between these actions. For large confusion probabilities
the transition is done only if the subject confirms it a sufficient number of times. By
improving his performance (i.e. decreasing the confusion probability), the subject can make
the transition faster.
In this thesis, an action has an associated strength coefficient (see Chapter 6, Section 6.4)
which depends on the level of confidence with which the corresponding MA is recognized.
The larger this level, the larger the strength with which the action is executed.
2.10 Evaluation
The BCI performance can be evaluated as: 1) speed and accuracy in specific applications
and 2) theoretical performance measured as an information transfer rate. The information
transfer rate, as defined in [151], is the amount of information communicated per unit of
time. This parameter encompasses speed and accuracy in a single value. The bit rate
can be used for comparing different BCI approaches and for the measurement of system
improvements [167].
The bit rate (in bits per minute) [46, 177] for a BCI with N mental activities in its
controlling set, mean accuracy pa (i.e. 1 − pa is the mean recognition error), and action
period Tact (in seconds), is:
\[ \text{Bit rate} = \frac{60}{T_{act}} \left[ \log_2 N + p_a \log_2 p_a + (1 - p_a) \log_2 \frac{1 - p_a}{N - 1} \right] \]
In Fig. 2.4 we depict the bit rate (in bits per action period) for some typical values of N.
Obviously, these curves make sense for values of pa that are larger than 1/N, i.e. the chance
threshold.
Figure 2.4. Bit rate in bits per action period, for N = 2, 3, 4 and 5. For a BCI with N mental
activities in its controlling set and mean accuracy pa, the information transfer, in bits per action
period, is: \( \log_2 N + p_a \log_2 p_a + (1 - p_a) \log_2 \frac{1 - p_a}{N - 1} \).
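The bit rate formula can be evaluated numerically as sketched below. The function names `bits_per_action` and `bit_rate_per_minute` are hypothetical; the treatment of the limiting case pa = 1, where the last term tends to zero, is made explicit in the code.

```python
import math

def bits_per_action(n_mas, p_a):
    """Information transferred per action period, in bits.

    n_mas: number N of mental activities in the controlling set.
    p_a:   mean accuracy, 0 < p_a <= 1.
    """
    if n_mas < 2 or not 0.0 < p_a <= 1.0:
        raise ValueError("need N >= 2 and 0 < p_a <= 1")
    bits = math.log2(n_mas) + p_a * math.log2(p_a)
    if p_a < 1.0:
        # (1 - p_a) * log2((1 - p_a) / (N - 1)) tends to 0 as p_a -> 1.
        bits += (1.0 - p_a) * math.log2((1.0 - p_a) / (n_mas - 1))
    return bits

def bit_rate_per_minute(n_mas, p_a, t_act):
    """Bit rate in bits per minute for an action period t_act in seconds."""
    return 60.0 / t_act * bits_per_action(n_mas, p_a)
```

With N = 2 and perfect accuracy each action carries exactly one bit, so an action period of 0.5 s gives 120 bits per minute; at the chance level pa = 1/N the transferred information is zero.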
2.12 Summary
In this chapter we have presented the general architecture of a brain-computer interface and
considered the possible choices in terms of brain activity monitoring and types of mental
activities.
Because of its relative simplicity, low cost, and high time resolution, scalp recorded
EEG constitutes the most used brain monitoring method in current BCIs. The choice of
mental activities and of the conditions (evoked and operant) under which they are executed
was inspired by the results from brain hemispheric specialization studies and behavioral and
cognitive psychological methods.
We focused our attention on scalp recorded EEG based BCIs that are controlled by
mental activities performed under operant functional conditions. The detailed architecture,
operating protocol, and implementation of current BCIs were then discussed. In particular,
we considered: the implementation of the preprocessing, extraction of relevant features
from EEG-trials, recognition algorithms, training protocols, and the rules that govern the
execution of actions by the BCI. In Table 2.1, we report the main features of current BCI
implementations¹.
¹ Some systems were not included because their published descriptions did not contain enough
information about their implementations or choice of parameters.
In the next chapters we present our implementation for each module of the general
architecture presented here.
Table 2.1. Comparison of current BCI systems.

ABI Project, European Union JRC [106, 107]
  MAs: Relax, imagination of left and right hand movement, cube rotation, and subtraction
  EEG-trial duration / action period (ms): 1000/500
  Electrodes: F3, F4, C3, Cz, C4, P3, Pz, P4
  Features: Power in 2 Hz wide bands from 8 to 30 Hz
  Recognition algorithm: Neural network
  Application: Asynchronous control of a mobile robot
  Number of subjects: Five
  Bit rate: 33 bits/min (max)
  Training time: Days

EPFL, Switzerland [63]
  MAs: Imagined left and right finger movements, mental counting, and object rotation
  EEG-trial duration / action period (ms): 2000/500
  Electrodes: Fp1, Fp2, F7, F3, F4, F8, T3, C3, C4, T4, T5, P3, P4, T6, O1, O2
  Features: Several types of feature vectors (see Chapter 4)
  Recognition algorithm: Online kernel based algorithm (Chapter 5)
  Application: Asynchronous 2D object positioning
  Number of subjects: Six
  Bit rate: 25 bits/min (avg), 35 bits/min (max)
  Training time: Days

Neil Squire Foundation, Canada [19, 101]
  MAs: Recognition of movement imagination against other MAs
  EEG-trial duration / action period (ms): 1000/62.5
  Electrodes: Bipolar recordings: F1-FC1, Fz-FCz, F2-FC2, FC1-C1, FCz-Cz, FC2-C2
  Features: Bi-scale wavelength analysis
  Recognition algorithm: One-nearest neighbor classifier
  Application: Asynchronous switch
  Number of subjects: Seven
  Bit rate: 51 bits/min (max)
  Training time: Weeks

Technical University of Graz
  MAs: Imagination of left and right hand, and foot movements
  Electrodes: 2 electrodes 2.5 cm anterior and posterior to electrode po…
  Application: Synchronous virtual keyboard, hand orthosis control, …
  Training time: Days

Tsinghua University, China [28, 181]
  MAs: Steady state visual evoked response
  EEG-trial duration / action period (ms): 3000/3000
  Electrodes: O1 and O2
  Features: Identification of the peaks in the spectrum, corresponding to the desired choice frequency
  Application: Synchronous selection of targets on a panel for environmental control
  Number of subjects: Thirteen
  Bit rate: 27 bits/min (avg)
  Training time: Minutes

University of Illinois, USA [40, 46]
  MAs: P300 component of the event related potentials
  EEG-trial duration / action period (ms): 1500/1500
  Electrodes: Fz, Cz, Pz, O1, O2
  Features: Averaging
  Recognition algorithm: Thresholding
  Application: Synchronous 6 × 6 virtual keyboard
  Number of subjects: Ten
  Bit rate: 9 bits/min
  Training time: Minutes

University of Rochester, USA [11, 12]
  MAs: P300 component of the event related potentials
  EEG-trial duration / action period (ms): 1600/1600
  Electrodes: Fz, Cz, Pz, P3, P4
  Features: Averaging
  Recognition algorithm: Thresholding
  Application: Synchronous control of five elements in a virtual apartment
  Number of subjects: Nine
  Bit rate: 12 bits/min (avg)
  Training time: Minutes

University of Tübingen, Germany [18, 73]
  MAs: Control of slow brain potentials
  EEG-trial duration / action period (ms): 8000/8000
  Electrodes: Fz, Pz, Cz
  Features: Low-pass filtering
  Recognition algorithm: Thresholding
  Application: Synchronous on/off switch
  Number of subjects: Eleven locked-in patients
  Bit rate: 6 bits/min (avg)
  Training time: Months

Wadsworth center, USA [103, 104, 176]
  MAs: Mu and beta rhythm modulation
  EEG-trial duration / action period (ms): 200/100
  Electrodes: 64 EEG electrodes
  Features: Power in the mu and beta band
  Recognition algorithm: Linear classifier
  Application: Synchronous 2D positioning of a cursor
  Number of subjects: Eight
  Bit rate: 22.5 bits/min (avg)
  Training time: Weeks
3 Acquisition and Preprocessing
“Nature uses as little as possible of anything” Johannes Kepler
3.1 Introduction
In the previous chapter we presented the general architecture of a BCI based on scalp record-
ed electroencephalogram, and discussed different implementations and operation modes. In
this chapter we present a review of the physiological principles of electroencephalography,
the recording procedure, and the methods we use to remove external noise and detect ar-
tifacts. A more detailed description of electroencephalography and related fields can be
found in [115].
The extraction of information from EEG data is hindered by external noise and sub-
ject generated artifacts. Most sources of external noise can be avoided by appropriately
controlling the environment in which the measurement takes place. In addition, power line noise
can be easily filtered since it occupies a narrow frequency band that is located beyond the
EEG band.
Subject generated artifacts (eye movements, eye blinks and muscular activity) can pro-
duce voltage changes of much higher amplitude than the endogenous brain activity. Even
when artifacts are not correlated with tasks, they make it difficult to extract useful informa-
tion from the data. In this situation the data are discarded and the subject is notified by a
special action executed by the BCI. If the data containing artifacts were not discarded they
could lead to misleading conclusions about the controlling performance of a subject. For
instance, a subject could (voluntarily or not) be controlling the BCI by producing artifacts.
Figure 3.1. Origins of the rhythmic activity observed in EEG signals. The signal recorded at
a particular electrode is composed of rhythms whose frequencies are visible in the signal power
spectral density. These rhythms are produced by neuronal oscillators whose natural frequencies are
determined by their internal cytoarchitecture.
The generators of electric fields that can be registered with scalp electrodes are groups of
neurons with uniformly oriented dendrites. Neurons communicate with each other by send-
ing electrochemical signals from the synaptic terminal of one cell to the dendrites of other
cells. These signals affect dendritic synapses, inducing excitatory and inhibitory postsynap-
tic potentials [44, 174]. The EEG is a result of the summation of potentials derived from
the mixture of extracellular currents generated by populations of neurons. Hereby the EEG
depends on the cytoarchitectures of the neuronal populations, their connectivity, including
feedback loops, and the geometries of their extracellular fields. The main physical sources
of scalp potentials are the pyramidal cells of cortical layers III and V¹.
The appearance of EEG rhythmic activity in scalp recordings results from the coordi-
nated activation of groups of neurons, whose summed synaptic events become sufficiently
large. The rhythmic activity may be generated both by pacemaker neurons having the in-
herent capability of rhythmic oscillations, and by neurons which cannot generate a rhythm
on their own but can coordinate their activity through excitatory and inhibitory connections
in such a manner that they constitute a network with pacemaker properties. The latter
may be designated as neuronal oscillators [174]. The oscillators have their own discharge
frequency (Fig. 3.1) which depends on their internal connectivity. The neuronal oscillators
start to act in synchrony after application of external sensory stimulation or hidden signals
from internal sources, e.g. resulting from cognitive loading.
¹ The brain cortex is composed of six layers, namely molecular layer (I), external granular layer (II),
external pyramidal layer (III), internal granular layer (IV), internal pyramidal layer (V) and polymorphic
or multiform layer (VI).
3.2. An overview of electroencephalography 27
The usual classification of the main EEG rhythms based on their frequency ranges is as
follows: delta (2 to 4 Hz), theta (4 to 8 Hz), alpha (8 to 13 Hz), beta (13 to 30 Hz), and
gamma (higher than 30 Hz). However, this classification only partially reflects the functional
variation of rhythmic activities. For example, EEG rhythms within the alpha range may be
distinguished by their dynamics, place of generation and relation to certain behavioral acts.
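The frequency ranges listed above can be encoded as a small lookup, sketched below. The name `eeg_band` is hypothetical, and treating each range as half-open (lower limit included, upper limit excluded, with 30 Hz counted as gamma) is an assumption, since the text does not say to which band a boundary frequency belongs.

```python
# Band limits in Hz as listed in the text: (name, lower, upper).
EEG_BANDS = [
    ("delta", 2.0, 4.0),
    ("theta", 4.0, 8.0),
    ("alpha", 8.0, 13.0),
    ("beta", 13.0, 30.0),
]

def eeg_band(freq_hz):
    """Return the name of the EEG rhythm containing freq_hz, or None.

    Frequencies below 2 Hz fall outside the listed ranges and give None.
    """
    for name, low, high in EEG_BANDS:
        if low <= freq_hz < high:
            return name
    if freq_hz >= 30.0:
        return "gamma"
    return None
```

As the text points out, such a frequency-only classification is only a first approximation, since rhythms within one range can differ in dynamics, origin, and functional role.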
Since the pioneering work of Hans Berger in 1929 [15], the main EEG rhythm (the alpha
one) has been known. This rhythm is typical of a resting condition and disappears when the
subject perceives a sensory signal or when he makes mental efforts. It was shown that the
alpha rhythm is generated by reverberating propagation of nerve impulses between cortical
neuronal groups and some thalamic nuclei, interconnected by a system of excitatory and
inhibitory connections and resulting in rhythmic discharges of large populations of cortical
neurons [33].
The theta rhythm originates from interactions between cortical and hippocampal neu-
ronal groups [108]. It appears in periods of emotional stress and during rapid-eye-movement
sleep.
The delta rhythm appears during deep sleep, anesthesia, and is also present during
various meditative states involving willful and conscious focus of attention in the absence
of other sensory stimuli [48].
The neuronal oscillators that generate the beta rhythm are presumably located inside
the cortex [33]. The beta rhythm is typical of periods of intense activity of the nervous
system and occurs principally in the parietal and frontal regions.
The basis for gamma oscillations is interneuronal feedback with quarter-cycle phase lags
between neurons situated close to each other in local areas of the cortex [51]. It is thought
that gamma oscillations are associated with attention, perception and cognition.
Most of the rhythms are rather widespread in brain structures. Induced gamma, theta
and alpha rhythms were found in cortex, hippocampus, thalamus, and brain stem. In [50],
the expression “common modes” was used for the existence of similar rhythms in various
networks of the brain. This may play a role in the integration of activities of neuronal
oscillators distributed over various brain structures. The candidate mechanism for such
integration is the coordination of the activity of distant neuronal oscillators. The coordination
concept (see Chapter 4 for a mathematical treatment) encompasses the interaction in time
(as measured by the correlation function), frequency (as measured by the coherence func-
tion), time-frequency (as measured by the ambiguity function), and phase (as measured by
the synchronization function).
The analysis of EEG rhythms and their interactions provides indices that are correlated
with mental states such as attention [65], memory encoding [156], motor imagery [7, 128,
176] and perception/recognition [159].
Figure 3.2. Electrode placement according to the 10-20 international system. Even numbers
indicate electrodes located on the right side of the head and odd numbers indicate electrodes on
the left side. Capital letters are used to reference each cortical zone, namely frontal (F), central
(C), parietal (P), temporal (T), and occipital (O). Fp and A stand for frontal pole and auricular
respectively. The designation 10-20 comes from the percentage ratio of the inter-electrode distances
with respect to the nasion-inion distance.
other sources. The perturbation sources include: electromagnetic interferences, eye blinks,
eye movements and muscular activity (particularly head muscles). While the terms “noise”
and “artifact” are often used interchangeably, in this thesis the term noise is used for
external perturbations (e.g. power line noise) and artifact for subject related perturbations
(e.g. muscular and eye movement artifacts).
• Eye blink and eye movement artifacts. Eye blink artifacts are very common in EEG
data; they produce low-frequency high-amplitude signals that can be considerably larger
than the EEG signals of interest (see Fig. 3.3c). Indeed, while regular EEG amplitudes
are in the range of -50 to 50 microvolts, eye blink artifacts have amplitudes up to 100
microvolts.
Eye movement artifacts are caused by the reorientation of the retinocorneal dipole [121].
They are recognized by their quasi square shape and their amplitude in the range of
that of regular EEG [121].
Eye blink and eye movement artifacts (henceforth called ocular artifacts) often occur
at close intervals as shown in Figure 3.3c. They are mainly reflected at frontal sites
(e.g. electrodes Fp1, Fp2) although they can corrupt data on all electrodes, even those
at the back of the head.
[Figure 3.3 panels, each showing a signal (amplitude [µV] versus time [s]) and its power spectral density (versus frequency [Hz]): (b) Power line noise interference (electrode O1); (c) Eye movement and eye blink artifacts (electrode Fp1); (d) Muscular artifact (electrode T3).]
Figure 3.3. EEG signals perturbed by noise and artifacts and their corresponding power spectral
densities (PSD) (a): clean EEG signal recorded at electrode T3. (b): EEG signal, recorded at
electrode O1, perturbed by power line noise. The corresponding PSD shows clearly the perturbation
at 50 Hz. (c): Signal recorded at electrode Fp1 containing an eye movement (left) and eye blink
artifacts (right). The corresponding PSD reveals a concentration of the power in the theta band
(4-8 Hz). (d): Signal recorded at electrode T3, containing a muscular movement artifact. The
corresponding PSD shows that the power is concentrated in the beta band (13 to 30 Hz).
As mentioned in Section 3.2.3, the raw EEG-trial is first re-referenced with respect to
the average of the EEG channels. In addition, the time average of every EEG channel is
subtracted from the corresponding EEG channel. Therefore, the following relations hold.
\[ \sum_{n=0}^{N_{spt}-1} s_m(n) = 0, \qquad m = 1, \ldots, N_e \]
\[ \sum_{m=1}^{N_e} s_m(n) = 0, \qquad n = 0, \ldots, N_{spt} - 1 \]
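The two zero-sum relations can be checked numerically with the following sketch, which re-references an EEG-trial to the channel average and then removes the time average of every channel. The function name `rereference_and_center` and the list-of-lists data layout are hypothetical.

```python
def rereference_and_center(trial):
    """Average re-referencing followed by per-channel mean removal.

    trial: list of Ne channels, each a list of Nspt samples.
    Returns a new list of channels of the same shape.
    """
    n_e = len(trial)
    n_spt = len(trial[0])
    # Re-reference: subtract the average over channels at each sample.
    avg = [sum(ch[n] for ch in trial) / n_e for n in range(n_spt)]
    reref = [[ch[n] - avg[n] for n in range(n_spt)] for ch in trial]
    # Subtract the time average of every channel.
    centered = []
    for ch in reref:
        mean = sum(ch) / n_spt
        centered.append([v - mean for v in ch])
    return centered
```

The second step does not undo the first: after re-referencing, the channel time averages themselves sum to zero, so subtracting them leaves the sum over channels at each sample equal to zero.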
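\[ H_n(z) = \frac{1 + a_2 - 2 a_1 z^{-1} + (1 + a_2) z^{-2}}{1 - a_1 z^{-1} + a_2 z^{-2}} \qquad (3.1) \]
where
\[ a_1 = \frac{2 \cos\left(\frac{2 \pi f_n}{f_s}\right)}{1 + \tan\left(\frac{\pi \beta_n}{f_s}\right)}, \qquad a_2 = \frac{1 - \tan\left(\frac{\pi \beta_n}{f_s}\right)}{1 + \tan\left(\frac{\pi \beta_n}{f_s}\right)} \]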
fn is the notch frequency at which there is no transmission through the filter, and fs is the
sampling frequency. Within the frequency band centered at fn and of width βn (3-dB band)
all signal components are attenuated by more than 3 dB. The smaller βn the lower the
attenuation of the notch frequency (see Fig. 3.4).
To determine the tradeoff between the width of the 3-dB band and the attenuation of
the notch frequency, we estimate the power line noise level by measuring the signals coming
from the electrodes before the conducting gel was applied. Depending on this level we select
the adequate value of βn using the graph depicted in Figure 3.4b.
If no artifact is detected in the raw EEG-trial S̃, the rows of the preprocessed EEG-
trial S (that is sent to the feature extraction module) are obtained through the difference
equation (which is obtained directly from Eq. 3.1):
\[ s_m(n) - a_1 s_m(n-1) + a_2 s_m(n-2) = (1 + a_2)\,\tilde{s}_m(n) - 2a_1\,\tilde{s}_m(n-1) + (1 + a_2)\,\tilde{s}_m(n-2) \tag{3.2} \]
for m = 1, . . . , Ne .
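As an illustration, the coefficients of Eq. 3.1 and the recursion of Eq. 3.2 can be sketched as follows. This is a minimal sketch, not the implementation used in this work: the function names, the sampling frequency of 512 Hz, the value βn = 1 Hz and the zero initial conditions are assumptions made for the example.

```python
import math

def notch_coefficients(f_n, beta_n, f_s):
    """Return (a1, a2) of Eq. 3.1 for notch frequency f_n and 3-dB width beta_n (in Hz)."""
    t = math.tan(math.pi * beta_n / f_s)
    a1 = 2.0 * math.cos(2.0 * math.pi * f_n / f_s) / (1.0 + t)
    a2 = (1.0 - t) / (1.0 + t)
    return a1, a2

def notch_filter(raw, a1, a2):
    """Apply the difference equation (3.2) to one channel, assuming zero initial conditions."""
    out = []
    for n, x0 in enumerate(raw):
        x1 = raw[n - 1] if n >= 1 else 0.0
        x2 = raw[n - 2] if n >= 2 else 0.0
        y1 = out[n - 1] if n >= 1 else 0.0
        y2 = out[n - 2] if n >= 2 else 0.0
        out.append((1.0 + a2) * x0 - 2.0 * a1 * x1 + (1.0 + a2) * x2
                   + a1 * y1 - a2 * y2)
    return out

f_s = 512.0                                   # assumed sampling frequency [Hz]
a1, a2 = notch_coefficients(50.0, 1.0, f_s)   # 50 Hz notch, assumed beta_n = 1 Hz

# A pure 50 Hz sinusoid should vanish once the filter transient has died out.
hum = [math.sin(2.0 * math.pi * 50.0 * n / f_s) for n in range(8 * 512)]
residual = max(abs(v) for v in notch_filter(hum, a1, a2)[-512:])

# Response to a constant input (gain at 0 Hz) after the transient.
dc_gain = notch_filter([1.0] * 4096, a1, a2)[-1]
```

With the coefficients exactly as written in Eq. 3.1, the gain away from the notch (for instance at 0 Hz) evaluates to 2; the sketch keeps the equation unchanged and only reports this value in `dc_gain`.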
[Figure 3.4: (a) Notch filter transfer function (modulus) in dB versus frequency [Hz] (48 to 52 Hz),
for βn = 0.2 Hz, 1 Hz and 2 Hz. (b) Attenuation of the notch frequency in dB versus βn [Hz]
(0 to 2 Hz); the value of βn corresponding to an attenuation level of 35 dB is marked on the
βn axis (tick at 0.85 Hz).]
Figure 3.4. Notch filter characteristics. (a): Modulus of the notch filter (centered at the power line
frequency, i.e. 50 Hz) transfer function. (b): The attenuation of the notch frequency increases with
the width of the 3-dB band, βn . The power line noise should be estimated in order to select the
adequate value of βn .
Ocular artifacts have large amplitudes, their spectral content is mainly concentrated in
the theta band, and they are more prominent at the frontal pole electrodes, i.e. Fp1 and Fp2. As
can be seen in Fig. 3.5, the time-frequency representation of a signal containing a series of
ocular artifacts exhibits an abnormal concentration of the power in the theta band when
ocular artifacts appear.
Muscular artifacts have amplitudes of the same order as those of regular EEG, but their spectral
content is concentrated in the beta band. These artifacts are more noticeable at central
temporal and parietal electrodes, i.e. electrodes T3, T4, T5, P3, P4 and T6 [164]. As
depicted in Fig. 3.6, the time-frequency representation of a signal containing a muscular
artifact reveals the presence of the artifact by exhibiting an abnormal concentration of the
power in the beta band.
Artifacts can be considered as singular events in the time-frequency plane that appear
randomly in EEG signals. To detect the presence of artifacts in an EEG-trial we divide it
into one-second long segments (that overlap by 500 milliseconds) and check if an artifact
is present in any of the segments. For instance, if the EEG-trial is 1500 milliseconds long,
two segments are considered, namely from zero to 1000 milliseconds and from 500 to 1500
milliseconds.
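The segmentation rule can be sketched as follows; the function name and the choice of expressing all durations in milliseconds are assumptions made for the example.

```python
def adb_bounds(trial_ms, block_ms=1000, hop_ms=500):
    """Return the (start, end) times, in ms, of the one-second segments of a trial.

    Consecutive segments overlap by block_ms - hop_ms milliseconds.
    """
    bounds = []
    start = 0
    while start + block_ms <= trial_ms:
        bounds.append((start, start + block_ms))
        start += hop_ms
    return bounds

print(adb_bounds(1500))  # [(0, 1000), (500, 1500)]
```

A trial is then declared contaminated as soon as an artifact is detected in any of the returned segments.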
The detection of an artifact in a one-second long segment (we call it artifact detection
block ADB) is based on the following two facts. First, an ocular artifact implies that the
power spectral densities of the signals at electrodes Fp1 and Fp2 are concentrated in the
theta band and second, a muscular artifact at a given electrode makes its power spectral
density concentrated in the beta band.
[Figure 3.5: top: signal versus time [s] (0 to 10 s); bottom: signal spectrogram, frequency [Hz]
(0 to 50 Hz) versus time [s], color scale in dB.]
Figure 3.5. Top: Signal at electrode Fp1 containing three ocular artifacts delimited by the dashed
lines. There is a considerable difference of amplitudes between the first and third artifact and the
clean part of the signal. However, the amplitudes present in the second artifact are in the range
of that of the clean part. A simple threshold on the signal amplitude is therefore insufficient to
reliably detect the ocular artifacts. Bottom: Time-frequency representation of the signal. The
times at which the ocular artifacts appear are characterized by a concentration of the signal power
in the theta band. Thus, the frequency domain constitutes a good candidate to host the detection
of ocular artifacts. Furthermore, it is important to note that an ocular artifact generally implies a
strong correlation between the signals recorded at electrodes Fp1 and Fp2. Therefore, we take into
account the frequency content of both electrodes in the detection procedure.
The time-frequency representation was obtained using the short-time Fourier transform [30], which
breaks the signal into chunks (which usually overlap each other) and computes the Fourier transform
of each chunk.
3.4. EEG preprocessing 35
[Figure 3.6: top: signal versus time [s] (0 to 4 s); bottom: signal spectrogram, frequency [Hz]
(0 to 80 Hz) versus time [s], color scale in dB.]
Figure 3.6. Top: Signal at electrode T3 containing a muscular artifact which is delimited by
the dashed line. The difference between the signal amplitudes in the clean part and those in the
muscular artifact is not as important as in the case of ocular artifacts. Bottom: Time-frequency
representation of the signal. The signal power is concentrated in the beta band at the periods in
which the artifact is present. As in the case of ocular artifacts, the frequency domain appears as
more suitable than the time domain to host the detection of muscular artifacts. Muscular artifacts
are more noticeable in temporal and parietal electrodes, i.e. electrodes T3, T4, T5, P3, P4 and T6.
We thus take into account the frequency content of the signal recorded at these electrodes in the
detection procedure.
Figure 3.7. Set of clean ADBs in the space of their power spectral densities. The shape of this
set depends on the subject and the environmental conditions at the time of recording, hence a
calibration phase to adjust the artifact detection parameters is needed. The initial shape of the set
of clean vectors is approximated by a sphere whose parameters are estimated using the calibration
set.
From the above considerations it can be said that, in the space ℵ of ADB power spectral
densities, the clean ADBs lie close to each other. This means that the set of clean
ADBs lies in a small region of the space that is surrounded by ADBs containing artifacts
(see Fig. 3.7). The shape of the set of clean ADBs depends on the subject and on the
environmental conditions at the time of recording. Hence, the detection parameters need
to be adapted at the beginning of each recording session (calibration phase).
For reasons of robustness and execution speed, the detection of ocular and muscular
artifacts is performed separately. The space in which ocular artifacts are detected (ocular
space) is composed of vectors containing the powers in 2 Hz wide bands from 2 to 40 Hz
at electrodes Fp1 and Fp2. The space in which muscular artifacts are detected (muscular
space) is composed of vectors containing the powers (in the same bands as in the ocular
space) at electrodes T3, T4, T5, P3, P4 and T6. Therefore, the vectors are 38 and 114
dimensional in the ocular and muscular spaces respectively. The band powers are estimated
using the Welch method, presented in Chapter 4 (Section 4.3).
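The construction of the ocular and muscular vectors can be sketched as follows. This is only an illustrative sketch: the sampling frequency of 256 Hz, the Hann window, the 50% overlap and the helper names are assumptions, and the Welch estimator actually used is the one presented in Chapter 4.

```python
import numpy as np

def welch_psd(x, fs, nperseg):
    """Welch estimate: average of Hann-windowed periodograms with 50% overlap."""
    window = np.hanning(nperseg)
    scale = fs * np.sum(window ** 2)
    starts = range(0, len(x) - nperseg + 1, nperseg // 2)
    periodograms = [np.abs(np.fft.rfft(window * x[s:s + nperseg])) ** 2 / scale
                    for s in starts]
    return np.fft.rfftfreq(nperseg, d=1.0 / fs), np.mean(periodograms, axis=0)

def band_power_vector(adb, fs, f_lo=2.0, f_hi=40.0, width=2.0):
    """Stack the powers in `width`-Hz wide bands for every channel of one ADB.

    adb has shape (n_channels, n_samples).
    """
    nperseg = int(fs / width)                 # frequency resolution equal to the band width
    edges = np.arange(f_lo, f_hi + width / 2.0, width)   # 2, 4, ..., 40 Hz
    features = []
    for channel in adb:
        freqs, psd = welch_psd(channel, fs, nperseg)
        df = freqs[1] - freqs[0]
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_band = (freqs >= lo) & (freqs < hi)
            features.append(psd[in_band].sum() * df)
    return np.array(features)

rng = np.random.default_rng(0)
fs = 256.0                                          # assumed sampling frequency [Hz]
ocular_adb = rng.standard_normal((2, 256))          # stand-in for Fp1, Fp2 (1 s)
muscular_adb = rng.standard_normal((6, 256))        # stand-in for T3, T4, T5, P3, P4, T6
ocular_vector = band_power_vector(ocular_adb, fs)
muscular_vector = band_power_vector(muscular_adb, fs)
```

With 19 bands per electrode, the two vectors have 2 x 19 = 38 and 6 x 19 = 114 components, which matches the dimensions stated above.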
The detection procedure is the same for both types of artifacts. Only its parameters
need to be adapted to each artifact type during the calibration phase which lasts for a period
varying from five to ten minutes. During the calibration, the subject is asked to blink his
eyes and to execute slight head and hand movements, about 30 times each, at randomly
chosen times. The resulting EEG is segmented into ADBs and the ocular and muscular
vectors are computed for each ADB. At the end of the calibration phase two sets (one set
per type of artifact) of vectors are available. In each of these sets we approximately know
the percentage of vectors corresponding to ADBs containing artifacts (the exact percentage
cannot be known since the subject could have generated additional artifacts).
In the following we present the general detection procedure which was adapted from the
novelty detection framework presented in [147, 157].
Let ℵ be the space of vectors computed from every possible ADB. We call artifact (clean)
vector a vector resulting from an ADB that contains (does not contain) an artifact. The
shape of the set of clean vectors is unknown. To effectively discriminate between clean and
artifact vectors, we seek a criterion that evaluates whether or not a given vector belongs
to the clean set.
The detection criterion is built using the calibration set ℵcal = {V1 , . . . , VNcal } ⊂ ℵ where
Ncal is the number of ADBs recorded in the calibration phase. Since we ask the subject
to produce a certain number of artifacts we approximately know the fraction of artifact
vectors in the calibration set1 . We denote as ra the expected fraction of artifact vectors.
From the considerations in Section 3.4.2, we know that the clean vectors belonging to
the calibration set must lie in a compact region of ℵ (the assumption of compactness is
reasonable since the clean vectors are close to each other with respect to their Euclidean
distance). To start, we assume that this region can be approximated by a sphere of radius
Rc centered at Cc ∈ ℵ (see Fig. 3.7). The radius and the center are found by solving the
optimization problem:
\[ \min_{R_c,\, C_c,\, \xi_i} \left( R_c^2 + \kappa \sum_{i=1}^{N_{cal}} \xi_i \right) \tag{3.3} \]
under constraints
\[ \left\| V_i - C_c \right\|_{\aleph}^2 \le R_c^2 + \xi_i \tag{3.4} \]
\[ \xi_i \ge 0 \tag{3.5} \]
for i = 1, . . . , Ncal
where κ is a penalization constant whose value is linked to the fraction of artifact vectors
(see Eq. 3.19) and ‖·‖ℵ is the Euclidean norm in the space ℵ. The positive slack variable ξi
controls the position of Vi with respect to the approximating sphere. Indeed, if the value
of ξi at the optimum is larger than zero then, Vi lies outside the approximating sphere and
is therefore considered as an artifact.
To solve the optimization problem (3.3) under constraints (3.4) and (3.5), one in-
troduces positive Lagrange multipliers µ1 , . . . , µNcal , γ1 , . . . , γNcal to obtain the primal La-
grangian [130]:
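\[ L_g = R_c^2 + \kappa \sum_{i=1}^{N_{cal}} \xi_i
   - \sum_{i=1}^{N_{cal}} \gamma_i \left( R_c^2 + \xi_i - \left\| V_i - C_c \right\|_{\aleph}^2 \right)
   - \sum_{i=1}^{N_{cal}} \mu_i \xi_i \tag{3.6} \]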
The primal Lagrangian should be minimized with respect to the primal variables, Rc , ξi , Cc
and maximized with respect to the dual ones, γi , µi . Taking derivatives of Lg with respect
1
Such fraction is only approximately known since the subject could have produced more artifacts
to the primal variables Rc , ξi , Cc and setting them to zero leads to the following results.
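\[ \partial_{R_c} L_g = 0 \;\Rightarrow\; \sum_{i=1}^{N_{cal}} \gamma_i = 1 \tag{3.7} \]
\[ \partial_{C_c} L_g = 0 \;\Rightarrow\; \sum_{i=1}^{N_{cal}} \gamma_i V_i = C_c \tag{3.8} \]
\[ \partial_{\xi_i} L_g = 0 \;\Rightarrow\; \gamma_i + \mu_i = \kappa \tag{3.9} \]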
By replacing (3.7), (3.8), and (3.9) in (3.6), one obtains the dual optimization problem:
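\[ \max_{\gamma_i} \left( \sum_{i=1}^{N_{cal}} \gamma_i \left\langle V_i, V_i \right\rangle_{\aleph}
   - \sum_{i_1, i_2 = 1}^{N_{cal}} \gamma_{i_1} \gamma_{i_2} \left\langle V_{i_1}, V_{i_2} \right\rangle_{\aleph} \right) \tag{3.10} \]
subject to (3.7) and
\[ 0 \le \gamma_i \le \kappa \; ; \qquad i = 1, \ldots, N_{cal} \tag{3.11} \]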
where ⟨Vi1 , Vi2 ⟩ℵ is the inner (scalar) product of Vi1 , Vi2 . The dual optimization problem
can be easily solved using standard quadratic optimization techniques [168]. By abuse of
notation, we continue to write γi , µi , ξi for the values, at the optimum, of these parameters.
Thus, the center and the radius of the approximating sphere are given by:
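\[ C_c = \sum_{i=1}^{N_{cal}} \gamma_i V_i \tag{3.12} \]
\[ R_c^2 = \left\| V_{\hat{\imath}} - C_c \right\|_{\aleph}^2
   = \left\langle V_{\hat{\imath}}, V_{\hat{\imath}} \right\rangle_{\aleph}
   - 2 \sum_{i=1}^{N_{cal}} \gamma_i \left\langle V_{\hat{\imath}}, V_i \right\rangle_{\aleph}
   + \sum_{i_1, i_2 = 1}^{N_{cal}} \gamma_{i_1} \gamma_{i_2} \left\langle V_{i_1}, V_{i_2} \right\rangle_{\aleph} \tag{3.13} \]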
where Vî is a vector that is on the approximating sphere, i.e. 0 < γî < κ (see Fig. 3.8).
At the optimum, the Karush-Kuhn-Tucker conditions [95] imply that the following
relations hold.
\[ \gamma_i \left( R_c^2 + \xi_i - \left\| V_i - C_c \right\|_{\aleph}^2 \right) = 0 \tag{3.14} \]
\[ \mu_i \xi_i = 0 \tag{3.15} \]
The position of Vi with respect to the approximating sphere depends on the value of γi .
Three possibilities exist:
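The reasoning behind these three cases can be sketched directly from relations (3.9), (3.14), (3.15) and from the constraints of the primal problem. The derivation below is a reconstruction given for convenience; it is not the original statement of the three cases.

```latex
\begin{align*}
\gamma_i = 0 \;&\Rightarrow\; \mu_i = \kappa > 0 \;\Rightarrow\; \xi_i = 0
  \;\Rightarrow\; \left\| V_i - C_c \right\|_{\aleph}^2 \le R_c^2
  && \text{($V_i$ inside or on the sphere)} \\
0 < \gamma_i < \kappa \;&\Rightarrow\; \mu_i = \kappa - \gamma_i > 0 \;\Rightarrow\; \xi_i = 0
  \;\Rightarrow\; \left\| V_i - C_c \right\|_{\aleph}^2 = R_c^2
  && \text{($V_i$ on the sphere)} \\
\gamma_i = \kappa \;&\Rightarrow\; \mu_i = 0,\; \xi_i \ge 0
  \;\Rightarrow\; \left\| V_i - C_c \right\|_{\aleph}^2 = R_c^2 + \xi_i
  && \text{($V_i$ on or outside the sphere)}
\end{align*}
```

In each line the first implication uses (3.9), the second uses (3.15), and the last uses either the primal constraint on the distance (first case) or (3.14) (second and third cases).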
Figure 3.8. Position of Vi with respect to the approximating sphere for different values of γi . For
values of γi in [0, κ[, the corresponding Vi is considered as a clean vector. Conversely, if γi = κ
the corresponding Vi is considered as a clean or artifact vector depending on the value of the slack
variable ξi .
To decide whether a vector Ṽ , not belonging to the calibration set, is an artifact vector
or not, we compute its square distance to the center of the approximating sphere:
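\[ \left\| \tilde{V} - C_c \right\|_{\aleph}^2
   = \left\langle \tilde{V}, \tilde{V} \right\rangle_{\aleph}
   - 2 \sum_{i=1}^{N_{cal}} \gamma_i \left\langle \tilde{V}, V_i \right\rangle_{\aleph}
   + \sum_{i_1, i_2 = 1}^{N_{cal}} \gamma_{i_1} \gamma_{i_2} \left\langle V_{i_1}, V_{i_2} \right\rangle_{\aleph} \tag{3.20} \]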
Ṽ is considered as an artifact if the detection ratio is larger than 1 and as a clean vector
otherwise.
It is worth mentioning that (3.20) depends only on those calibration vectors that are at
the boundary or outside the approximating sphere (indeed, the vectors inside the approx-
imating sphere have their corresponding γ equal to zero). Such vectors are usually called
support vectors [157].
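Once the dual problem has been solved, the detection step only needs inner products with the calibration vectors. The sketch below assumes that the multipliers γi are already available (the quadratic solver is not shown) and that the detection ratio is the squared distance (3.20) divided by the squared radius (3.13); this form of the ratio, like the function names, is an assumption of the example.

```python
import numpy as np

def linear_kernel(u, v):
    """Plain Euclidean inner product; any other kernel could be passed instead."""
    return float(np.dot(u, v))

def detection_ratio(v_new, cal_vectors, gamma, kappa, kernel=linear_kernel):
    """Squared distance of v_new to the centre (3.20) over the squared radius (3.13)."""
    gram = np.array([[kernel(a, b) for b in cal_vectors] for a in cal_vectors])
    const = gamma @ gram @ gamma   # double sum over gamma_i1 * gamma_i2 * <V_i1, V_i2>

    def squared_distance(v):
        cross = np.array([kernel(v, c) for c in cal_vectors])
        return kernel(v, v) - 2.0 * (gamma @ cross) + const

    # Any calibration vector with 0 < gamma_i < kappa lies on the sphere.
    on_sphere = [i for i, g in enumerate(gamma) if 1e-9 < g < kappa - 1e-9]
    radius_sq = squared_distance(cal_vectors[on_sphere[0]])
    return squared_distance(v_new) / radius_sq

# Toy example: two calibration vectors, for which gamma = (0.5, 0.5) solves the dual.
cal = np.array([[-1.0, 0.0], [1.0, 0.0]])
gamma = np.array([0.5, 0.5])
far = detection_ratio(np.array([0.0, 2.0]), cal, gamma, kappa=1.0)    # outside the sphere
near = detection_ratio(np.array([0.5, 0.0]), cal, gamma, kappa=1.0)   # inside the sphere
```

Only the calibration vectors with a non-zero γi contribute to the sums, so in practice the loops can be restricted to the support vectors. The sketch fails if no multiplier lies strictly between 0 and κ.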
So far, we assumed that the shape of the set of clean vectors could be approximated by
a sphere. This approximation made it possible to obtain a simple detection criterion through the
solution of a standard quadratic optimization problem.
However, there is no a priori reason that makes the sphere the preferred approximation
shape for the set of clean vectors. In certain cases, especially when the clean set is
non-convex, the sphere approximation is clearly flawed. Thus, we need to consider more flexible
shapes to approximate the clean set. This can be easily done by means of the "kernel
trick" [1], which consists in replacing the inner products ⟨·, ·⟩ℵ in the detection procedure
by a kernel function K (·, ·) that satisfies the Mercer conditions [1] (see Chapter 5,
Section 5.2). One can show [169] that the latter amounts to mapping the space ℵ into a high
(possibly infinite) dimensional space Jℵ , through a map J , such that K (·, ·) is the inner
product in Jℵ . This means that the following relation holds.
where σ is the Gaussian kernel parameter. One can show that for fixed κ, the smaller σ the
smaller the number of artifact vectors in the calibration set [157]. Because of the definition
of the Gaussian kernel in terms of the ratio between the distance of its arguments and σ,
we discuss the influence of this parameter by considering its normalized version, σr = σ/∆m ,
where ∆m is the minimum distance between two different calibration vectors.
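The normalization can be sketched as follows. The exact form of the Gaussian kernel is not reproduced in this excerpt; the sketch assumes the common form exp(−‖u − v‖² / (2σ²)), and the function names are hypothetical.

```python
import math
from itertools import combinations

def min_pairwise_distance(vectors):
    """Delta_m: smallest Euclidean distance between two different calibration vectors."""
    return min(math.dist(u, v) for u, v in combinations(vectors, 2) if u != v)

def gaussian_kernel(u, v, sigma):
    """Assumed Gaussian kernel: exp(-||u - v||^2 / (2 sigma^2))."""
    return math.exp(-math.dist(u, v) ** 2 / (2.0 * sigma ** 2))

calibration = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # toy calibration vectors
delta_m = min_pairwise_distance(calibration)          # 5.0 for this toy set
sigma_r = 2.0                                         # normalized kernel parameter
sigma = sigma_r * delta_m                             # sigma = sigma_r * Delta_m
k01 = gaussian_kernel(calibration[0], calibration[1], sigma)
```

Working with σr rather than σ makes the kernel parameter independent of the overall scale of the calibration vectors.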
From relation (3.19) linking κ to the expected fraction of artifact vectors in the cali-
bration set, we can deduce that for fixed σ, the larger κ the smaller the fraction of artifact
vectors in the calibration set. In geometrical terms we can think of κ as a factor limiting
the generalized volume of the approximating region.
For the sake of visualization we illustrate the role of σr and κ in a 2D toy problem.
In Fig. 3.9 we illustrate the influence of the Gaussian kernel parameter (for fixed κ) on
the shape of the approximating region. In particular, a small σr makes the approximating
region over-fit the data while a large σ makes the approximating region become a sphere.
In Fig. 3.10 we illustrate the influence of κ (for fixed σ, which amounts to fix the shape of
the approximating region) on the extent of the approximating region.
Thus, we can control the shape and volume of the approximating region through σ
and κ respectively. The adequate choice of these parameters depends on the data. In
Fig. 3.11 we report the fraction of artifact vectors as a function of σr for different values
of κ (detection curves). As predicted by the relation (3.19), κ establishes an upper bound
on the fraction of artifact vectors. This means that it is possible to fix κ by using the
expected fraction of artifact vectors and then adjust σ to match the requirements in terms
of detection sensitivity.
[Figure 3.9: grid of 2D panels (both axes from −1 to 1), one panel per value of σr . Recoverable
panel labels: σr = 11.43, 12.78, 14.00, 15.12, 25.56, 40.41.]
Figure 3.9. Influence of the Gaussian kernel parameter (for fixed κ = 0.05) on the shape of the
approximating region. The data are represented by the black dots. The darker the region the smaller
the detection ratio computed using (3.21). The white zone surrounding the approximating region
corresponds to the region in which the rejected data lie.
As can be seen, the shape of the approximating region is effectively controlled by σr . In particular,
the smaller σr the smaller the fraction of rejected data (rejected data corresponds to artifact vectors
in the framework of artifact detection). As σr increases the shape of the approximating region
becomes more spherical.
[Figure 3.10: grid of 2D panels (both axes from −1 to 1), one panel per value of κ. Recoverable
panel labels: κ = 0.063, 0.048, 0.038, 0.032, 0.028, 0.024.]
Figure 3.10. Influence of the parameter κ (for fixed σr = 15) on the volume of the approximating
region in the context of the toy problem considered in Fig. 3.9. Once the shape of the approximating
region is fixed by σr , its volume is limited by κ. Thus, the larger κ the smaller the fraction of rejected
data (or the larger the allowed volume of the approximating region).
[Figure 3.11: fraction of rejected data (0 to 0.5) versus σr (10 to 30), one detection curve per
value of κ: 0.02, 0.03, 0.05, 0.1, 0.2.]
Figure 3.11. The parameters σr and κ allow us to control the shape and the volume of the approx-
imating region respectively. The adequate selection of these parameters is data dependent. The
detection curves (for the toy problem of Figs. 3.9 and 3.10) depicted here show the joint influence
of the detection parameters on the fraction of rejected data. The limiting role of κ becomes evident
on the detection curves.
[Figure 3.12: top: amplitude [µV] versus time [s] (0 to 10 s); middle: detection ratio per ADB
(0−1 s to 9−10 s) for σr = 4, with the detection threshold; bottom: detection ratio per ADB for
σr = 40, with the detection threshold.]
Figure 3.12. Detection of ocular artifacts. Top: Signal recorded at electrode Fp1 containing ocular
artifacts. Middle: Detection ratio (σr = 4 and κ = 4ra3Ncal ) for the ADBs (one-second long segments
overlapped by half a second) of the signal on the top. Bottom: Detection ratio for σr = 40 and
κ = 4ra3Ncal . In this example, the algorithm for σr = 4 fails to detect the artifact in the middle.
Conversely, σr = 40 leads to false artifact detections.
[Figure 3.13: top: signal at electrode T3, amplitude [µV] versus time [s] (0 to 4 s); middle and
bottom: one value per ADB (0−1 s, 0.5−1.5 s, 1−2 s, 1.5−2.5 s, 2−3 s, 2.5−3.5 s, 3−4 s), vertical
axis from 0 to 0.8.]
Figure 3.13. Detection of muscular artifacts. Top: Signal recorded at electrode T3 containing
muscular artifacts. Middle: Detection ratio (σr = 4 and κ = 4ra3Ncal ) for the ADBs (one-second
long segments overlapped by half a second) of the signal on the top. Bottom: Detection ratio for
σr = 40 and κ = 4ra3Ncal .
3.5 Summary
The presence of ocular and muscular artifacts makes it difficult to extract useful information
that can be exploited by the BCI. Furthermore, they can lead to erroneous conclusions
about the control performance of a subject. To prevent these issues we discard the data
containing artifacts. To implement the rejection criterion we considered the frequency
domain characteristics of artifacts, which make them easily distinguishable from regular EEG.
By using an adapted version of the novelty detection algorithm presented in [157] we
can easily control the artifact detection sensitivity through two parameters that can be set
by the operator in an interactive way.
In Fig. 3.14 we summarize the function of the acquisition and preprocessing modules
within the BCI system. The raw EEG-trials delivered by the acquisition module are re-
referenced and their power line noise is filtered. If the EEG-trial contains muscular or
ocular artifacts the BCI does not attempt to generate an action command from such a
trial. Instead, it notifies the subject by executing predefined actions depending on whether
ocular or muscular artifacts were detected.
Figure 3.14. Role of the EEG acquisition and preprocessing modules. The non-preprocessed EEG-
trials delivered by the acquisition module are re-referenced and their power line noise is filtered.
If the EEG-trial contains muscular or ocular artifacts the BCI does not attempt to generate an
action command from such a trial. Instead, it notifies the subject by executing predefined actions
depending on whether ocular or muscular artifacts were detected.
Chapter 4. Feature extraction
“We are always paid for our suspicion by finding
what we suspect.” Henry David Thoreau
4.1 Introduction
In the previous chapter we presented the preprocessing procedure through which the exter-
nal noise is removed and the EEG-trials containing artifacts are detected and discarded. In
this chapter we focus on the estimation of statistical measurements (or features) from the
perturbation free EEG-trials delivered by the preprocessing module. The features comput-
ed on a given EEG-trial are grouped into a vector called feature vector that is sent to the
pattern recognition module which evaluates the likelihoods that the EEG-trial (represented
by its feature vector) was produced during the execution of the MAs in the controlling set
(see Fig. 4.1).
Features need to reflect properties of EEG that are relevant for the recognition of MAs.
The choice of adequate features to characterize EEG has been the object of active research
during the last decades [115, 174]. As a matter of fact, the techniques used to analyze EEG
evolved in parallel with the development of novel signal processing concepts. In particular,
the analysis of the generalized interaction (in time, frequency, and phase) between EEG
channels has emerged as a tool to study EEG data [4, 43].
A complete analysis that takes into account time, frequency and phase would result in
a very large number of features (Section 4.2.2) and consequently a high dimensional feature
vector. Because BCI applications require a continuous adaptation of the recognition models
and a reasonable training time (Chapter 5), high dimensional feature vectors are
clearly unsuitable.
50 Chapter 4. Feature extraction
Figure 4.1. The feature extraction module is in charge of computing statistical properties (features)
on an EEG-trial (free of artifacts) S delivered by the preprocessing module. The mappings associated
with each MA in the controlling set, ψ (1) (S), . . . , ψ (NMA ) (S) are computed (x(k) = ψ (k) (S)) and sent
to the pattern recognition module which evaluates the likelihoods that S was generated during the
performance of each MA.
By assuming certain hypotheses on the properties of EEG, fewer features are required to
characterize an EEG-trial. In this chapter, we present such hypotheses and derive different
mappings from the EEG-trial set into feature spaces. Each mapping is associated with the
hypotheses that are used to define it. As presented in Chapter 6, depending
on the subject, a single mapping is not sufficient for the recognition of all the MAs in the
controlling set. Therefore, the best mapping to recognize each MA has to be chosen. This
choice is carried out according to the optimality criterion presented in Chapter 6. The
mapping associated with MAk is denoted as ψ (k) (see Fig. 4.1).
This chapter is organized as follows. First, the general time-frequency analysis of
stochastic signals is considered. Second, the hypotheses from which the mappings are obtained
are discussed, and finally the resulting mappings are presented.
The properties of s can be described in time using first and second order moments computed
on the random variables s(n). These moments are:
• The expectation of s(n): Ep(s(n)) [s(n)], where p (s(n)) is the probability density func-
tion associated with s(n).
• The expectation of the product s(n1 )s(n2 ): Ep(s(n1 ),s(n2 )) [s(n1 )s(n2 )], where p (s(n1 ), s(n2 ))
is the joint probability density function of s(n1 ) and s(n2 ).
Expectations taken with respect to the probability density functions associated with
the random variables s(n) are called ensemble averages. For convenience of notation, we
denote as Es [·] any ensemble average over s.
The signal power Ps and time autocorrelation function Rs (n, τ ) are defined as:
\[ P_s = E_s\left[ \frac{1}{N} \sum_{n=0}^{N-1} |s(n)|^2 \right] \tag{4.1} \]
\[ R_s(n, \tau) = E_s\left[ s^*(n - \tau)\, s(n) \right] \tag{4.2} \]
where ∗ stands for the complex conjugate operator (see footnote 1), n is the time at which Rs is
computed, and τ ∈ {−N + 1, . . . , N − 1} is the time lag. Since Ps can be written as an average
over time of $E_s[|s(n)|^2] = R_s(n, 0)$, it follows that $E_s[|s(n)|^2]$ can be considered as the
signal power density in the time domain (or power time density PTD). Thus, we can use
$E_s[|s(n)|^2]$ to compute the average, over the PTD, of any time function γ(n) as follows.
\[ \langle \gamma(n) \rangle_{PTD} = \frac{1}{N} \sum_{n=0}^{N-1} \gamma(n)\, E_s\left[ |s(n)|^2 \right] \tag{4.3} \]
1 It is worth noting that even though we consider real signals, the complex conjugate in the
definition of Rs (n, τ ) facilitates further developments.
The frequency properties of s can be examined using its discrete Fourier transform defined
as:
\[ \hat{s}(\vartheta) = \sum_{n=0}^{N-1} s(n) \exp\left( -j \frac{2\pi n \vartheta}{N} \right) \tag{4.4} \]
where ϑ is the frequency index. The correspondence between the frequency index ϑ and
the actual frequency f (in Hz) is given by [114]:
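\[ f = \frac{f_s\, \vartheta}{N}, \qquad \vartheta = 0, \ldots, \frac{N}{2} \tag{4.5} \]
where fs is the sampling frequency. The values $\hat{s}(N/2), \ldots, \hat{s}(N - 1)$ correspond to the
negative part of the spectrum of s [135]. In fact, one can easily verify that:
\[ \hat{s}(\vartheta) = \hat{s}^*(N - \vartheta), \qquad \vartheta = 1, \ldots, \frac{N}{2} \tag{4.6} \]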
Using the discrete inverse Fourier transform, s can be obtained from ŝ as follows.
\[ s(n) = \frac{1}{N} \sum_{\vartheta=0}^{N-1} \hat{s}(\vartheta) \exp\left( j \frac{2\pi n \vartheta}{N} \right) \tag{4.7} \]
This relation is easily verified by replacing ŝ(ϑ) by its definition (4.4) and using the identi-
ties:
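\[ \frac{1}{N} \sum_{n=0}^{N-1} \exp\left( j \frac{2\pi n \vartheta}{N} \right) = \delta_d(\vartheta) \tag{4.8} \]
\[ \sum_{\vartheta=0}^{N-1} g(\vartheta)\, \delta_d(\vartheta - \vartheta') = g(\vartheta'), \qquad \vartheta' = 0, \ldots, N - 1 \tag{4.9} \]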
where δd (·) is the digital delta function which is equal to one at zero and equal to zero
elsewhere.
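The conventions (4.4) to (4.7) can be checked numerically. The sketch below is an illustration on one realization of a real random signal; the signal length, the sampling frequency of 256 Hz and the function names are assumptions of the example. With these conventions, the transform pair coincides with the one implemented by `numpy.fft.fft` and `numpy.fft.ifft`.

```python
import numpy as np

def dft(s):
    """Eq. 4.4 written out term by term."""
    N = len(s)
    n = np.arange(N)
    return np.array([np.sum(s * np.exp(-2j * np.pi * n * k / N)) for k in range(N)])

def inverse_dft(s_hat):
    """Eq. 4.7 written out term by term."""
    N = len(s_hat)
    k = np.arange(N)
    return np.array([np.sum(s_hat * np.exp(2j * np.pi * n * k / N)) for n in range(N)]) / N

rng = np.random.default_rng(0)
N, fs = 64, 256.0                              # assumed length and sampling frequency
s = rng.standard_normal(N)                     # one realization of a real signal
s_hat = dft(s)

freqs_hz = fs * np.arange(N // 2 + 1) / N      # Eq. 4.5: frequency of each index
idx = np.arange(1, N // 2 + 1)
is_symmetric = np.allclose(s_hat[idx], np.conj(s_hat[N - idx]))   # Eq. 4.6
```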
Similarly to the time domain, first and second order moments can be defined on the
random variables ŝ(ϑ). In particular, the frequency autocorrelation function can be defined
as:
\[ R_s(\vartheta, \upsilon) = \frac{1}{N} E_{\hat{s}}\left[ \hat{s}^*(\vartheta - \upsilon)\, \hat{s}(\vartheta) \right] \tag{4.10} \]
where υ is the frequency lag and the normalization factor 1/N takes into account the Parseval
identity (4.12). Note that the ensemble average in (4.10) is taken with respect to the joint
probability density function: p(ŝ(ϑ − υ), ŝ(ϑ)).
It is well known that if a new signal s′ is obtained from s through an invertible function
F then:
\[ E_s\left[ G(s) \right] = E_{s'}\left[ G\left( F^{-1}(s') \right) \right] \tag{4.11} \]
where G(·) is any function of s. In the following, for brevity of notation we use Es [·] to
denote any ensemble average over s or any other signal obtained from s.
4.2. An overview of time-frequency analysis for stochastic signals 53
This result can be thought of as a power conservation relation between the time and fre-
quency domains.
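For a single realization, this power conservation can be checked numerically. The sketch below assumes the DFT convention (4.4), for which the time average of |s(n)|² equals the sum over ϑ of |ŝ(ϑ)|² divided by N²; the ensemble average is dropped, which is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 128
s = rng.standard_normal(N)                 # one realization of the signal

s_hat = np.fft.fft(s)                      # same convention as Eq. 4.4
power_time = np.mean(np.abs(s) ** 2)       # (1/N) sum_n |s(n)|^2
power_freq = np.sum(np.abs(s_hat) ** 2) / N ** 2   # (1/N^2) sum_theta |s_hat(theta)|^2
```

Both quantities agree up to rounding errors, which is the sense in which the power is conserved between the two domains.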
Since the signal power, according to (4.12), can be written as an average over ϑ of
$\frac{1}{N} E_s[|\hat{s}(\vartheta)|^2] = R_s(\vartheta, 0)$, it follows that $\frac{1}{N} E_s[|\hat{s}(\vartheta)|^2]$ can be considered as the signal
power density in the frequency domain (or power spectrum density PSD). Thus, we can use
$\frac{1}{N} E_s[|\hat{s}(\vartheta)|^2]$ to compute the average, over the PSD, of any frequency function g(ϑ) as:
\[ \langle g(\vartheta) \rangle_{PSD} = \frac{1}{N^2} \sum_{\vartheta=0}^{N-1} g(\vartheta)\, E_s\left[ |\hat{s}(\vartheta)|^2 \right] \tag{4.13} \]
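In particular, when $g(\vartheta) = \exp\left( j \frac{2\pi\tau\vartheta}{N} \right)$, one obtains the characteristic function of the
power spectrum density (i.e. its inverse Fourier transform):
\[ \left\langle \exp\left( j \frac{2\pi\tau\vartheta}{N} \right) \right\rangle_{PSD}
   = \frac{1}{N^2} \sum_{\vartheta=0}^{N-1} E_s\left[ |\hat{s}(\vartheta)|^2 \right] \exp\left( j \frac{2\pi\tau\vartheta}{N} \right) \tag{4.14} \]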
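By replacing ŝ(ϑ) by (4.4) in the PSD characteristic function (4.14) and using the
definition of the time autocorrelation function (4.2), we obtain:
\begin{align*}
\left\langle \exp\left( j \frac{2\pi\tau\vartheta}{N} \right) \right\rangle_{PSD}
 &= \frac{1}{N^2} E_s\left[ \sum_{n_1=0}^{N-1} \sum_{n=0}^{N-1} \sum_{\vartheta=0}^{N-1}
    s^*(n_1)\, s(n) \exp\left( j \frac{2\pi (n_1 - n + \tau)\vartheta}{N} \right) \right] \\
 &= \frac{1}{N} E_s\left[ \sum_{n_1=0}^{N-1} \sum_{n=0}^{N-1} s^*(n_1)\, s(n)\, \delta_d(n_1 - n + \tau) \right] \\
 &= \frac{1}{N} \sum_{n=0}^{N-1} E_s\left[ s(n)\, s^*(n - \tau) \right]
  = \frac{1}{N} \sum_{n=0}^{N-1} R_s(n, \tau) \tag{4.15}
\end{align*}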
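From (4.15) and (4.14) it follows that $E_s[|\hat{s}(\vartheta)|^2]$ is the Fourier transform of $\sum_{n=0}^{N-1} R_s(n, \tau)$.
Hence, we can write:
\[ \frac{1}{N} E_s\left[ |\hat{s}(\vartheta)|^2 \right]
   = \frac{1}{N} \sum_{\tau=0}^{N-1} \sum_{n=0}^{N-1} R_s(n, \tau) \exp\left( -j \frac{2\pi\tau\vartheta}{N} \right) \tag{4.16} \]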
Thus, the PSD of s can be obtained by taking the Fourier transform, with respect to
the time lag variable τ of the sum over n of the time autocorrelation functions Rs (n, τ ).
This result constitutes a generalization of the Wiener-Khinchin theorem [135] for stochastic
signals.
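Relation (4.16) can be illustrated on a single realization, i.e. without the ensemble average. The sketch below assumes that the time indices are taken modulo N (circular indexing), which is what the identity (4.8) implies; the signal length is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 32
s = rng.standard_normal(N)

# r[tau] = sum_n conj(s[n - tau]) * s[n], with time indices taken modulo N;
# np.roll(s, tau)[n] is s[(n - tau) mod N].
r = np.array([np.sum(np.conj(np.roll(s, tau)) * s) for tau in range(N)])

# Right-hand side of (4.16) multiplied by N: DFT of r over the lag tau.
psd_from_autocorr = np.fft.fft(r)
# Left-hand side of (4.16) multiplied by N: squared modulus of the DFT (4.4).
psd_direct = np.abs(np.fft.fft(s)) ** 2
```

The two arrays coincide up to rounding errors; in particular the imaginary part of `psd_from_autocorr` is numerically zero.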
Following the same line of reasoning, the characteristic function of the power time
density (4.3) is:
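\[ \left\langle \exp\left( j \frac{2\pi n \upsilon}{N} \right) \right\rangle_{PTD}
   = \frac{1}{N} \sum_{n=0}^{N-1} E_s\left[ |s(n)|^2 \right] \exp\left( j \frac{2\pi n \upsilon}{N} \right) \tag{4.17} \]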
By replacing s(n) by (4.7) in the above relation we obtain the dual form of the Wiener-
Khinchin theorem:
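\[ E_s\left[ |s(n)|^2 \right]
   = \frac{1}{N} \sum_{\upsilon=0}^{N-1} \sum_{\vartheta=0}^{N-1} R_s^*(\vartheta, \upsilon) \exp\left( -j \frac{2\pi n \upsilon}{N} \right) \tag{4.18} \]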
Notice that the Parseval identity (4.12), the Wiener-Khinchin relation (4.16), and its
dual form (4.18) connect time and frequency ensemble averages.
The time and frequency power densities, $E_s[|s(n)|^2]$ and $\frac{1}{N} E_s[|\hat{s}(\vartheta)|^2]$, along with the
time and frequency autocorrelation functions, Rs (n, τ ) and Rs (ϑ, υ), allow us to analyze s
independently in time and in frequency. We now turn to time-frequency representations (TFRs)
of s that make it possible to characterize the power and the correlation in the TF plane.
Wigner-Ville transform
The fundamental power based TFR of a signal is its Wigner-Ville transform (WVT) [30].
The WVT of s is defined as:
\[ W_s(n, \vartheta) = \frac{1}{N} \sum_{\tau=0}^{N-1} s^*(n - \tau)\, s(n) \exp\left( -j \frac{2\pi\tau\vartheta}{N} \right) \tag{4.19} \]
The normalizing factor 1/N is introduced to satisfy the marginal properties (4.23) to (4.25).
The frequency version of the WVT is obtained by replacing s(n) by (4.7), in the WVT
definition. This yields:
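\[ W_s(n, \vartheta) = \frac{1}{N^2} \sum_{\upsilon=0}^{N-1} \hat{s}^*(\vartheta)\, \hat{s}(\vartheta - \upsilon) \exp\left( -j \frac{2\pi n \upsilon}{N} \right) \tag{4.20} \]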
By taking ensemble averages on both sides in (4.19) and (4.20), and using the definitions
of time (4.2) and frequency (4.10) autocorrelation functions, we obtain the expected WVT
of s:
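\begin{align*}
E_s\left[ W_s(n, \vartheta) \right]
 &= \frac{1}{N} \sum_{\tau=0}^{N-1} R_s(n, \tau) \exp\left( -j \frac{2\pi\tau\vartheta}{N} \right) \tag{4.21} \\
 &= \frac{1}{N} \sum_{\upsilon=0}^{N-1} R_s^*(\vartheta, \upsilon) \exp\left( -j \frac{2\pi n \upsilon}{N} \right) \tag{4.22}
\end{align*}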
The expected WVT can be considered as an indicator of the signal power density in
time and frequency. Indeed, Es [Ws (n, ϑ)] is real everywhere (since: Ws (n, ϑ) = Ws∗ (n, ϑ))
and it satisfies the marginal properties, i.e. its sum over frequency (4.23) and time (4.24)
gives the signal power density in time and frequency respectively, and the sum over time
and frequency (4.25), divided by N , gives the signal power.
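\[ \sum_{\vartheta=0}^{N-1} E_s\left[ W_s(n, \vartheta) \right]
   = \frac{1}{N} E_s\left[ \sum_{\vartheta=0}^{N-1} \sum_{\tau=0}^{N-1} s^*(n - \tau)\, s(n) \exp\left( -j \frac{2\pi\tau\vartheta}{N} \right) \right]
   = E_s\left[ |s(n)|^2 \right] \tag{4.23} \]
\[ \sum_{n=0}^{N-1} E_s\left[ W_s(n, \vartheta) \right]
   = \frac{1}{N^2} E_s\left[ \sum_{n=0}^{N-1} \sum_{\upsilon=0}^{N-1} \hat{s}^*(\vartheta)\, \hat{s}(\vartheta - \upsilon) \exp\left( -j \frac{2\pi n \upsilon}{N} \right) \right]
   = \frac{1}{N} E_s\left[ |\hat{s}(\vartheta)|^2 \right] \tag{4.24} \]
\[ \frac{1}{N} \sum_{n=0}^{N-1} \sum_{\vartheta=0}^{N-1} E_s\left[ W_s(n, \vartheta) \right] = P_s \tag{4.25} \]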
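The marginal properties can be checked numerically on one realization (the ensemble average is dropped). The sketch below evaluates definition (4.19) with time indices taken modulo N; the circular indexing, the signal length and the function name are assumptions of the example.

```python
import numpy as np

def wvt(s):
    """W_s(n, theta) of Eq. 4.19 for one realization, with circular time indexing."""
    N = len(s)
    # products[tau, n] = conj(s[n - tau]) * s[n]
    products = np.array([np.conj(np.roll(s, tau)) * s for tau in range(N)])
    # DFT over the lag axis, then reorder the result as W[n, theta]
    return np.fft.fft(products, axis=0).T / N

rng = np.random.default_rng(2)
N = 32
s = rng.standard_normal(N)
W = wvt(s)

time_marginal = W.sum(axis=1)    # sum over frequency, compare with |s(n)|^2       (4.23)
freq_marginal = W.sum(axis=0)    # sum over time, compare with |s_hat(theta)|^2/N  (4.24)
total_power = W.sum() / N        # compare with the signal power P_s               (4.25)
```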
It is important to note that Es [Ws (n, ϑ)] is but an indicator of the signal power density.
In fact, it cannot be interpreted in a point-wise sense because of the uncertainty principle,
according to which the time and frequency power densities cannot both be made arbitrarily
narrow1 . In addition, Es [Ws (n, ϑ)] can be negative in some regions of the TF plane [30].
Since the WVT represents the signal in the TF plane, we can generalize the time (4.2)
and frequency (4.10) autocorrelation functions and define the signal TF autocorrelation as:
\[ R_s(n, \tau, \vartheta, \upsilon) = \frac{1}{N} E_s\left[ W_s^*(n - \tau, \vartheta - \upsilon)\, W_s(n, \vartheta) \right] \tag{4.26} \]
where n and ϑ are the time and frequency at which the TF correlation is computed, and τ
and υ are the time and frequency lags respectively.
Ambiguity function
Whereas the WVT seeks to combine power analysis in time and frequency, the fundamental
correlation-based TFR, namely the ambiguity function (AF), seeks to combine time and
frequency correlation as embodied by definitions (4.2), (4.10), and (4.26). The AF is
defined as the Fourier transform of the product s∗ (n − τ )s(n) with respect to time:
\[
A_s(\tau,\upsilon) = \frac{1}{N}\sum_{n=0}^{N-1} s^{*}(n-\tau)\,s(n)\exp\left(-j\frac{2\pi n\upsilon}{N}\right) \tag{4.27}
\]
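The frequency version of the AF is obtained by replacing s(n) by (4.7) in the above definition.
\[
A_s(\tau,\upsilon) = \frac{1}{N^{2}}\sum_{\vartheta=0}^{N-1} \hat{s}^{*}(\vartheta-\upsilon)\,\hat{s}(\vartheta)\exp\left(j\frac{2\pi\tau\vartheta}{N}\right) \tag{4.28}
\]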
¹ The uncertainty principle and its implications are detailed in [30].
56 Chapter 4. Feature extraction
By taking ensemble averages on both sides in (4.27) and (4.28), and using defini-
tions (4.2) and (4.10), we obtain the expected AF of s:
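\[
E_s[A_s(\tau,\upsilon)] = \frac{1}{N}\sum_{n=0}^{N-1} R_s(n,\tau)\exp\left(-j\frac{2\pi n\upsilon}{N}\right) \tag{4.29}
\]
\[
\phantom{E_s[A_s(\tau,\upsilon)]} = \frac{1}{N}\sum_{\vartheta=0}^{N-1} R_s(\vartheta,\upsilon)\exp\left(j\frac{2\pi\tau\vartheta}{N}\right)
\]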
The expected AF satisfies the marginal properties (4.30) and (4.31), i.e. the sum over
the frequency lag gives the time autocorrelation computed at time n = 0 and the sum over
the time lag gives the frequency autocorrelation function computed at frequency ϑ = 0.
\[
\sum_{\upsilon=0}^{N-1} E_s[A_s(\tau,\upsilon)] = R_s(0,\tau) \tag{4.30}
\]
\[
\sum_{\tau=0}^{N-1} E_s[A_s(\tau,\upsilon)] = R_s(0,\upsilon) \tag{4.31}
\]
The expected square modulus of the AF, $E_s\left[|A_s(\tau,\upsilon)|^{2}\right]$, is an indicator of the global
TF correlation for all the TF points separated in time by τ and in frequency by υ. Indeed,
by taking the sum over n and ϑ of the TF autocorrelation definition (4.26) and using the
WVT (4.19) and AF (4.27) definitions, we have:
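\[
\sum_{n=0}^{N-1}\sum_{\vartheta=0}^{N-1} R_s(n,\tau,\vartheta,\upsilon) = \frac{1}{N}\sum_{n=0}^{N-1}\sum_{\vartheta=0}^{N-1} E_s\left[W_s(n,\vartheta)\,W_s^{*}(n-\tau,\vartheta-\upsilon)\right]
\]
\[
= \frac{1}{N^{3}}E_s\left[\sum_{\tau_1,\tau_2,n,\vartheta=0}^{N-1} s^{*}(n-\tau_1)\,s(n)\,s(n-\tau-\tau_2)\,s^{*}(n-\tau)\exp\left(j\frac{2\pi(\tau_2\vartheta-\tau_1\vartheta-\tau_2\upsilon)}{N}\right)\right]
\]
\[
\sum_{n=0}^{N-1}\sum_{\vartheta=0}^{N-1} R_s(n,\tau,\vartheta,\upsilon) = E_s\left[|A_s(\tau,\upsilon)|^{2}\right] \tag{4.32}
\]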
This result constitutes a global indicator of the interaction in the TF plane. Its generalization
to the analysis of a multivariate signal makes it possible to characterize the interaction
between its univariate components.
We define the time and frequency inter-correlation functions of sm1 and sm2 as¹:
As in the univariate TF analysis, we call $\frac{1}{N}E_S\left[\hat{s}^{*}_{m1}(\vartheta)\,\hat{s}_{m2}(\vartheta)\right]$ the power inter-spectrum
density of sm1 and sm2 . In fact, this result generalizes the signal cross-spectrum definition [135].
Similarly to the time and frequency inter-correlation functions, the inter-WVT and inter-AF
of sm1 and sm2 can be respectively defined as:
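\[
W_{m1,m2}(n,\vartheta) = \frac{1}{N}\sum_{\tau=0}^{N-1} s^{*}_{m1}(n-\tau)\,s_{m2}(n)\exp\left(-j\frac{2\pi\tau\vartheta}{N}\right) \tag{4.36}
\]
\[
A_{m1,m2}(\tau,\upsilon) = \frac{1}{N}\sum_{n=0}^{N-1} s^{*}_{m1}(n-\tau)\,s_{m2}(n)\exp\left(-j\frac{2\pi n\upsilon}{N}\right) \tag{4.37}
\]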
The TF inter-correlation function of sm1 and sm2 at time n and frequency ϑ for a time
lag τ and frequency lag υ is:
\[
R_{m1,m2}(n,\tau,\vartheta,\upsilon) = \frac{1}{N}E_S\left[W^{*}_{m1,m2}(n-\tau,\vartheta-\upsilon)\,W_{m1,m2}(n,\vartheta)\right] \tag{4.38}
\]
The global TF inter-correlation between sm1 and sm2 at time lag τ and frequency lag
υ is given by the sum over time and frequency of the TF inter-correlation function (4.38).
Using (4.36) and (4.37), we obtain:
\[
\sum_{n=0}^{N-1}\sum_{\vartheta=0}^{N-1} R_{m1,m2}(n,\tau,\vartheta,\upsilon) = E_S\left[|A_{m1,m2}(\tau,\upsilon)|^{2}\right] \tag{4.39}
\]
¹ We use the prefix inter in a general sense. When m1 = m2 , this prefix is usually replaced by intra.
In consequence, the expectation of the squared modulus of the inter-AF of sm1 and sm2 gives
an indication of the global TF interaction between these two signals.
It should be noted that:
\[
E_S\left[|A_{m1,m2}(\tau,\upsilon)|^{2}\right] = E_S\left[|A_{m2,m1}(\tau,\upsilon)|^{2}\right]
\]
4.2.3 Stationarity
Stationarity of S implies that its statistical properties do not change with time. However,
this condition is rarely met in practice. As we employ statistical moments up to second
order, we consider a weaker form of stationarity called wide sense stationarity. In the
following we use stationarity to refer to wide sense stationarity. Thus, S is stationary if:
• The inter-correlation function of any pair of univariate components sm1 and sm2
depends only upon the time lag τ for every time n, i.e. Rm1,m2 (n, τ ) = Rm1,m2 (τ ).
In particular, when m1 = m2 = m, one has: Rm (n, τ ) = Rm (τ ) = Rm (−τ ).
Because of the stationarity conditions, the power inter-spectrum density of sm1 and
sm2 (4.35) becomes simply the Fourier transform of the time inter-correlation function:
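\[
\frac{1}{N}E_S\left[\hat{s}^{*}_{m1}(\vartheta)\,\hat{s}_{m2}(\vartheta)\right] = \sum_{\tau=0}^{N-1} R_{m1,m2}(\tau)\exp\left(-j\frac{2\pi\tau\vartheta}{N}\right) \tag{4.41}
\]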
1
For convenience we assume that N is even
and the expected inter-WVT depends only upon the frequency. Indeed, by taking ensemble
averages on both sides in (4.36) we have:
\[
E_S\left[W_{m1,m2}(n,\vartheta)\right] = \frac{1}{N}\sum_{\tau=0}^{N-1} R_{m1,m2}(\tau)\exp\left(-j\frac{2\pi\tau\vartheta}{N}\right) \tag{4.42}
\]
This result implies that the spectral properties of S do not change over time. Furthermore,
by taking ensemble averages on both sides of (4.37), we obtain:
4.2.4 Ergodicity
For a stationary signal S, ergodicity implies that ensemble averages can be replaced by
time averages. Thus, the stationary time inter-correlation function, under the hypothesis
of ergodicity becomes:
\[
R_{m1,m2}(\tau) = \frac{1}{N}\sum_{n=0}^{N-1} s_{m1}(n-\tau)\,s_{m2}(n) \tag{4.44}
\]
Replacing this result in the stationary power inter-spectrum density (4.41) we obtain:
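\[
\frac{1}{N}E_S\left[\hat{s}^{*}_{m1}(\vartheta)\,\hat{s}_{m2}(\vartheta)\right] = \frac{1}{N}\hat{s}^{*}_{m1}(\vartheta)\,\hat{s}_{m2}(\vartheta) \tag{4.45}
\]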
where ŝm1 and ŝm2 are the Fourier transforms of sm1 and sm2 respectively.
If m1 = m2 = m then $\frac{1}{N}|\hat{s}_m(\vartheta)|^{2}$ represents the PSD of sm . Therefore, according
to (4.12) and (4.45), the power of sm is:
\[
P_{s_m} = \frac{1}{N^{2}}\sum_{\vartheta=0}^{N-1} |\hat{s}_m(\vartheta)|^{2} \tag{4.46}
\]
Since the sampling frequency is at least twice the maximum frequency present in the
spectrum of sm , i.e. the sampling frequency is chosen in accordance with the sampling
theorem [119, 150], the second term on the right in (4.47) is close to zero. Therefore, the
following approximation holds.
\[
P_{s_m} \approx \frac{2}{N^{2}}\sum_{\vartheta=0}^{N/2} |\hat{s}_m(\vartheta)|^{2} \tag{4.48}
\]
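Using this approximation and the correspondence between ϑ and the real frequency (4.5),
we can approximate the power contained in a frequency band $B_f = [f_1; f_2] \subset \left[0; \frac{f_s}{2}\right]$ as
follows:
\[
\tilde{P}_{s_m}(B_f) = \frac{2}{N^{2}}\sum_{\vartheta=\vartheta_1}^{\vartheta_2} |\hat{s}_m(\vartheta)|^{2} \tag{4.49}
\]
where:
\[
\vartheta_i = \operatorname{nint}\left(\frac{N f_i}{f_s}\right), \quad i = 1, 2
\]
The function nint(·) gives the nearest integer to its argument. In particular, it can be
said that $\frac{2}{N^{2}}\left|\hat{s}_m\left(\operatorname{nint}\left(\frac{N f}{f_s}\right)\right)\right|^{2}$ represents the power contained in an $\frac{f_s}{N}$ wide band centered
at f.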
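As an illustration of (4.49), the band power of a single realization can be computed directly from its DFT. This is a minimal sketch, assuming an unnormalized forward DFT for ŝ; the function name `band_power` and the test signal are hypothetical, and `np.rint` (which rounds exact halves to the nearest even integer) stands in for nint(·).

```python
import numpy as np

def band_power(s, fs, f1, f2):
    """Approximate power of s in the band [f1, f2] following (4.49)."""
    N = len(s)
    s_hat = np.fft.fft(s)                  # unnormalized DFT (assumption)
    k1 = int(np.rint(N * f1 / fs))         # lower frequency index
    k2 = int(np.rint(N * f2 / fs))         # upper frequency index
    return 2.0 / N**2 * np.sum(np.abs(s_hat[k1:k2 + 1]) ** 2)

fs = 128.0                                 # hypothetical sampling frequency [Hz]
t = np.arange(256) / fs                    # two seconds of signal
s = 3.0 * np.sin(2 * np.pi * 10.0 * t)     # 10 Hz sine with amplitude 3

p_alpha = band_power(s, fs, 8.0, 13.0)     # close to 4.5 (= 3**2 / 2)
p_beta = band_power(s, fs, 20.0, 30.0)     # close to zero
```

The factor 2 accounts for the negative-frequency half of the spectrum, which is why only indices up to N/2 are needed.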
Ensemble averages allowed us to develop the TF analysis framework theoretically. In
practice, such averages are difficult to compute because one has no access to the
signal’s generative mechanism. Under the hypothesis of ergodicity, this problem is
overcome in the framework of stationary signals.
The stationarity and ergodicity hypotheses are used in the stationary PSD (Section 4.3),
coherence (Section 4.4), autoregressive (Section 4.5), and multivariate autoregressive
(Section 4.7) mappings.
\[
S(n) = -\sum_{i=1}^{Q} A(n,i)\,S(n-i) + e(n) \tag{4.50}
\]
where Q is the model order, the A(n, i) are N × N matrices and e(n) (the prediction error)
is an N -dimensional zero mean random vector with covariance matrix Ce (n). Thus, S is
completely determined by the parameters of the model.
It can be shown [22] that if S is stationary and ergodic, the matrices A(n, i) are time
independent, i.e. ∀n, A(n, i) = A(i). In this case the linear prediction model is called a
stationary autoregressive model (AR model). On the other hand, if S is non-stationary,
the matrices A(n, i) are time dependent. In this case, the linear prediction model is called
a non-stationary autoregressive model (NAR model).
The hypothesis of existence of a linear prediction model is used in the autoregressive
(Section 4.5), non-stationary autoregressive (Section 4.6), and the multivariate autoregres-
sive mappings (Section 4.7).
where ϕ1 and ϕ2 are the phases of the coupled oscillators, and ε is a small value [131]. As
the oscillators considered come from the same physiological system, only synchronization
of order 1 : 1 is considered [16].
Thus under the weak coupling hypothesis, the analysis of the interaction between the
univariate components of S focuses on the computation of the degree of synchronization
between them. Since phase locking implies frequency locking, synchronization should be
determined in narrow frequency bands. This hypothesis is used in the synchronization
mapping in Section 4.8.
So far we have presented the theoretical elements to analyze a multivariate stochastic
signal S in time and frequency. In addition, we established hypotheses on the nature of S
that make it possible to simplify the analysis. As mentioned in Section 4.1, in the framework
1
A self-sustained oscillator is an active system that contains an internal source of energy that is trans-
formed into oscillatory activity which is entirely determined by the oscillator internal parameters. Neuronal
oscillators are good examples of self-sustained oscillators
where Ne and Nspt are the number of electrodes and number of samples per trial respectively,
is given (for simplicity, we assume that Nspt is even). In addition, as mentioned in Chapter 3
the averages over time and electrode are both equal to zero. This implies:
\[
\begin{aligned}
\sum_{n=0}^{N_{spt}-1} s_m(n) &= 0, & m &= 1,\ldots,N_e \\
\sum_{m=1}^{N_e} s_m(n) &= 0, & n &= 0,\ldots,N_{spt}-1
\end{aligned} \tag{4.52}
\]
The Welch estimate of the PSD of sm , denoted as Wsm (ϑ), is the average of the PSDs
of the windowed blocks. Using (4.46), we have:
\[
W_{s_m}(\vartheta) = \frac{1}{N_\beta N}\sum_{\beta=1}^{N_\beta} |\hat{s}_\beta(\vartheta)|^{2} \tag{4.53}
\]
where ŝβ (ϑ) is the Fourier transform of the β-th windowed block.
Finally, using (4.49) the power in the frequency band Bi = [fi,1 ; fi,2 ] is:
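\[
P_m(B_i) = \frac{2}{N}\sum_{\vartheta=\vartheta_{i,1}}^{\vartheta_{i,2}} W_{s_m}(\vartheta) = \frac{2}{N_\beta N^{2}}\sum_{\vartheta=\vartheta_{i,1}}^{\vartheta_{i,2}}\sum_{\beta=1}^{N_\beta} |\hat{s}_\beta(\vartheta)|^{2} \tag{4.54}
\]
where:
\[
\vartheta_{i,l} = \operatorname{nint}\left(\frac{N f_{i,l}}{f_s}\right), \quad l = 1, 2
\]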
Variants of this mapping, e.g. taking different frequency bands and subsets of electrodes
(following physiological considerations) are used in numerous current BCIs [106, 128, 132,
176].
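The Welch estimate (4.53) and the band power (4.54) can be sketched as follows. This is only an illustration under stated assumptions: the block length, overlap and test signal are hypothetical, the Hamming window is applied without any compensation for the power it removes (as in the formulas above), and the function names are not taken from the thesis.

```python
import numpy as np

def welch_psd(s, block_len, overlap):
    """Average of the block periodograms, as in (4.53)."""
    step = block_len - overlap
    window = np.hamming(block_len)
    starts = range(0, len(s) - block_len + 1, step)
    blocks = np.array([s[i:i + block_len] * window for i in starts])
    periodograms = np.abs(np.fft.fft(blocks, axis=1)) ** 2
    return periodograms.mean(axis=0) / block_len

def welch_band_power(s, fs, block_len, overlap, f1, f2):
    """Band power from the Welch estimate, as in (4.54)."""
    psd = welch_psd(s, block_len, overlap)
    k1 = int(np.rint(block_len * f1 / fs))
    k2 = int(np.rint(block_len * f2 / fs))
    return 2.0 / block_len * psd[k1:k2 + 1].sum()

fs = 128.0                                    # hypothetical sampling frequency
t = np.arange(256) / fs                       # two-second trial
s = np.sin(2 * np.pi * 10.0 * t)              # 10 Hz test oscillation

psd = welch_psd(s, block_len=128, overlap=64)  # three half-overlapping blocks
p_alpha = welch_band_power(s, fs, 128, 64, 8.0, 13.0)
p_beta = welch_band_power(s, fs, 128, 64, 20.0, 30.0)
```

With these hypothetical settings the trial yields three blocks, which matches the layout of Fig. 4.2.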
[Figure 4.2 panels: signal at electrode T3 (amplitude in µV versus time in s) with blocks 1 to 3 marked; blocks 1 to 3; windowed blocks 1 to 3 (versus time in s); PSD of blocks 1 to 3 (versus frequency in Hz).]
Figure 4.2. Welch method to estimate the power spectral density (PSD). The signal is segmented into
blocks that can overlap. These blocks are windowed by a Hamming window, and their respective PSDs
are computed and averaged. This average constitutes the estimated PSD. The signal under study
was recorded at electrode T3 while the subject was reading a text on a computer screen.
4.5. Autoregressive mapping 65
where ŝm1,β and ŝm2,β are the Fourier transforms of the β-th blocks of signals sm1 and sm2
respectively (see Fig. 4.3).
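To compute the feature vector ψC (S), NB frequency bands {B1 , . . . , BNB } are chosen,
and the average coherence is computed for each frequency band and each pair of EEG channels.
The results are grouped into an $\frac{N_e(N_e-1)}{2}N_B$ dimensional vector:
\[
\psi_C(S) = \left(\langle C_{1,2}(\vartheta)\rangle_{B_1} \;\ldots\; \langle C_{m1,m2}(\vartheta)\rangle_{B_i} \;\ldots\; \langle C_{N_e-1,N_e}(\vartheta)\rangle_{B_{N_B}}\right)^{t} \in \mathbb{R}^{\frac{N_e(N_e-1)}{2}N_B}, \quad m1 < m2
\]
where $\langle C_{m1,m2}(\vartheta)\rangle_{B_i}$ is the average coherence in the frequency band Bi = [fi,1 ; fi,2 ]. This
average is computed as follows:
\[
\langle C_{m1,m2}(\vartheta)\rangle_{B_i} = \frac{1}{\vartheta_{i,2}-\vartheta_{i,1}}\sum_{\vartheta=\vartheta_{i,1}}^{\vartheta_{i,2}} C_{m1,m2}(\vartheta) \tag{4.57}
\]
where
\[
\vartheta_{i,l} = \operatorname{nint}\left(\frac{N f_{i,l}}{f_s}\right), \quad l = 1, 2
\]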
The coherence function is extensively used as a tool for quantifying the degree of inter-
action between two EEG channels in a frequency band. A large value of the average of the
coherence function in a certain frequency band indicates that the corresponding oscillatory
activities are of the same origin or interact with each other [14, 117, 118, 174].
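The band-averaged coherence can be sketched as below. Two assumptions should be kept in mind: the code uses the standard magnitude-squared coherence built from block spectra, which may differ in detail from definition (4.56), and the band average is taken as the mean over the frequency indices in the band, i.e. it divides by the number of indices rather than by ϑi,2 − ϑi,1 as written in (4.57). All function names, the block settings and the test signals are hypothetical.

```python
import numpy as np

def block_spectra(s, block_len, overlap):
    """DFTs of the Hamming-windowed, possibly overlapping blocks of s."""
    step = block_len - overlap
    window = np.hamming(block_len)
    starts = range(0, len(s) - block_len + 1, step)
    return np.array([np.fft.fft(s[i:i + block_len] * window) for i in starts])

def coherence(s1, s2, block_len, overlap):
    """Magnitude-squared coherence estimated from block spectra (assumption)."""
    S1 = block_spectra(s1, block_len, overlap)
    S2 = block_spectra(s2, block_len, overlap)
    cross = np.abs(np.sum(np.conj(S1) * S2, axis=0)) ** 2
    auto = np.sum(np.abs(S1) ** 2, axis=0) * np.sum(np.abs(S2) ** 2, axis=0)
    return cross / auto

def band_average(C, fs, block_len, f1, f2):
    """Mean of the coherence over the frequency indices of the band [f1, f2]."""
    k1 = int(np.rint(block_len * f1 / fs))
    k2 = int(np.rint(block_len * f2 / fs))
    return C[k1:k2 + 1].mean()

fs = 128.0
rng = np.random.default_rng(0)
x = rng.standard_normal(2048)              # hypothetical channel 1
y = rng.standard_normal(2048)              # independent hypothetical channel 2

c_self = band_average(coherence(x, x, 128, 64), fs, 128, 8.0, 13.0)
c_xy = band_average(coherence(x, y, 128, 64), fs, 128, 8.0, 13.0)
```

A signal compared with itself gives a band-averaged coherence of one, whereas two independent noise channels give a much smaller value that shrinks as more blocks are averaged.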
\[
s_m(n) = -\sum_{i=1}^{Q_m} a_m(n,i)\,s_m(n-i) + e_m(n) \tag{4.58}
\]
where the am (n, i) are the AR coefficients and Qm is the AR order corresponding to sm ,
and em is the m-th prediction error process.
Furthermore, as stationarity and ergodicity are assumed, it can be shown [22] that the
AR coefficients are time independent. Thus, the AR model for the m-th channel becomes:
\[
s_m(n) = -\sum_{i=1}^{Q_m} a_m(i)\,s_m(n-i) + e_m(n) \tag{4.59}
\]
[Figure 4.3 panels: EEG signals (amplitude in µV) segmented into blocks 1 to 3; PSDs of the C3 and C4 blocks (versus frequency in Hz); coherence (0 to 1) versus frequency in Hz.]
Figure 4.3. Estimation of the coherence function between two EEG channels. In this example,
the coherence function is estimated by segmenting both signals into blocks of one second duration,
computing the PSDs of each block and using (4.56).
The signals under study were recorded at electrodes C3 (top left) and C4 (top right) while the
subject was imagining the movement of his left index finger.
By replacing the ergodicity condition on the time correlation function (4.44) in (4.61),
we obtain the so called Yule-Walker [170, 182] equations:
\[
\sum_{i'=1}^{Q_m} a_m(i')\,R_{s_m}(i-i') = -R_{s_m}(i), \quad i = 1,\ldots,Q_m \tag{4.62}
\]
The AR coefficients can be efficiently found by solving (4.62) using the recursive Levinson-
Durbin algorithm [42, 97].
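The Levinson-Durbin recursion for the Yule-Walker equations (4.62) can be sketched as follows. This is an illustrative implementation for a real signal, assuming the sign convention of (4.59) and a biased, non-circular time-average estimate of the autocorrelation; the function names and the simulated AR(2) test signal are hypothetical.

```python
import numpy as np

def autocorrelation(s, max_lag):
    """Biased time-average estimate of R(0), ..., R(max_lag) for a real signal."""
    N = len(s)
    return np.array([np.dot(s[k:], s[:N - k]) / N for k in range(max_lag + 1)])

def levinson_durbin(r, order):
    """Solve sum_j a(j) R(i - j) = -R(i), i = 1..order, recursively."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                   # prediction error power
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                           # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:], err

# Simulated AR(2) process s(n) = -a1 s(n-1) - a2 s(n-2) + e(n)
rng = np.random.default_rng(1)
a_true = np.array([-1.5, 0.7])
e = rng.standard_normal(20000)
s = np.zeros_like(e)
for n in range(2, len(e)):
    s[n] = -a_true[0] * s[n - 1] - a_true[1] * s[n - 2] + e[n]

r = autocorrelation(s, 2)
a_hat, err = levinson_durbin(r, 2)
```

The recursion needs on the order of Qm² operations, compared with Qm³ for a direct inversion of the autocorrelation matrix, and it returns the prediction error power of every intermediate order as a by-product.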
Since sm is ergodic, it can be shown [22] that em is an independent and identically
distributed (IID) stochastic process with mean zero and finite variance E(Qm ). Therefore,
the spectrum of em is [22]: $\hat{e}_m(f) = \frac{E(Q_m)}{f_s}$, where fs is the sampling frequency.
Taking the z-transform [135] on both sides in (4.59) yields:
\[
Z_{s_m}(z) = \frac{Z_{e_m}(z)}{1+\sum_{i=1}^{Q_m} a_m(i)\,z^{-i}} \tag{4.63}
\]
where $Z_{s_m}(z)$ and $Z_{e_m}(z)$ are the z-transforms of sm and em respectively. The spectrum of
sm is obtained by evaluating (4.63) along the unit circle in the z-plane, i.e. $z = \exp\left(j\frac{2\pi f}{f_s}\right)$:
\[
\hat{s}_m(f) = \frac{E(Q_m)}{f_s\left(1+\sum_{i=1}^{Q_m} a_m(i)\exp\left(-j\frac{2\pi f i}{f_s}\right)\right)} = \frac{E(Q_m)}{f_s}H_m(f) \tag{4.64}
\]
Figure 4.4. AR modelling of the m-th channel as an all-pole filter. The current output sm (n)
depends on the Qm most recent outputs, sm (n − 1), . . . , sm (n − Qm ) and the current input, em (n).
where ηQm is a penalization factor. Since the penalization factor associated with the MDL
criterion is the largest, this criterion gives the smallest AR order (Fig. 4.5). In practice, the
4.6. Non-stationary autoregressive mapping 69
MDL criterion is generally preferred [9]. In the framework of linear prediction, the MDL
criterion takes the general form:
\[
\log(\text{error}) + \text{number of parameters} \times \frac{\log(\text{number of samples})}{\text{number of samples}}
\]
The feature vector ψAR (S) is composed of the AR coefficients associated with each channel:
\[
\psi_{AR}(S) = \left(a_1(1) \;\ldots\; a_1(Q_1) \;\ldots\; a_{N_e}(1) \;\ldots\; a_{N_e}(Q_{N_e})\right)^{t} \in \mathbb{R}^{\sum_m Q_m}
\]
where am (i) is the i-th AR coefficient and Qm is the AR order associated with the m-th
channel.
The AR mapping does not require the selection of a set of frequency bands and can lead
to feature vectors whose dimensionality is smaller than that of the previous mappings.
However, in BCI applications, the AR mapping has the drawback that a direct connection
between the AR coefficients and the power in a given frequency band is not evident [167].
Instead, this power is an intricate non-linear function of the AR coefficients. This makes
it difficult to explain the physiological mechanism that the subject actually uses to
control an AR coefficient based BCI [175].
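The selection of the AR order with the MDL criterion discussed in this section can be sketched as follows. The sketch assumes the general MDL form log(error) + number of parameters × log(number of samples) / number of samples, solves the Yule-Walker equations (4.62) with a generic linear solver instead of the Levinson-Durbin recursion, and uses a simulated AR(2) signal; the function name `mdl_curve` is hypothetical.

```python
import numpy as np

def mdl_curve(s, max_order):
    """MDL score for AR orders 1..max_order of a real signal s."""
    N = len(s)
    r = np.array([np.dot(s[k:], s[:N - k]) / N for k in range(max_order + 1)])
    scores = []
    for Q in range(1, max_order + 1):
        lags = np.abs(np.subtract.outer(np.arange(Q), np.arange(Q)))
        a = np.linalg.solve(r[lags], -r[1:Q + 1])    # Yule-Walker equations
        err = r[0] + np.dot(a, r[1:Q + 1])           # prediction error power
        scores.append(np.log(err) + Q * np.log(N) / N)
    return np.array(scores)

# Simulated AR(2) process used as a stand-in for one EEG channel
rng = np.random.default_rng(2)
e = rng.standard_normal(5000)
s = np.zeros_like(e)
for n in range(2, len(e)):
    s[n] = 1.5 * s[n - 1] - 0.7 * s[n - 2] + e[n]

scores = mdl_curve(s, max_order=10)
best_order = int(np.argmin(scores)) + 1
```

The log-error term keeps decreasing with the order, so the minimum of the curve is produced by the penalty term, which grows linearly with the number of coefficients.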
\[
s_m(n) = -\sum_{i=1}^{Q_m}\sum_{u=-U_m}^{U_m} \tilde{a}_m(i,u)\exp\left(-j\frac{2\pi u n}{N_{spt}}\right)s_m(n-i) + e_m(n) \tag{4.69}
\]
where Um is the spectral order associated with sm . Clearly, this equation is equivalent
to (4.68) with:
\[
a_m(n,i) = \sum_{u=-U_m}^{U_m} \tilde{a}_m(i,u)\exp\left(-j\frac{2\pi u n}{N_{spt}}\right) \tag{4.70}
\]
[Figure 4.5 plots: axes Time [s], Frequency [Hz], and AR order.]
Figure 4.5. Autoregressive estimation of the PSD. Top left: signal under study (the same as the
one in Fig. 4.2). Top right: estimated PSD computed using the Welch method (Section 4.3). Bottom
left: PSD approximations for different AR orders. As can be seen, an AR order of 2 over-smooths
the PSD, whereas for AR orders equal to 5 and 10 the PSD is relatively well approximated.
Bottom right: logarithm of the prediction error power and the three AR order selection criteria,
i.e. final prediction error (FPE), Akaike information (AIC) and minimum description length (MDL).
The prediction error decreases slowly as the AR order increases, so this parameter,
considered alone, is not suitable for choosing the AR order. Because too large values of the
AR order are penalized, the order selection criteria present an optimum, which is more evident
in the MDL criterion as it has the largest penalization factor (see Equations 4.65 to 4.67).
Figure 4.6. Block diagram of the non-stationary autoregressive model of the m-th channel. The
current output sm (n) depends on frequency shifted versions of the Qm most recent outputs,
sm (n − 1), . . . , sm (n − Qm ), and on the current input em (n). The frequency shifts are
introduced via the products by $\exp\left(\pm j\frac{2\pi n}{N_{spt}}\right)$.
The coefficients ãm (i, u) (NAR coefficients) can be determined by minimizing the pre-
diction error power:
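\[
E(Q_m,U_m) = \frac{1}{N_{spt}}\sum_{n=0}^{N_{spt}-1} e_m^{2}(n) \tag{4.71}
\]
\[
\phantom{E(Q_m,U_m)} = \frac{1}{N_{spt}}\sum_{n=0}^{N_{spt}-1}\left(s_m(n) + \sum_{i=1}^{Q_m}\sum_{u=-U_m}^{U_m} \tilde{a}_m(i,u)\exp\left(-j\frac{2\pi u n}{N_{spt}}\right)s_m(n-i)\right)^{2}
\]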
Taking the derivatives of E(Qm , Um ) with respect to the NAR coefficients and setting
them to zero, yields:
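\[
\sum_{i'=1}^{Q_m}\sum_{u'=-U_m}^{U_m} \tilde{a}_m(i',u')\,\frac{1}{N_{spt}}\sum_{n=0}^{N_{spt}-1} s_m(n-i')\,s_m(n-i)\exp\left(-j\frac{2\pi(u+u')n}{N_{spt}}\right)
\]
\[
+\,\frac{1}{N_{spt}}\sum_{n=0}^{N_{spt}-1} s_m(n)\,s_m(n-i)\exp\left(-j\frac{2\pi u n}{N_{spt}}\right) = 0 \quad\text{for}\quad \begin{cases} 1 \le i \le Q_m \\ -U_m \le u \le U_m \end{cases}
\]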
Taking expectations on both sides in the above equation and using the definition of
expected ambiguity function (4.29) yields:
\[
\sum_{i'=1}^{Q_m}\sum_{u'=-U_m}^{U_m} \tilde{a}_m(i',u')\,E_{s_m}\left[A_{s_m}(i-i',u-u')\right] = -E_{s_m}\left[A_{s_m}(i,u)\right] \tag{4.72}
\]
This set of linear equations generalizes the Yule-Walker equations (4.62) to the non-stationary
case. The total number of NAR coefficients is equal to Qm (2Um + 1). For slowly time-
varying am (n, i), a small Um suffices to characterize the frequency shifts [77].
The model and spectral orders Qm and Um can be selected similarly to the AR order in
Section 4.5. Namely, choosing those values that make the MDL criterion (4.67) minimum:
\[
(Q_m, U_m) = \operatorname*{argmin}_{Q_m^{*},\,U_m^{*}}\left[\log\left(E(Q_m^{*},U_m^{*})\right) + Q_m^{*}(2U_m^{*}+1)\frac{\log(N_{spt})}{N_{spt}}\right] \tag{4.73}
\]
The estimation of $E_{s_m}[A_{s_m}(i,u)]$ is obtained by segmenting sm into Nβ (possibly) overlapping
N -length blocks and computing the average of the blocks' ambiguity functions:
\[
E_{s_m}\left[A_{s_m}(i,u)\right] = \frac{1}{N_\beta}\sum_{\beta=1}^{N_\beta}\frac{1}{N}\sum_{n=0}^{N-1} s_{m,\beta}(n-i)\,s_{m,\beta}(n)\exp\left(-j\frac{2\pi n u}{N}\right) \tag{4.74}
\]
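The block-averaged estimate (4.74) can be sketched as follows for a real signal. The sketch assumes circular (modulo N) indexing inside each block, which the text does not specify; the function name, the block settings and the test signal are hypothetical.

```python
import numpy as np

def expected_ambiguity(s, block_len, overlap, max_lag, max_shift):
    """Average of the blocks' ambiguity functions, as in (4.74).

    Rows index the time lag i = 0..max_lag, columns the frequency
    lag u = -max_shift..max_shift.
    """
    step = block_len - overlap
    n = np.arange(block_len)
    starts = list(range(0, len(s) - block_len + 1, step))
    A = np.zeros((max_lag + 1, 2 * max_shift + 1), dtype=complex)
    for start in starts:
        block = s[start:start + block_len]
        for i in range(max_lag + 1):
            prod = np.roll(block, i) * block          # s(n - i) s(n), circular
            for col, u in enumerate(range(-max_shift, max_shift + 1)):
                kernel = np.exp(-2j * np.pi * n * u / block_len)
                A[i, col] += np.sum(prod * kernel) / block_len
    return A / len(starts)

rng = np.random.default_rng(4)
s = rng.standard_normal(512)                          # hypothetical channel
A_hat = expected_ambiguity(s, block_len=128, overlap=64, max_lag=3, max_shift=2)
```

The entries of `A_hat` are the quantities that enter the linear system (4.72); for a real signal the columns for u and −u are complex conjugates of each other.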
Since for EEG signals the parameters am (n, i) of the general linear prediction
model (4.68) change slowly in time [129, 146], the spectral orders U1 , . . . , UNe are relatively
small (up to three, see Chapter 6). This makes the NAR mapping particularly well suited for BCI
applications. However, like the AR mapping (Section 4.5), the physiological interpretation
of the NAR coefficients is difficult since there is no direct link between them and the
observed signals.
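Taking the derivatives of E(Q) with respect to the elements of the matrices A(i) yields the
multivariate Yule-Walker equations [85, 134]:
\[
-\begin{pmatrix} K(1) & \ldots & K(Q) \end{pmatrix} = \begin{pmatrix} A(1) & \ldots & A(Q) \end{pmatrix}\tilde{K} \tag{4.77}
\]
where $K(\tau) = \sum_{n=0}^{N_{spt}-1} S(n-\tau)\,S^{t}(n)$ and
\[
\tilde{K} = \begin{pmatrix}
K(0) & K(1) & \cdots & K(Q-1) \\
K^{t}(1) & K(0) & \cdots & K(Q-2) \\
\vdots & \vdots & & \vdots \\
K^{t}(Q-1) & K^{t}(Q-2) & \cdots & K(0)
\end{pmatrix}
\]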
The matrices A(1), . . . , A(Q) can be found by inverting the QNe × QNe matrix K̃
in (4.77). However, (4.77) can be more efficiently solved by applying a generalized version
of the Levinson recursions [173].
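The frequency domain form of the MVAR model is obtained by taking the z-transform
on both sides in (4.75) and evaluating it on the unit circle in the z-plane, i.e. at
$z = \exp\left(j\frac{2\pi f}{f_s}\right)$, where fs is the sampling frequency. Thus, the MVAR frequency-domain model
is:
\[
\begin{pmatrix} \hat{s}_1(f) \\ \vdots \\ \hat{s}_{N_e}(f) \end{pmatrix} =
\begin{pmatrix} H_{1,1}(f) & \cdots & H_{1,N_e}(f) \\ \vdots & & \vdots \\ H_{N_e,1}(f) & \cdots & H_{N_e,N_e}(f) \end{pmatrix}
\begin{pmatrix} \hat{e}_1(f) \\ \vdots \\ \hat{e}_{N_e}(f) \end{pmatrix} \tag{4.78}
\]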
where ŝm (f) is the spectrum of the m-th channel and êm (f) is the spectrum of the m-th
prediction error. The ê1 (f), . . . , êNe (f) can be thought of as the input spectra which are
filtered by the transfer functions Hm1,m2 (f) to produce the outputs ŝ1 (f), . . . , ŝNe (f) (see
Fig. 4.7). Since Hm1,m2 (f) is different from Hm2,m1 (f), the transfer function Hm1,m2 (f) is a
sort of "directed" intra-spectrum from the m2-th channel to the m1-th one [145].
The model order Q determines the shape of the transfer functions Hm1,m2 (f). In fact,
higher orders imply more peaks in the transfer functions (see Fig. 4.8). To determine the
optimal order Q we use the MDL criterion (Section 4.5). Thus, the optimal Q is selected
as:
\[
Q = \operatorname*{argmin}_{Q^{*}}\left[\log\left(E(Q^{*})\right) + Q^{*}N_e\frac{\log(N_{spt}N_e)}{N_{spt}}\right] \tag{4.79}
\]
The feature vector ψMVAR is composed of the elements in matrices A(1), . . . , A(Q):
\[
\psi_{MVAR}(S) = \left(\ddot{A}(1) \;\ldots\; \ddot{A}(Q)\right)^{t} \in \mathbb{R}^{Q N_e^{2}}
\]
where the notation Ä indicates that the elements in A are taken column-wise and rearranged
in a single row.
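The construction of the MVAR feature vector can be sketched as follows. Note the swapped-in technique: instead of the multivariate Yule-Walker equations (4.77) or the generalized Levinson recursions, this sketch estimates the matrices A(i) by ordinary least squares, which minimizes the same prediction error power over the available samples. The function name `fit_mvar` and the simulated two-channel signal are hypothetical.

```python
import numpy as np

def fit_mvar(S, Q):
    """Least-squares fit of S(n) = -sum_i A(i) S(n - i) + e(n).

    S has shape (Ne, N). Returns A with shape (Q, Ne, Ne) and the residuals.
    """
    Ne, N = S.shape
    # Regressors: past samples S(n - 1), ..., S(n - Q) stacked row-wise
    X = np.vstack([S[:, Q - i:N - i] for i in range(1, Q + 1)])
    Y = S[:, Q:]
    # Solve Y ~ B X in the least-squares sense, with B = -[A(1) ... A(Q)]
    B = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
    A = -B.reshape(Ne, Q, Ne).transpose(1, 0, 2)
    return A, Y - B @ X

# Simulated two-channel MVAR(1) process
rng = np.random.default_rng(3)
A_true = np.array([[-0.5, 0.2], [0.0, -0.3]])
N = 5000
S = np.zeros((2, N))
noise = rng.standard_normal((2, N))
for n in range(1, N):
    S[:, n] = -A_true @ S[:, n - 1] + noise[:, n]

A_hat, residual = fit_mvar(S, Q=1)
# Feature vector: elements of each A(i) taken column-wise
psi = np.concatenate([A.flatten(order="F") for A in A_hat])
```

The resulting `psi` has Q·Ne² elements, in agreement with the dimensionality stated above.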
The MVAR mapping was used to determine the spreading of brain activity in a defined
frequency band by exploiting the concept of "directed" intra-spectrum [145] that we
mentioned earlier. As in the two previous mappings, a direct connection between physiological
concepts and the MVAR coefficients does not exist. However, the representation of the
transfer functions as in Fig. 4.8 allows us to evaluate the interaction between channels at
different frequencies, which appears to be non-symmetrical.
where Hs is the Hilbert transform¹ of s. The analytic form can be further decomposed
as: s̄(n) = As (n) exp (jϕs (n)), where As (n) is the instantaneous amplitude and ϕs (n) the
instantaneous phase of s.
The degree of phase locking between sm1 and sm2 in the frequency band B is given by
¹ The Hilbert transform can be determined using standard methods as presented in [82].
4.8. Synchronization mapping 75
[Figure 4.8 panels: PSD of the signals at electrodes C3 and C4 and power inter-spectrum density (left); transfer functions H11 (f), H12 (f), H21 (f), H22 (f) for Q = 1, 2, 3 (right); horizontal axes: frequency in Hz.]
Figure 4.8. Multivariate autoregressive model for two EEG channels (C3 and C4). On the left, the
power spectral and inter-spectral densities; on the right, the corresponding transfer functions.
As can be seen, the higher Q is, the more peaks appear in the transfer functions. The optimal order
Q is determined using the MDL criterion (4.79).
The signals under study are the same as those in Fig. 4.3.
the modulus of the average of the set of complex relative phases [111]:
\[
\left\{ \exp\left(j\varphi^{(B)}_{s_{m1},s_{m2}}(n)\right) \;\middle|\; n = 0,\ldots,N_{spt}-1 \right\}
\]
One can easily verify that Y(m1, m2, B) varies from zero, when the complex relative
phases are uniformly distributed on the complex unit circle, to one, when the complex
relative phases are all equal.
The feature vector ψY is determined by selecting NB frequency bands {B1 , . . . , BNB },
computing the synchronization for each frequency band and pair of EEG channels, and
grouping the results into an $\frac{N_e(N_e-1)}{2}N_B$ dimensional vector:
\[
\psi_Y(S) = \left(Y(1,2,B_1) \;\ldots\; Y(m1,m2,B_i) \;\ldots\; Y(N_e-1,N_e,B_{N_B})\right)^{t} \in \mathbb{R}^{\frac{N_e(N_e-1)}{2}N_B}
\]
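The synchronization measure described above, i.e. the modulus of the average of the complex relative phases, can be sketched as follows. Several choices here are assumptions rather than the method of the thesis: the band-pass step is a crude FFT mask, the analytic signal is computed with the usual FFT-based construction, and the relative phase is taken as the plain difference of the instantaneous phases (1 : 1 synchronization). All function names and test signals are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic form of a real signal via the FFT-based Hilbert transform."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[N // 2] = 1.0
        h[1:N // 2] = 2.0
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

def band_limit(x, fs, f1, f2):
    """Crude band-pass filter: zero all DFT bins outside [f1, f2] (assumption)."""
    freqs = np.abs(np.fft.fftfreq(len(x), d=1.0 / fs))
    X = np.fft.fft(x)
    X[(freqs < f1) | (freqs > f2)] = 0.0
    return np.fft.ifft(X).real

def synchronization(x1, x2, fs, f1, f2):
    """Modulus of the average of the complex relative phases in [f1, f2]."""
    p1 = np.angle(analytic_signal(band_limit(x1, fs, f1, f2)))
    p2 = np.angle(analytic_signal(band_limit(x2, fs, f1, f2)))
    return np.abs(np.mean(np.exp(1j * (p1 - p2))))

fs = 128.0
t = np.arange(256) / fs
x1 = np.sin(2 * np.pi * 10.0 * t)             # 10 Hz oscillation
x2 = np.sin(2 * np.pi * 10.0 * t + 0.8)       # same frequency, constant phase lag
y_locked = synchronization(x1, x2, fs, 8.0, 13.0)

rng = np.random.default_rng(5)
n1 = rng.standard_normal(2048)
n2 = rng.standard_normal(2048)
y_noise = synchronization(n1, n2, fs, 8.0, 13.0)
```

Two oscillations with a constant phase lag give a value close to one regardless of the size of the lag, whereas independent noise channels give a smaller value.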
4.9 Summary
The characterization of EEG is based on the analysis of the generalized interactions between
the EEG channels. By assuming some hypotheses on the properties of EEG signals, we
derived different mappings from the EEG-trial set into feature spaces whose characteristics
are determined by the hypotheses that define the corresponding mapping. Since a single
mapping appears to be insufficient for the recognition of all the MAs that are used to control
the BCI (see Chapter 6), an optimal association between an MA and a mapping should be
established.
The following hypotheses were used: stationarity and ergodicity, absence of coupling
between the EEG channels, existence of a linear prediction model and weak coupling be-
tween the EEG channels. The way in which these hypotheses are combined to obtain
the stationary PSD, coherence, autoregressive, non-stationary autoregressive, multivariate
autoregressive and synchronization mappings is depicted in Fig. 4.10.
The dimensionality of the feature vector spaces associated to each mapping is reported in
Table 4.1. In general, the mappings built on the hypothesis of existence of a linear prediction
model generate feature vectors with lower dimensionality. Furthermore such mappings do
[Figure 4.9 panels: filtered signals in the 8−13 Hz band; instantaneous phases; relative phase versus time in s; complex relative phases in the complex plane (real axis, imaginary axis).]
Figure 4.9. Estimation of the synchronization between two EEG channels in the alpha band. The
signals are first filtered in the alpha band, then their instantaneous phases and relative phase are
computed. In the lower right panel we represent the complex relative phases in the complex unit
circle. The value of the synchronization determined using (4.81) is: 0.759. The signals under study
are those in Fig. 4.3.
not require the choice of given frequency bands. However, the features that compose them
are not directly connected to specific brain events (e.g. the power in a given frequency band
or morphological signal properties). Thus, in a BCI that relies on general autoregressive
features, it can be difficult to understand the type of physiological mechanisms that are
actually used to control the BCI.
Figure 4.10. Derivation of the mappings from hypotheses on the nature of EEG trials.
Table 4.1. Dimensionality of the feature vectors associated to each mapping. (*) Typical values
for the parameters in the third column are (see Chapter 6): Ne = 16, NB = 10, Qm=1,...,Ne = 2,
Um=1,...,Ne = 2, and Q = 2. Using such values it appears that the coherence and synchronization
mappings produce the highest dimensional feature vectors whereas the autoregressive mapping pro-
duces the lowest dimensional one. Thus, unless a small number of frequency bands is considered,
the mappings built on the hypothesis of existence of a linear prediction model produce the feature
vectors with the smallest number of elements.
Chapter 5. Pattern recognition
“The purpose of models is not to fit the data
but to sharpen the questions”
Samuel Karlin
5.1 Introduction
In the previous chapter we presented different mappings from the EEG-trial set into a
feature vector space X that is suitable for the recognition of a given MA in the controlling
set. We pointed out that the choice of the optimal mapping (which determines X ) depends
on the subject and the mental activity. In this chapter we assume that the choice of
the optimal mapping is done according to an optimality criterion (see Chapter 6) and
concentrate on the recognition process.
Let Ω be the set of all possible EEG-trials and Ωk the set of EEG-trials produced during
the performance of mental activity MAk . The optimal mapping for the recognition of MAk ,
denoted as $\psi^{(k)}$, maps Ω and Ωk into the feature vector space $\mathcal{X}_k$ (induced by $\psi^{(k)}$) and the
target set $X_k$ respectively (see Fig. 5.1). Our goal is to estimate a measure of the likelihood,
denoted as fk (x), that a feature vector $x \in \mathcal{X}_k$ belongs to $X_k$. We call fk (·) the membership
function associated with the mental activity MAk .
As shown in Fig. 5.2, the feature extraction module delivers NMA feature vectors to the
pattern recognition module. These vectors, denoted as $x^{(1)}, \ldots, x^{(N_{MA})}$, are computed by
applying the optimal mappings $\psi^{(1)}, \ldots, \psi^{(N_{MA})}$ to an EEG-trial S ∈ Ω. The pattern recognition
module in turn computes NMA membership values $f_1(x^{(1)}), \ldots, f_{N_{MA}}(x^{(N_{MA})})$, which are grouped
into a vector of memberships $\vec{f}$ that is sent to the action generation module, which decides
on the action that the BCI executes (see Chapter 6).
80 Chapter 5. Pattern recognition
Figure 5.1. The optimal mapping for the recognition of MAk , $\psi^{(k)}$, maps the set of EEG-trials Ω
into a feature vector space $\mathcal{X}_k$. In $\mathcal{X}_k$, a feature vector is characterized with respect to its membership
to the target set $X_k$ (i.e. the set of feature vectors produced during the performance of MAk ).
Each membership function fk is learned in a supervised way, i.e. the exact membership
of a given set (the training set) of feature vectors belonging to $\mathcal{X}_k$ is known and fk is
estimated so as to minimize the discrepancy between the memberships computed by fk and
the real ones. Note that the exact membership of an element in the training set can only
take two values: it either belongs to the target set or it does not. In contrast, fk (·) takes
real values, i.e. different degrees of membership exist.
The shape of target sets can change over time as a consequence of environmental factors
or the subject’s state of mind (fatigue, stress, etc. [29]). Moreover, as the subject acquires
more experience in using the BCI his brain dynamics may exhibit some changes resulting
from his adaptation to the BCI [36]. Such adaptation induces changes on the target sets.
Thus, a static learning approach, in which the membership functions remain constant is
clearly suboptimal. Instead, they need to be continuously adapted according to a dynamical
learning strategy in which they are updated as new training data become available while
progressively forgetting the contribution of old data.
In the following we present the methods to learn and dynamically update the mem-
bership functions. These methods are based on the statistical learning theory [166], kernel
methods [147] and support vector machine learning algorithms [23]. Instead of introducing
the support vector machine learning concepts using the classical large margin classifier ap-
proach [23, 113, 147, 165] we focus on the concept of loss and risk to derive the learning
and dynamical updating algorithms from the same framework.
5.2. Membership functions 81
Figure 5.2. The feature extraction module delivers to the pattern recognition one, NMA feature
vectors: x(1) , . . . , x(NMA ) , where x(k) = ψ (k) (S), S is the current EEG-trial, and ψ (k) is the optimal
mapping for the recognition of mental activity MAk .
The pattern recognition module computes NMA membership values $f_1(x^{(1)}), \ldots, f_{N_{MA}}(x^{(N_{MA})})$,
which are grouped into a vector of memberships $\vec{f}$ that is sent to the action generation module,
which decides on the action that the BCI executes.
The membership functions can be thought of as comparison models for the mental activities that
are used to control the BCI. Such models are subject dependent and continuously updated.
where x(k) ∈ Xk and fk (x(k) ) is the membership function associated with MAk . The ideal
fk (i.e. the error free membership function) is such that:
\[
\begin{cases}
f_k(x^{(k)}) + b_k \ge \rho_k & \text{if } x^{(k)} \in X_k \\
f_k(x^{(k)}) + b_k \le -\rho_k & \text{if } x^{(k)} \notin X_k
\end{cases} \tag{5.1}
\]
where ρk > 0 and bk ∈ R are the threshold and the offset of fk (·) respectively (see Fig. 5.3).
We call fk , ρk , and bk the membership parameters associated with MAk whose estimation
from observed data is the object of next section.
Figure 5.3. Distribution of the ideal membership function fk with respect to its target set Xk .
According to (5.1), the membership values of the feature vectors in Xk are located to the right of
ρk − bk , and those of feature vectors not belonging to Xk are located to the left of −ρk − bk .
\[
\zeta_k(x^{(k)}) = \frac{f_k(x^{(k)}) + b_k}{\rho_k} \tag{5.2}
\]
one can easily verify that $\zeta_k(x^{(k)}) \ge 1$ if $x^{(k)} \in X_k$, and $\zeta_k(x^{(k)}) \le -1$ if $x^{(k)} \notin X_k$. In this
chapter, as we seek to determine the membership parameters, we consider the form in (5.1).
where the membership value (or label) $y_l^{(k)}$ of $x_l^{(k)}$ is defined as:
\[
y_l^{(k)} = \begin{cases} +1 & \text{if } x_l^{(k)} \in X_k \\ -1 & \text{otherwise} \end{cases}
\]
We assume that the training set was independently drawn from a probability density
function pk (x(k) , y (k) ).
From the definition of the ideal membership function (5.1) and Fig. 5.3 it follows that
the ideal distribution of the product $y^{(k)} (f_k(x^{(k)}) + b_k)$ (we call it the product-distribution)
should be concentrated to the right of ρk. However, the product-distributions corresponding to
estimates of the membership parameters can spread to the left of ρk. Thus, the quality of an
5.3. Estimation of the membership parameters 83
estimation is characterized by the deviation of its product-distribution from the ideal one.
Such deviation is given by the risk functional presented in Section 5.3.2.
Henceforth, we adopt the following notation conventions. First, in order to simplify
the notation we remove the index k from every parameter. Indeed, the concepts behind
the estimation of ρk , bk , fk are identical for every MAk (thus, the feature vector space, the
target set and the training set are denoted as X , X, and Str respectively). Second, we use
the ideal superscript to denote the ideal value of the membership function, i.e. f should be
interpreted as an estimate of f ideal .
where γ(·) is a differentiable, monotonically decreasing function on ]−∞; ρ[ such that γ(ρ) =
−νρ, which ensures the continuity of the loss function.
A non-zero constant loss of −νρ is assigned to the (non-penalized) zone located to the right
of the straight line yf (x) + yb = ρ. The reason is that ν makes it possible to establish
a bound on the membership errors (or recognition errors) in the training set (see
Section 5.3.5).
As shown in Fig. 5.4, the zone located to the left of yf (x) + yb = ρ is penalized
by the function γ(·). We consider γ(·) as a polynomial function of degree q ≥ 1 defined as:
As shown in Sections 5.3.3 and 5.6, the penalty degree q plays an important role in the
dynamical updating of the membership parameters.
Figure 5.4. The loss function penalizes the values of the product y (f (x) + b) that are smaller than
ρ. The penalization function γ(·) is a polynomial of degree q that ensures the continuity of the loss
function at the limit between the penalized and the non-penalized zones. The non-penalized zone
corresponds to a constant loss of −νρ. The loss in the non-penalized zone is non-zero
because ν makes it possible to control the fraction of membership errors in the training set (see
Section 5.3.5).
the risk functional minimum. However, the probability density function p(x, y) is generally
unknown in practical applications. An empirical estimate of p(x, y) can be obtained from
training data as follows.
\[
p_{\mathrm{emp}}(x, y) = \frac{1}{L} \sum_{l=1}^{L} \delta_a(x - x_l)\, \delta_a(y - y_l) \tag{5.6}
\]
where (xl , yl ) ∈ Str and δa (·) is the analog Dirac’s delta function.
By replacing pemp (x, y) into the definition of the risk functional, we obtain the empirical
risk functional:
\[
R_{\mathrm{emp}}[f] = \frac{1}{L} \sum_{l=1}^{L} c(x_l, y_l, f(x_l)) \tag{5.7}
\]
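The empirical risk (5.7) can be illustrated with a short sketch. It assumes the polynomial penalty takes the form implied later in the chapter by the slack variables, i.e. a loss of −νρ + ξ^q with ξ = max(0, ρ − y(f(x) + b)); the function names `loss` and `empirical_risk` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def loss(margin, rho, nu, q=1):
    """Loss of the product margin = y * (f(x) + b).

    Assumed form: -nu*rho in the non-penalized zone (margin >= rho) and
    -nu*rho + (rho - margin)**q in the penalized zone (margin < rho).
    """
    margin = np.asarray(margin, dtype=float)
    xi = np.maximum(0.0, rho - margin)  # depth into the penalized zone
    return -nu * rho + xi ** q

def empirical_risk(f_values, y, b, rho, nu, q=1):
    """Average loss over the training set, as in (5.7)."""
    margin = np.asarray(y, dtype=float) * (np.asarray(f_values, dtype=float) + b)
    return float(np.mean(loss(margin, rho, nu, q)))
```

With ρ = 1 and ν = 0.5, an element with product 2 contributes −0.5 and an element with product 0 contributes +0.5, so the two-element empirical risk is zero.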
Direct minimization of the empirical risk to obtain ρ, b and f is an ill conditioned
problem [147, 165], i.e. small changes in the training set may induce large changes in the
estimated parameters. Furthermore, the resulting estimation is biased [72] because the risk
functional is an ensemble statistic independent from any particular pair (xl , yl ) whereas the
empirical risk depends on the training set only.
Ill posed problems can be effectively solved by adding a regularization term [27, 166] to
the original (ill posed) problem. In order to regularize the minimization of the empirical
risk we introduce a functional space H to which f belongs. We then obtain the regularized
risk functional Rreg [f ] as follows.
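\[
R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\ell}{2} \langle f, f \rangle_{\mathcal{H}} \tag{5.8}
\]
where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product in H, ℓ ∈ R+ is the regularization constant, and $\frac{\ell}{2} \langle f, f \rangle_{\mathcal{H}}$
is the regularization term.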
Figure 5.5. The ideal membership function f ideal is in a general functional set which is not nec-
essarily a functional space. The minimizer of the regularized risk (optimal estimate) lies in the
functional space H. The (functional) distance between the optimal solution and the true solution is
the approximation error which depends on the choice of H.
Definition 5.2. Reproducing Kernel Hilbert Space (RKHS) [93, 147, 169]
H is a RKHS if it is a Hilbert space and the following properties are satisfied
This property implies that any matrix K with elements Klm = K(xl , xm ) is positive
semi-definite.
Using the fact that H is equal to the span of functions K(x ∈ X , ·), the minimizer of the
regularized risk (5.8) can be decomposed into a part contained in the span of the elements
in the training set and one in the orthogonal complement. This yields:
\[
f(\cdot) = \sum_{l=1}^{L} \alpha_l K(x_l, \cdot) + f_\perp(\cdot) \tag{5.10}
\]
where (xl , yl ) belongs to the training set, αl ∈ R, and the function f⊥ ∈ H is such that:
$\langle f_\perp, K(x_l, \cdot) \rangle_{\mathcal{H}} = 0$ for l = 1, . . . , L.
By replacing (5.10) into the regularized risk (5.8), one gets
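\[
\begin{aligned}
R_{\mathrm{reg}}[f] &= R_{\mathrm{emp}}[f] + \frac{\ell}{2} \left( \left\langle \sum_{l=1}^{L} \alpha_l K(x_l, \cdot), \sum_{l=1}^{L} \alpha_l K(x_l, \cdot) \right\rangle_{\mathcal{H}} + \langle f_\perp, f_\perp \rangle_{\mathcal{H}} \right) \\
&\geq R_{\mathrm{emp}}[f] + \frac{\ell}{2} \left\langle \sum_{l=1}^{L} \alpha_l K(x_l, \cdot), \sum_{l=1}^{L} \alpha_l K(x_l, \cdot) \right\rangle_{\mathcal{H}}
\end{aligned}
\]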
Thus, for any fixed α1 , . . . , αL ∈ R the regularized risk is minimized for f⊥ = 0. Therefore,
f is a linear combination of { K(xl , ·)| (xl , yl ) ∈ Str }.
\[
f(\cdot) = \sum_{l=1}^{L} \alpha_l K(x_l, \cdot) \tag{5.11}
\]
The coefficient αl is called the expansion coefficient associated with xl . Thus, estimating
f amounts to estimating the expansion coefficients of the training vectors. This result is a
particular case of the representer theorem of Kimeldorf and Wahba [92].
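The kernel expansion of the membership function can be sketched as follows. The sketch assumes a Gaussian kernel in its common form exp(−‖x − x′‖² / (2σ²)), which may differ from the thesis' own definition by a scaling convention; `gaussian_kernel` and `membership` are hypothetical names.

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    """Gaussian kernel, assumed form exp(-||x1 - x2||^2 / (2 sigma^2))."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def membership(x, train_x, alpha, sigma):
    """Evaluate f(x) = sum_l alpha_l K(x_l, x).

    Training vectors with a zero expansion coefficient are skipped,
    since they do not contribute to the sum.
    """
    return sum(a * gaussian_kernel(xl, x, sigma)
               for a, xl in zip(alpha, train_x) if a != 0.0)
```

Only the training vectors with non-zero coefficients need to be stored to evaluate f at a new point.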
This relation states that the inner product of two elements in H, namely φ(x) and φ(x′ ) can
be simply calculated by applying the kernel function on the pre-images of those elements
(i.e. x and x′ ). This constitutes the essence of the well known ”kernel trick” (see Chapter 3,
Section 3.4.2 and [1]).
By computing the membership function (5.11) of an x ∈ X , using the reproducing
property (5.9) and the mapping φ we get
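\begin{align}
f(x) &= \sum_{l=1}^{L} \alpha_l K(x_l, x) \tag{5.13} \\
&= \sum_{l=1}^{L} \alpha_l \langle K(x_l, \cdot), K(x, \cdot) \rangle_{\mathcal{H}} \tag{5.14} \\
&= \langle w, \phi(x) \rangle_{\mathcal{H}} \tag{5.15}
\end{align}
where
\[
w = \sum_{l=1}^{L} \alpha_l K(x_l, \cdot) = \sum_{l=1}^{L} \alpha_l \phi(x_l) \tag{5.16}
\]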
From the above considerations, it can be said that the map φ makes the sets:
linearly separable by f with margin M (see Fig. 5.6) defined as the distance between the
separating margins.
\[
2M = \frac{2\rho}{\|w\|_{\mathcal{H}}} \tag{5.18}
\]
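As an illustration of (5.18), the norm of w can be computed from the expansion coefficients and the kernel (Gram) matrix, since ‖w‖² = Σ αl αm K(xl, xm) follows from (5.16) and the reproducing property. The helper name `margin_half_width` is hypothetical.

```python
import numpy as np

def margin_half_width(alpha, gram, rho):
    """Return M = rho / ||w||_H, with ||w||_H^2 = alpha^T K alpha.

    gram[l, m] holds K(x_l, x_m) for the training vectors.
    """
    alpha = np.asarray(alpha, dtype=float)
    w_norm_sq = float(alpha @ np.asarray(gram, dtype=float) @ alpha)
    return rho / np.sqrt(w_norm_sq)
```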
Figure 5.6. Separating margins ⟨w, φ(x)⟩H + b = ±ρ in H. The filled squares represent the φ(xl ) for
which yl = +1 and the stars those for which yl = −1 (the elements (xl , yl ) belong to the training
set). The φ(xl ) that are located between the separating margins are called margin errors (they have
positive expansion coefficients equal to 1/L). The on-margin-elements have their positive expansion
coefficients in ]0; 1/L[.
Those φ(xl ) whose membership is correctly decided by f and are not on the margins are called
non-support vectors and their expansion coefficients are equal to zero.
\[
R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\ell}{2} \|w\|_{\mathcal{H}}^2 \tag{5.19}
\]
The membership parameters are then given by
Hypothesis 5.3. ∀ (xl , yl ) ∈ Str , yl (f (xl ) + b) > ρ, i.e. the training data lie in the non-
penalized zone of the loss function. In other words the membership of each element in Str
is correctly decided by f .
Under this hypothesis, the loss function of each element in the training set is equal to:
−νρ (see Section 5.3.1). The empirical risk (5.7) being the average of the loss function in
the training set, is also equal to −νρ. Thus, the regularized risk becomes
\[
R_{\mathrm{reg}}[f] = \frac{\ell}{2} \|w\|_{\mathcal{H}}^2 - \nu\rho \tag{5.21}
\]
Since f and w are equivalent (5.14), we can obtain the first from the latter. Thus, the
membership parameters are obtained by solving the optimization problem1 :
\[
(f, \rho, b) = \arg\min_{w, \rho, b} \; \frac{\ell}{2} \|w\|_{\mathcal{H}}^2 - \nu\rho \tag{5.22}
\]
constrained to
Geometrically, minimizing (5.21) amounts to maximizing the distance between the separating
margins M (5.18). This concept is central (and usually the starting point) to support
vector machine learning algorithms, which are considered large margin classifiers [10].
The minimization of Rreg subject to constraints (5.23) and (5.24) is called “hard margin”
optimization because no membership errors in the training set are allowed. However, a small
number of membership errors in the training set (training error) does not necessarily lead to
good predictions of the membership of feature vectors that were not used in the estimation of
the membership parameters (unseen feature vectors). The error incurred by f in predicting the
membership of unseen feature vectors is called the generalization error. As pointed out
in Section A.1 in the appendix, a too small training error leads to over-fitting, i.e. small
training error and large generalization error.
1 By abuse of notation (since f and w are equivalent (5.14)) we write $(f, \rho, b) = \arg\min_{w,\rho,b} \{\ldots\}$ for $(w, \rho, b) = \arg\min_{w,\rho,b} \{\ldots\}$.
To control over-fitting, we modify our initial hypothesis (Hyp. 5.3) by relaxing the
constraints (5.23). The relaxation is carried out by introducing positive slack variables
ξ1 , . . . , ξL so that the new constraints become: $y_l (\langle w, \phi(x_l) \rangle_{\mathcal{H}} + b) \geq \rho - \xi_l$. The slack
variables have the effect of bringing the training data into the penalized zone of the loss
function (see Section 5.3.1), i.e. $c(x_l, y_l, f(x_l)) = -\nu\rho + \xi_l^q$ for l = 1, . . . , L. The empirical
risk (5.7) associated with the training data under the relaxed constraints is:
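\[
R_{\mathrm{emp}} = -\nu\rho + \frac{1}{L} \sum_{l=1}^{L} \xi_l^q \tag{5.25}
\]
\[
R_\xi[f] = \frac{\ell}{2} \|w\|_{\mathcal{H}}^2 - \nu\rho + \frac{1}{L} \sum_{l=1}^{L} \xi_l^q \tag{5.26}
\]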
Therefore, the membership parameters are estimated by minimizing the relaxed risk
under the relaxed constraints, as follows:
\[
(f, \rho, b) = \arg\min_{w, \rho, b, \xi_1, \ldots, \xi_L} \left( \frac{\ell}{2} \|w\|_{\mathcal{H}}^2 - \nu\rho + \frac{1}{L} \sum_{l=1}^{L} \xi_l^q \right) \tag{5.27}
\]
constrained to
for l = 1, . . . , L.
It is worth noting that while the constraints are relaxed by the slack variables, the sum
$\sum_{l=1}^{L} \xi_l^q$ prevents too many ξl from becoming larger than zero. In this way, the slack variables
determine the tradeoff between over-fitting and training error.
The relaxed optimization is handled by introducing non-negative Lagrange multipliers
α̃1 , . . . , α̃L , β1 , . . . , βL , δ ≥ 0 and a Lagrangian ΛP .
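\[
\begin{aligned}
\Lambda_P = {} & \frac{\ell}{2} \|w\|_{\mathcal{H}}^2 - \nu\rho + \frac{1}{L} \sum_{l=1}^{L} \xi_l^q \\
& - \sum_{l=1}^{L} \Bigl( \tilde{\alpha}_l \bigl( y_l (\langle w, \phi(x_l) \rangle_{\mathcal{H}} + b) - \rho + \xi_l \bigr) + \beta_l \xi_l \Bigr) - \delta\rho
\end{aligned}
\tag{5.31}
\]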
translated without changing the optimum value of ΛD [24]. This situation is highly excep-
tional in practice. Furthermore, in Section 5.6 we show that in order to dynamically update
the membership parameters q must be set to one.
By setting q = 1, the second term on the right in (5.36) vanishes, making the regularization
constant ℓ no longer relevant. For convenience we set ℓ to one. Thus, the α̃l are found
by minimizing $-\Lambda_D|_{\ell=1, q=1}$:
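\[
(\tilde{\alpha}_1, \ldots, \tilde{\alpha}_L) = \arg\min_{\tilde{\alpha}_1, \ldots, \tilde{\alpha}_L} \; \frac{1}{2} \sum_{l,m=1}^{L} \tilde{\alpha}_l \tilde{\alpha}_m y_l y_m K(x_l, x_m) \tag{5.37}
\]
constrained to:
\[
0 \leq \tilde{\alpha}_l \leq \frac{1}{L} \quad \text{(results from (5.35), } q = 1 \text{, and } \beta_l \geq 0\text{)} \tag{5.38}
\]
\[
\sum_{l=1}^{L} \tilde{\alpha}_l \geq \nu \quad \text{(results from (5.33) and } \delta \geq 0\text{)} \tag{5.39}
\]
\[
\sum_{l=1}^{L} y_l \tilde{\alpha}_l = 0 \quad \text{(results from (5.34))} \tag{5.40}
\]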
Standard quadratic programming techniques [168] can be used to solve the above
optimization problem and find the optimum values of α̃1 , . . . , α̃L . These coefficients completely
determine the membership parameters, as we explain later in the text. For ease of explanation
we analyze the solution in terms of the α̃l ’s.
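A candidate solution of the dual problem can be checked against the constraints (5.38) to (5.40) and scored with the objective (5.37). The sketch below does not solve the quadratic program; it only evaluates a given vector of multipliers. The names `dual_objective` and `is_feasible` and the tolerance `tol` are assumptions made for this illustration.

```python
import numpy as np

def dual_objective(alpha_t, y, gram):
    """Objective of (5.37): 0.5 * sum_{l,m} a_l a_m y_l y_m K(x_l, x_m)."""
    v = np.asarray(alpha_t, dtype=float) * np.asarray(y, dtype=float)
    return 0.5 * float(v @ np.asarray(gram, dtype=float) @ v)

def is_feasible(alpha_t, y, nu, tol=1e-9):
    """Check the box (5.38), sum (5.39) and balance (5.40) constraints."""
    a = np.asarray(alpha_t, dtype=float)
    y = np.asarray(y, dtype=float)
    L = a.size
    box = np.all(a >= -tol) and np.all(a <= 1.0 / L + tol)
    return bool(box and a.sum() >= nu - tol and abs(float(a @ y)) <= tol)
```

Any off-the-shelf quadratic programming routine could then be used to search for the feasible vector with the smallest objective.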
At the optimum, the Karush-Kuhn-Tucker (KKT) conditions [95] imply that the fol-
lowing relations hold.
The position of φ(xl ) with respect to the separating margins ⟨w, φ(x)⟩H + b = ±ρ
depends on α̃l . According to (5.38) three possibilities exist:
• If α̃l = 0, then yl (⟨w, φ(xl )⟩H + b) > ρ. Therefore, the membership of xl with respect
to X is correctly determined (see Fig. 5.7a).
• If 0 < α̃l < 1/L, then yl (⟨w, φ(xl )⟩H + b) = ρ. Again the membership of xl with respect
to X is correctly determined (Fig. 5.7b). In this case, φ(xl ) is an on-margin-element,
i.e. it lies on the separating margin ⟨w, φ(x)⟩H + b = yl ρ (see Fig. 5.6).
Figure 5.7. Membership of xl depending on the value of α̃l . Top: when α̃l = 0, the membership
of xl is correctly determined. Middle: when 0 < α̃l < 1/L, φ(xl ) is on the separating margin
⟨w, φ(x)⟩H + b = yl ρ. The membership of xl is correctly decided. Bottom: when α̃l = 1/L, the
membership is correctly determined if ξl = 0 and wrongly determined if ξl > 0.
For properly collected training data and an adequate choice of the kernel function
(Sect. 5.4) one expects that most of the training elements satisfy the condition in Fig. 5.7a,
i.e. the solution is expected to be sparse in the α̃l ’s. As a matter of fact, the number of α̃l ’s
different from zero constitutes an indication on the expected generalization error [147, 165].
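The three cases of Fig. 5.7 can be summarized by a small helper that labels a training element from its multiplier α̃l. This is a minimal sketch; the function name, the returned labels and the tolerance `tol` are hypothetical.

```python
def element_status(alpha_t, L, tol=1e-12):
    """Label a training element from its multiplier (cases of Fig. 5.7)."""
    upper = 1.0 / L
    if alpha_t <= tol:
        return "non-support vector"   # membership correctly determined
    if alpha_t >= upper - tol:
        return "at upper bound"       # correct if xi = 0, wrong if xi > 0
    return "on-margin element"        # lies on a separating margin
```

Counting the elements that are not labeled as non-support vectors gives the number of non-zero α̃l ’s mentioned above.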
The role of ν
In Section 5.3.1 we assigned a constant loss of −νρ to the non-penalized zone and mentioned
that the parameter ν makes it possible to control the training error. To illustrate this, we consider
the KKT condition on ρ (5.43). If ρ is strictly positive then δ = 0, which according to (5.33)
implies $\sum_{l=1}^{L} \tilde{\alpha}_l = \nu$. In particular, the sum of the α̃l ’s for which the membership of their
respective xl is wrongly determined (i.e. ξl > 0) should satisfy:
\[
\sum_{l \,|\, \tilde{\alpha}_l = \frac{1}{L};\ \xi_l > 0} \tilde{\alpha}_l \leq \nu \tag{5.44}
\]
\[
\nu = \sum_{l \,|\, \tilde{\alpha}_l > 0} \tilde{\alpha}_l \leq \frac{1}{L} \bigl| \{\, l \mid \tilde{\alpha}_l > 0 \,\} \bigr| \tag{5.46}
\]
The φ(xl ) (and by extension the respective xl ) for which αl > 0 are called support vectors
because they completely determine the membership of any x ∈ X. Thus, from (5.46), ν
lower bounds the fraction of support vectors (FSV). Combining (5.46) and (5.45) we obtain
the ν inequality:
FTE 6 ν 6 FSV (5.47)
The smaller ν the smaller the FTE. However, a too small FTE will not generally lead to
a small generalization error because of the over-fitting. Thus, ν needs to be adjusted in order
to reach a compromise between the good generalization (i.e. small expected generalization
error) and the FTE. As we show in the next section, the generalization error and the FTE
depend also on the kernel function. An interdependent choice of ν and the kernel function
is presented in Section 5.5.
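The ν inequality (5.47) can be checked numerically on a solved problem. The sketch below assumes, following the text, that a training error corresponds to ξl > 0 and a support vector to α̃l > 0; the function name `nu_inequality` and the tolerance are hypothetical.

```python
import numpy as np

def nu_inequality(alpha_t, xi, nu, tol=1e-12):
    """Return (FTE, FSV, holds) where holds tests FTE <= nu <= FSV."""
    alpha_t = np.asarray(alpha_t, dtype=float)
    xi = np.asarray(xi, dtype=float)
    fte = float(np.mean(xi > tol))        # fraction of training errors
    fsv = float(np.mean(alpha_t > tol))   # fraction of support vectors
    return fte, fsv, bool(fte <= nu + tol and nu <= fsv + tol)
```

For example, with L = 4, multipliers (1/4, 1/8, 1/8, 0) and a single positive slack variable, the result is FTE = 0.25 ≤ ν = 0.5 ≤ FSV = 0.75.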
So far, we have discussed the solution of the relaxed optimization in terms of the α̃l ’s.
We now turn to determining the membership parameters f, ρ, b from the α̃l ’s.
The membership function f is completely determined by the expansion coefficients αl
which in turn, according to (5.16) and (5.32) satisfy
The decision threshold ρ and decision offset b are obtained by taking two elements xl1
and xl2 such that:
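\[
l_1 = \arg\min_{l_1 \,|\, y_{l_1} = +1} \left| \frac{1}{2L} - \tilde{\alpha}_{l_1} \right|
\]
\[
l_2 = \arg\min_{l_2 \,|\, y_{l_2} = -1} \left| \frac{1}{2L} - \tilde{\alpha}_{l_2} \right|
\]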
5.4. Kernel function 95
Figure 5.8. The functional map φ is such that it transforms an element x ∈ X into a function
φ(x) = K(x, ·) which represents a measure of the similarity between x and all the other elements
in X . Thus, we expect K(x, ·) to be centered around x and take its maximum value at this same
point.
From Fig. 5.7(b) we know that φ(xl1 ) and φ(xl2 ) are on the margins, therefore:
\[
\rho = \frac{1}{2} \bigl( \langle w, \phi(x_{l_1}) \rangle_{\mathcal{H}} - \langle w, \phi(x_{l_2}) \rangle_{\mathcal{H}} \bigr) \tag{5.50}
\]
\[
b = -\frac{1}{2} \bigl( \langle w, \phi(x_{l_1}) \rangle_{\mathcal{H}} + \langle w, \phi(x_{l_2}) \rangle_{\mathcal{H}} \bigr) \tag{5.51}
\]
Theoretically any xl1 , xl2 whose corresponding α̃l1 , α̃l2 are in ]0; 1/L[ can be selected to
estimate ρ and b. However, for reasons of numerical precision, choosing two elements whose
positive expansion coefficients are close to 1/(2L) improves the reliability of the
result.
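The selection of the two on-margin elements and the computation of ρ and b in (5.50) and (5.51) can be sketched as follows. The array `f_train` is assumed to hold the values ⟨w, φ(xl)⟩H for the training vectors; the function name is hypothetical.

```python
import numpy as np

def threshold_and_offset(alpha_t, y, f_train):
    """Estimate rho and b from one element of each class.

    The element of each class whose multiplier is closest to 1/(2L)
    is used, as a proxy for a well-conditioned on-margin element.
    """
    alpha_t = np.asarray(alpha_t, dtype=float)
    y = np.asarray(y)
    f_train = np.asarray(f_train, dtype=float)
    L = alpha_t.size
    dist = np.abs(1.0 / (2 * L) - alpha_t)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    l1 = pos[np.argmin(dist[pos])]
    l2 = neg[np.argmin(dist[neg])]
    rho = 0.5 * (f_train[l1] - f_train[l2])
    b = -0.5 * (f_train[l1] + f_train[l2])
    return float(rho), float(b)
```

By construction the two selected elements then satisfy f + b = ρ and f + b = −ρ respectively.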
as:
\[
K_d(x_1, x_2) = \bigl( \langle x_1, x_2 \rangle_{\mathcal{X}} \bigr)^d \tag{5.52}
\]
If d is set to 1 we obtain the linear kernel which represents the inner product in X . In
this case the space H is equivalent to X .
\[
K_{d=1}(x_1, x_2) = \langle x_1, x_2 \rangle_{\mathcal{X}} \tag{5.53}
\]
One can prove [139] that the larger the degree of the polynomial kernel, the smaller the FTE.
Yet, large generalization errors are associated with large values of d. A compromise can
be reached through cross-validation. In a more elaborate approach [26], d is chosen by
minimizing a theoretical bound on the generalization error.
The disadvantage of the polynomial kernel resides in its sensitivity to
scaling factors [157]. Indeed, if the x’s are not centered around the origin (usually they are
far from the origin, especially when band power values are used to determine the feature
vector space, see Chapter 4, Section 4.3), their norms are large and the
angle between them is small. In this case the polynomial kernel becomes:
\[
K_d(x_1, x_2) = \|x_1\|_{\mathcal{X}}^d \, \|x_2\|_{\mathcal{X}}^d \cos^d \angle(x_1, x_2) \approx \|x_1\|_{\mathcal{X}}^d \, \|x_2\|_{\mathcal{X}}^d
\]
Thus, Kd is determined only by the norms of its arguments, without taking into account
the angle (i.e. the genuine dissimilarity factor). A possible way to handle this consists
in normalizing the training data before estimating the membership parameters. The
normalization process consists in making each component of the training vectors zero mean
and unit standard deviation. The normalization parameters are then stored so as to apply
them to new data. Another, more principled way to prevent scaling problems consists in
whitening the training data by diagonalizing their covariance matrix [20].
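The component-wise normalization described above can be sketched in a few lines. The function names are hypothetical, and replacing a zero standard deviation by one (to avoid division by zero for constant components) is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def fit_normalizer(train_x):
    """Compute per-component mean and standard deviation of the training set."""
    train_x = np.asarray(train_x, dtype=float)
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0.0] = 1.0  # assumption: leave constant components unscaled
    return mean, std

def apply_normalizer(x, mean, std):
    """Apply the stored normalization parameters to training or new data."""
    return (np.asarray(x, dtype=float) - mean) / std
```

The pair (mean, std) is computed once on the training data and reused unchanged for every new feature vector.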
The dimension of the functional space generated by Kσ is infinite [147, 157]. This can
be intuitively understood because it is possible to find an infinite number of pairs x1 , x2 ∈ X
such that ⟨φ(x1 ), φ(x2 )⟩H = Kσ (x1 , x2 ) ≈ 0.
The Gaussian kernel constitutes an attractive choice for our application because of its
outstanding classification performance in the EEG framework [59, 61, 62, 63], its relative
insensitivity to scaling factors, and its capacity to accurately approximate any classification
surface [148].
Generally, the smaller σ the smaller the FTE (see Section A.1 in the appendix). However,
a too small σ leads to over-fitting. The following proposition summarizes the influence
of σ on the FTE and FSV.
ii) FTE [Kσ≫1 , ν] = FTE [Kd=1 , ν] where Kd=1 is the linear kernel defined in (5.53).
Too large values of σ under-fit the training data.
Proof
By replacing (5.55) into (5.37), the coefficients α̃l are then found by solving:
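\[
(\tilde{\alpha}_1, \ldots, \tilde{\alpha}_L) = \arg\min_{\tilde{\alpha}_1, \ldots, \tilde{\alpha}_L} \left( \frac{1}{2} \sum_{l=1}^{L} \tilde{\alpha}_l^2 \right) \tag{5.56}
\]
Since ρ > 0, (5.39) implies $\sum_{l=1}^{L} \tilde{\alpha}_l = \nu$. Then the solution of (5.56) is:
\[
\tilde{\alpha}_1 = \ldots = \tilde{\alpha}_L = \frac{\nu}{L} < \frac{1}{L}
\]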
Thus, φ(x1 ), . . . , φ(xL ) are all on the separating margin (see Fig. 5.7). Consequently,
the membership of all the training data is correctly determined, i.e. FTE [Kσ→0 , ν] = 0.
Also, since all the α̃l ’s are strictly positive FSV [Kσ→0 , ν] = 1.
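Since, according to (5.34), $\sum_{l=1}^{L} y_l \tilde{\alpha}_l = 0$, the first two terms on the right in (5.58) vanish.
Using the definition of the linear kernel (5.53), the coefficients α̃l are found by solving:
\[
(\tilde{\alpha}_1, \ldots, \tilde{\alpha}_L) = \arg\min_{\tilde{\alpha}_1, \ldots, \tilde{\alpha}_L} \; \frac{1}{\sigma^2} \sum_{l,m} y_l y_m \tilde{\alpha}_l \tilde{\alpha}_m K_{d=1}(x_l, x_m) \tag{5.59}
\]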
The optimum α̃l being identical for Kσ≫1 and Kd=1 , FTE [Kσ≫1 ] = FTE [Kd=1 ].
Section A.1 in the appendix) constitutes the maximum FTE for Gaussian kernels. In
practice, FTE [Kd=1 , ν = 0.5] can be larger than 0.5; the initial ν is therefore given by:
\[
\nu_0 = \min\left( \frac{1}{2}, \; \mathrm{FTE}[K_{d=1}, \nu = 0.5] \right) \tag{5.60}
\]
To determine the loose interval for σ we need to sample the function GE [Kσ , ν0 ], i.e. the
generalization error estimate as a function of the Gaussian kernel parameter. Between the
small and large values of σ which, respectively, over-fit and under-fit the training data
(Proposition 5.4), a small value of GE [Kσ , ν0 ] has to be found.
Since the Gaussian kernel dissimilarity measure is the Euclidean distance of its argu-
ments, it makes sense to take, as extreme values for σ, the minimum ∆min and the maximum
∆max of the Euclidean distance in the training set.
The evolution of GE [Kσ , ν0 ] for σ in [∆min , ∆max ] can be efficiently covered (i.e. with
relatively few values) by geometrically sampling in [∆min , ∆max ]. Thus, the set of σ values
at which GE [Kσ , ν0 ] is evaluated is:
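\[
V_\sigma = \left\{ \sigma_v = (\Delta_{\min})^{\frac{N_\sigma - v}{N_\sigma - 1}} (\Delta_{\max})^{\frac{v - 1}{N_\sigma - 1}} \;\middle|\; v = 1, \ldots, N_\sigma \right\}
\]
where Nσ is the number of samples. The sampling ratio is $\left( \Delta_{\max} / \Delta_{\min} \right)^{\frac{1}{N_\sigma - 1}}$.
Let σv∗ be the value for which the generalization error estimate in Vσ is minimum,
namely, $v^* = \arg\min_{v = 1, \ldots, N_\sigma} \mathrm{GE}[K_{\sigma_v}, \nu_0]$. Then, the loose interval for σ is:
\[
I_\sigma = [\sigma_{v^* - 1}; \sigma_{v^* + 1}]
\]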
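The geometric sampling of σ and the selection of the loose interval can be sketched as follows. The generalization error estimates are assumed to be supplied as an array `ge_values` computed elsewhere; clamping the interval at the ends of the grid when the minimum falls on the first or last sample is an assumption of this sketch.

```python
import numpy as np

def sigma_grid(d_min, d_max, n_sigma):
    """Geometric grid sigma_v, v = 1..n_sigma, between d_min and d_max."""
    v = np.arange(1, n_sigma + 1)
    return (d_min ** ((n_sigma - v) / (n_sigma - 1))
            * d_max ** ((v - 1) / (n_sigma - 1)))

def loose_interval(grid, ge_values):
    """Return the neighbours of the grid value with the smallest GE estimate."""
    v_star = int(np.argmin(ge_values))
    lo = grid[max(v_star - 1, 0)]             # clamped at the lower end
    hi = grid[min(v_star + 1, len(grid) - 1)]  # clamped at the upper end
    return float(lo), float(hi)
```

With ∆min = 1, ∆max = 16 and Nσ = 5 the grid is 1, 2, 4, 8, 16, i.e. a sampling ratio of 2.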
We mentioned earlier that the FTE is a growing function of σ. This means that ν can
be set to $\mathrm{FTE}[K_{\sigma_{v^*+1}}, \nu_0]$, i.e. the training error associated with $\sigma_{v^*+1}$ and ν0 .
By linearly sampling in Iσ , one can readily find an approximation of $\arg\min_{\sigma \in I_\sigma} \mathrm{GE}[K_\sigma, \nu]$.
The approximation accuracy depends on the sampling resolution.
The cross-validation approach constitutes a practical way to select σ and ν that more
often than not leads to very good results [62, 147]. However, as explained in [147], this
approach amounts to optimizing the parameters on the same set as the one used for training,
which can potentially lead to over-fitting.
In Fig. 5.9 we report a case (that appears often in practice) in which the location of the
σ that makes GE [σ, ν] minimum is rather fuzzy. Indeed, there is almost no difference in
choosing either of the σ values indicated by the vertical dashed lines. We can empirically
remove the fuzziness by also considering the FSV which, according to [147] constitutes
an upper bound on the generalization error and is also linked with the complexity of the
decision surface (see Section A.1 in the appendix). Thus, the selection of σ can be modified
[Plot for Figure 5.9: curves of FSV, FTE, GE and the theoretical bound (Th. bound) versus σ/∆min, with ν = 0.45 marked.]
Figure 5.9. Evolution of the FSV, FTE, GE and the theoretical bound B as a function of σ normalized
to ∆min . The dotted line marks the value of ν which upper bounds the FTE and lower bounds the
FSV. In this example, the optimum value of σ is rather fuzzy. Indeed, if only the GE is considered
there is almost no difference in choosing either of the values indicated by the vertical dashed lines.
In contrast, the theoretical bound B clearly indicates the most adequate choice.
The theoretical bound curve was conveniently scaled for visualization purposes.
by taking the minimum of an empirical aggregate criterion that takes into account the
GE and a strictly growing function of the FSV. In fact, such a criterion can be obtained
using the general minimum description length (MDL) framework (see [140] and Chapter 4,
Section 4.5). The MDL based choice for σ is:
\[
\sigma = \arg\min_{\sigma} \left( \log \bigl( \mathrm{GE}[\sigma, \nu] \bigr) + \frac{\log L}{L} \, \mathrm{FSV}[\sigma, \nu] \right) \tag{5.63}
\]
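The MDL based choice can be illustrated on sampled values of σ. The sketch assumes the criterion has the form log(GE) + (log L / L) · FSV and that the GE and FSV values for each candidate σ have already been estimated; the function name `mdl_sigma` is hypothetical.

```python
import numpy as np

def mdl_sigma(sigmas, ge, fsv, L):
    """Pick the sigma minimizing log(GE) + (log L / L) * FSV."""
    crit = (np.log(np.asarray(ge, dtype=float))
            + (np.log(L) / L) * np.asarray(fsv, dtype=float))
    return float(np.asarray(sigmas, dtype=float)[int(np.argmin(crit))])
```

When two candidates have the same GE, the FSV term breaks the tie in favour of the solution with fewer support vectors.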
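1: $\sigma^{*(0)} = \sigma_{\max}$
2: $\nu^{*(0)} = \min\left( \frac{1}{2}, \; \mathrm{FTE}[K_{d=1}, \nu = 0.5] \right)$
3: $B^{(0)} = B^{(0)}\bigl( \sigma^{*(0)}, \nu^{*(0)} \bigr)$
4: $\mathrm{FTE}^{(0)} = \mathrm{FTE}\bigl[ K_{\sigma^{*(0)}}, \nu^{*(0)} \bigr]$
5: $n = 1$
6: repeat
      $\sigma^{*(n)} = \sigma^{*(n-1)} - \eta_\sigma \, \partial_\sigma B^{(n-1)} \big|_{\sigma = \sigma^{*(n-1)}}$
      $\nu^{*(n)} = \mathrm{FTE}^{(n-1)}$
      $B^{(n)} = B^{(n)}\bigl( \sigma^{*(n)}, \nu^{*(n)} \bigr)$
      $\mathrm{FTE}^{(n)} = \mathrm{FTE}\bigl[ K_{\sigma^{*(n)}}, \nu^{*(n)} \bigr]$
      $n = n + 1$
7: until $B^{(n-1)} > B^{(n-2)}$ (i.e. B starts to increase). Thus, σ∗ and ν∗ are set to $\sigma^{*(n-1)}$
and $\nu^{*(n-1)}$ respectively.
The notation $B^{(n)} = B^{(n)}\bigl( \sigma^{*(n)}, \nu^{*(n)} \bigr)$ means that $B^{(n)}$ is computed using the values
of the radius and the margin corresponding to $\sigma^{*(n)}$ and $\nu^{*(n)}$. Note that since ν upper
bounds the FTE, its value at the n-th step is set to the previous FTE, namely $\mathrm{FTE}^{(n-1)}$.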
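The iteration can be sketched with the bound, its derivative with respect to σ, and the training error passed in as callables. All three callables (`bound`, `bound_grad`, `fte`), the iteration cap `max_iter` and the function name are hypothetical placeholders. The sketch also makes one interpretive choice: it returns the last pair (σ, ν) obtained before the bound increased, which is one possible reading of the stopping rule.

```python
def select_sigma_nu(sigma_max, nu0, bound, bound_grad, fte, eta_sigma, max_iter=100):
    """Gradient descent on sigma with nu tracking the previous FTE.

    bound(sigma, nu), bound_grad(sigma, nu) and fte(sigma, nu) are
    user-supplied callables (assumptions of this sketch).
    """
    sigma, nu = sigma_max, nu0
    b_prev = bound(sigma, nu)
    fte_prev = fte(sigma, nu)
    for _ in range(max_iter):
        sigma_new = sigma - eta_sigma * bound_grad(sigma, nu)
        nu_new = fte_prev                 # nu is set to the previous FTE
        b_new = bound(sigma_new, nu_new)
        if b_new > b_prev:                # B starts to increase: stop
            break
        sigma, nu, b_prev = sigma_new, nu_new, b_new
        fte_prev = fte(sigma, nu)
    return sigma, nu
```

The step size `eta_sigma` plays the role of ησ; too large a value makes the first step overshoot and the loop stop immediately.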
In particular, we have shown that the membership function f can be written as
$f(\cdot) = \sum_{l=1}^{L} \alpha_l K(x_l, \cdot)$ and is sparse in the expansion coefficients αl . The non-zero
5.6. Dynamic updating of the membership parameters 103
where Rreg [Str ] is the regularized risk (5.8) associated with the training set.
When new feature vectors whose membership is known become available, the membership
parameters need to be estimated again so as to adapt them to possible changes. Let
{(xL+1 , yL+1 ) , . . . , (xL+m , yL+m )} be the set of new training data; the new set of membership
parameters $D^{(L+m)}$ is therefore given by:
\[
D^{(L+m)} = \arg\min \bigl( R_{\mathrm{reg}}[S_{tr} \cup \{ (x_{L+1}, y_{L+1}), \ldots, (x_{L+m}, y_{L+m}) \}] \bigr) \tag{5.70}
\]
To determine R we apply the method presented in [93] according to which, the reg-
ularized risk (5.8) with ℓ = 1, is locally approximated at (xL+m , yL+m ) by the stochastic
risk
\[
R_{\mathrm{stoch}}\bigl[ f^{(L+m-1)}, L+m \bigr] = c\bigl( x_{L+m}, y_{L+m}, f^{(L+m-1)}(x_{L+m}) \bigr) + \frac{1}{2} \bigl\langle f^{(L+m-1)}, f^{(L+m-1)} \bigr\rangle_{\mathcal{H}} \tag{5.73}
\]
where
\[
f^{(L+m-1)}(\cdot) = \sum_{l=1}^{L+m-1} \alpha_l^{(L+m-1)} K(x_l, \cdot) \tag{5.74}
\]
and $c\bigl( x_{L+m}, y_{L+m}, f^{(L+m-1)}(x_{L+m}) \bigr)$ is the loss function corresponding to the new
training pair (xL+m , yL+m ) and the previous membership function $f^{(L+m-1)}$.
If we denote as θ any element in D, its updating relation is:
\[
\theta^{(L+m)} = \theta^{(L+m-1)} - \eta_m \frac{\partial R_{\mathrm{stoch}}\bigl[ f^{(L+m-1)}, L+m \bigr]}{\partial \theta^{(L+m-1)}} \tag{5.75}
\]
where ηm ∈ R+ is the updating coefficient when the (L + m)-th training element becomes
available. The membership parameters of index (L) are those estimated using Str . This
means: $\alpha_l^{(L)} = \alpha_l$, $\rho^{(L)} = \rho$ and $b^{(L)} = b$.
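\[
\begin{aligned}
f^{(L+m)} = f^{(L+m-1)} &- \eta_m \left( \frac{\partial c\bigl( x_{L+m}, y_{L+m}, f^{(L+m-1)}(x_{L+m}) \bigr)}{\partial f^{(L+m-1)}(x_{L+m})} \right) \left( \frac{\partial f^{(L+m-1)}(x_{L+m})}{\partial f^{(L+m-1)}} \right) \\
&- \frac{\eta_m}{2} \left( \frac{\partial \bigl\langle f^{(L+m-1)}, f^{(L+m-1)} \bigr\rangle_{\mathcal{H}}}{\partial f^{(L+m-1)}} \right)
\end{aligned}
\tag{5.77}
\]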
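\[
\frac{\partial \langle g, f \rangle_{\mathcal{H}}}{\partial f} = g \tag{5.78}
\]
\[
\frac{\partial \langle f, f \rangle_{\mathcal{H}}}{\partial f} = 2f \tag{5.79}
\]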
Using (5.78), (5.79), and the reproducing property (Def. 5.2); the functional derivatives
in (5.76) are given by:
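\[
\frac{\partial f^{(L+m-1)}(x_{L+m})}{\partial f^{(L+m-1)}} = \frac{\partial \bigl\langle K_\sigma(x_{L+m}, \cdot), f^{(L+m-1)} \bigr\rangle_{\mathcal{H}}}{\partial f^{(L+m-1)}} = K_\sigma(x_{L+m}, \cdot) \tag{5.80}
\]
\[
\frac{\partial \bigl\langle f^{(L+m-1)}, f^{(L+m-1)} \bigr\rangle_{\mathcal{H}}}{\partial f^{(L+m-1)}} = 2 f^{(L+m-1)} \tag{5.81}
\]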
Replacing the loss function definition (see Section 5.3.1), (5.80), and (5.81) in the mem-
bership function updating equation (5.77) yields:
\[
f^{(L+m)} = (1 - \eta_m) f^{(L+m-1)} + \eta_m \, y_{L+m} \, q \, \bigl( g^{(L+m-1)}(x_{L+m}) \bigr)^{q-1} \, \Theta\bigl[ g^{(L+m-1)}(x_{L+m}) \bigr] \, K_\sigma(x_{L+m}, \cdot) \tag{5.82}
\]
where $g^{(L+m-1)}(x_{L+m}) = -\rho^{(L+m-1)} + y_{L+m} \bigl( f^{(L+m-1)}(x_{L+m}) + b^{(L+m-1)} \bigr)$ and Θ [u] is
such that
\[
\Theta[u] =
\begin{cases}
1 & u < 0 \\
0 & u \geq 0
\end{cases}
\]
It should be noted that if $g^{(L+m-1)}(x_{L+m})$ is positive or zero the membership of xL+m is
correctly determined by $f^{(L+m-1)}$. In this case, (5.82) reduces to: $f^{(L+m)} = (1 - \eta_m) f^{(L+m-1)}$.
By replacing the membership functions by their respective linear expansions in terms
of the kernel functions (5.74) we get the updating equations for the expansion coefficients
αl .
If the penalty degree q is larger than 1, the second term on the right side of (5.82) has
multiplicative terms of the form:
\[
\alpha_{l_1}^{(L+m-1)} \cdots \alpha_{l_{q-1}}^{(L+m-1)} \, K_\sigma(x_{L+m}, \cdot) \, K_\sigma(x_{l_1}, \cdot) \cdots K_\sigma(x_{l_{q-1}}, \cdot)
\]
Since such multiplicative terms do not exist on the left side of (5.82) they have to
be null, i.e. the penalty degree q should be set to 1. Thus, the update equations for the
expansion coefficients are:
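\[
\alpha_l^{(L+m)} = (1 - \eta_m) \, \alpha_l^{(L+m-1)} \quad \text{for } l = 1, \ldots, L + m - 1 \tag{5.83}
\]
\[
\alpha_{L+m}^{(L+m)} =
\begin{cases}
0 & \text{if } g^{(L+m-1)}(x_{L+m}) \geq 0 \\
\eta_m \, y_{L+m} & \text{otherwise}
\end{cases}
\tag{5.84}
\]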
If the membership of the most recent training element, namely xL+m , is correctly
determined by $f^{(L+m-1)}$, its corresponding expansion coefficient $\alpha_{L+m}^{(L+m)}$ is set to zero. Given
the updating equation (5.83) it is clear that αL+m will remain equal to zero. Thus, xL+m
does not contribute to the decision function f and can be safely “forgotten”.
On the other hand, if the membership of xL+m is wrongly determined, its corresponding
expansion coefficient is set to ηm yL+m , i.e. xL+m becomes a support vector. In Section 5.3.5
we have seen that the expansion coefficient associated with a support vector whose
membership is wrongly decided is equal to $\frac{1}{L+m} y_{L+m}$. This suggests that ηm should be
set equal to $\frac{1}{L+m}$. However, for m large enough ηm approaches zero, which makes
the contribution of xL+m insignificant. This is certainly not suitable as f would be
determined by the first training elements only. To deal with this problem we consider the fact
that only the support vectors determine the membership function. The effective number
of training elements when xL+m becomes available is then equal to the number of support
vectors just after the (m − 1)-th updating is completed (this number is denoted
as $\mathrm{NSV}^{(m-1)}$). Therefore, the coefficient η at the m-th updating is given by:
\[
\eta_m = \frac{1}{\mathrm{NSV}^{(m-1)} + 1} \tag{5.85}
\]
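We now turn to assessing the evolution, after m̂ steps, of an expansion coefficient which
appeared at the m-th updating. Using the updating equations (5.83) and (5.84) we have:
\begin{align}
\alpha_{L+m}^{(L+m+\hat{m})} &= \eta_m \, y_{L+m} \prod_{l=1}^{\hat{m}} (1 - \eta_{m+l}) \notag \\
&= \eta_m \, y_{L+m} \prod_{l=1}^{\hat{m}} \frac{\mathrm{NSV}^{(m+l-1)}}{\mathrm{NSV}^{(m+l-1)} + 1} \tag{5.86}
\end{align}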
• The memberships of xL+m+1 , . . . , xL+m+m̂ are wrongly determined, i.e. the number
of support vectors increases at each updating. Then, after the m̂-th updating, αL+m
becomes:
\[
\alpha_{L+m}^{(L+m+\hat{m})} = \eta_m \, y_{L+m} \prod_{l=1}^{\hat{m}} \frac{l - 1 + \mathrm{NSV}^{(m)}}{l + \mathrm{NSV}^{(m)}} = \eta_m \, y_{L+m} \frac{\mathrm{NSV}^{(m)}}{\hat{m} + \mathrm{NSV}^{(m)}} \tag{5.87}
\]
• The memberships of xL+m+1 , . . . , xL+m+m̂ are correctly determined, i.e. the number of
support vectors remains constant at each updating. Then, after the m̂-th updating,
αL+m becomes:
\[
\alpha_{L+m}^{(L+m+\hat{m})} = \eta_m \, y_{L+m} \prod_{l=1}^{\hat{m}} \frac{\mathrm{NSV}^{(m)}}{1 + \mathrm{NSV}^{(m)}} = \eta_m \, y_{L+m} \left( \frac{\mathrm{NSV}^{(m)}}{1 + \mathrm{NSV}^{(m)}} \right)^{\hat{m}} \tag{5.88}
\]
In both cases, $\alpha_{L+m}^{(L+m+\hat{m})}$ tends to zero as m̂ grows to infinity. However, (5.88) converges
exponentially to zero while (5.87) decays only as 1/m̂. Thus, we can state that the forgetting
speed of an expansion coefficient is approximately determined by the number of correct
membership predictions in which it participated. Using this fact, we can approximate the
number of correct decisions M̂ that an expansion coefficient, associated with the training
element xL+m , “survives”. Let ε be the machine precision, then (5.88) yields:
\[
\varepsilon \geq \left( \frac{\mathrm{NSV}^{(m)}}{1 + \mathrm{NSV}^{(m)}} \right)^{\hat{M}} \tag{5.89}
\]
\[
\hat{M} = \left\lceil \frac{\log \varepsilon}{\log \dfrac{\mathrm{NSV}^{(m)}}{1 + \mathrm{NSV}^{(m)}}} \right\rceil \tag{5.90}
\]
where ⌈·⌉ is the ceiling function, i.e. the function that gives the smallest integer that is not
smaller than its argument.
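The forgetting horizon (5.90) is easy to evaluate numerically. The sketch below assumes a constant number of support vectors, as in the second case above; the function name `forgetting_horizon` is hypothetical.

```python
import math

def forgetting_horizon(n_sv, eps):
    """Smallest M such that (n_sv / (1 + n_sv))**M <= eps, as in (5.90)."""
    ratio = n_sv / (1.0 + n_sv)
    return math.ceil(math.log(eps) / math.log(ratio))
```

The horizon grows roughly in proportion to the number of support vectors, so a larger set of support vectors forgets old elements more slowly.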
5.7. Summary 107
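\[
\rho^{(L+m)} =
\begin{cases}
\rho^{(L+m-1)} + \eta_m \nu & \text{if } g^{(L+m-1)}(x_{L+m}) \geq 0 \\
\rho^{(L+m-1)} - \eta_m (1 - \nu) & \text{otherwise}
\end{cases}
\tag{5.91}
\]
\[
b^{(L+m)} =
\begin{cases}
b^{(L+m-1)} & \text{if } g^{(L+m-1)}(x_{L+m}) \geq 0 \\
b^{(L+m-1)} + \eta_m \, y_{L+m} & \text{otherwise}
\end{cases}
\tag{5.92}
\]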
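One complete updating step, combining (5.83), (5.84), (5.85), (5.91) and (5.92) for q = 1, can be sketched as follows. The sketch assumes a Gaussian kernel in the common form exp(−‖x − x′‖² / (2σ²)) and stores only the vectors with non-zero coefficients; the function names and the list-based storage are choices made for this illustration.

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    """Gaussian kernel, assumed form exp(-||x1 - x2||^2 / (2 sigma^2))."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def online_update(sv_x, alpha, rho, b, x_new, y_new, nu, sigma):
    """One updating step for a new labelled feature vector (x_new, y_new)."""
    n_sv = sum(1 for a in alpha if a != 0.0)
    eta = 1.0 / (n_sv + 1)                       # (5.85)
    f_new = sum(a * gaussian_kernel(xl, x_new, sigma)
                for a, xl in zip(alpha, sv_x))
    g = -rho + y_new * (f_new + b)
    alpha = [(1.0 - eta) * a for a in alpha]     # (5.83): decay old coefficients
    if g >= 0:                                   # membership correctly determined
        rho = rho + eta * nu                     # (5.91), first case
    else:                                        # x_new becomes a support vector
        sv_x = sv_x + [x_new]
        alpha = alpha + [eta * y_new]            # (5.84)
        rho = rho - eta * (1.0 - nu)             # (5.91), second case
        b = b + eta * y_new                      # (5.92), second case
    return sv_x, alpha, rho, b
```

A correctly recognized element only shrinks the existing coefficients and raises ρ slightly, whereas a misrecognized one is appended to the stored set.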
5.7 Summary
The recognition of the mental activities that are used to control the BCI is carried out in
feature vector spaces that are subject and mental activity dependent. Thus, each mental
activity has an associated feature vector space in which we define a target set, composed
of the feature vectors produced during the performance of the targeted mental activity.
The recognition goal is to determine the membership of a feature vector with respect to
the target set. This is done by means of the membership parameters, namely the member-
ship function, threshold and offset. These parameters are estimated in a supervised way,
i.e. using a set of (training) feature vectors whose membership is known.
Since the shape of the target set can change because of different environmental and
user related conditions including the adaptation of the subject to the BCI, the membership
parameters need to be updated as new training data become available while forgetting the
contribution of old training feature vectors.
In this chapter we have presented an efficient method to estimate the membership
parameters and dynamically update them. The method is based on the minimization of a
regularized version of the risk functional which is a measure of the inadequacy of a given
estimation.
The regularization of the risk is made possible by the introduction of an RKHS to which
the membership function belongs. Such an RKHS has the property of having a kernel
function that generates it. By means of the kernel function each feature vector can be
transformed into a function in the RKHS.
In addition to making the regularization possible, the RKHS provides the membership
function with a particularly flexible structure as a linear combination of the kernel functions
associated with the training elements. Thanks to this structure, a geometrical interpretation
of the membership parameters in terms of a separating hyperplane can be derived and their
updating equations are particularly straightforward.
The RKHS properties and the ability to classify the feature vectors into classes defined by
their membership depend on the choice of the kernel function. A suitable kernel function
is the Gaussian kernel, which makes it easy to control over-fitting and generalization
through its parameter σ. This parameter is selected using a theoretical bound on the
generalization error.
Protocols and evaluation 6
“It is a capital mistake to theorize before one has data” Sir Arthur Conan Doyle
6.1 Introduction
In previous chapters we presented the process through which a vector of memberships is
obtained from an EEG-trial free of artifacts. This vector is sent to the action generation
module (see Fig. 6.1) which, in accordance with a set of rules (action rules), produces
commands that act on a computer-rendered environment (CRE). These rules are set exper-
imentally and depend on the MAs used to operate the BCI and the subject performance.
Henceforth, unless otherwise specified, the term MA refers to a mental activity used to
operate the BCI.
Figure 6.1. The vector of memberships computed by the pattern recognition module is sent to the
action generation one which, in accordance with a set of rules (action rules), produces commands
that act on a computer rendered environment. The action rules are set experimentally depending
on the MAs and on the subject control skills.
110 Chapter 6. Protocols and evaluation
Figure 6.2. Electrodes of the ten-twenty international system at which EEG was measured. Elec-
trode Cz was taken as physical reference. As mentioned in (Chapter 3, Section 3.2.3), the signals
were re-referenced with respect to their average.
In this chapter we apply the preprocessing, feature extraction, and recognition algo-
rithms presented in previous chapters, for the training of six subjects who participated in
nine training sessions, in the framework of an asynchronous 2D object positioning applica-
tion.
The first three sessions served to set up initial recognition models for each MA, determine
the optimal feature extraction methods or mappings (see Chapter 4), and establish the
action rules.
In the next training sessions, feedback was provided, indicating to the subjects how well
the BCI recognized the MAs they were asked to perform. At the end of each session the
recognition models were updated. The controlling skills acquired by the subjects in these
sessions were assessed through positioning tests in which the subjects had to move an object
on the screen (by performing the trained MAs) to reach a goal. The training schedule was
adjusted according to the recognition error associated with each MA. Thus, those MAs
associated with high recognition errors were trained more often.
MA1: Left index finger imagination Left index finger movement imagination
MA2: Right index finger imagination Right index finger movement imagination
the two types of training sessions and report the corresponding results.
Three training-without-feedback sessions were carried out before the training-with-
feedback sessions, in order to collect enough data to estimate the recognition models.
Figure 6.3. CRE in which the object positioning application takes place. The spaceship moves to
the left, right, up, and down when the BCI recognizes MA1, MA2, MA3, and MA4 respectively.
The CRE is a one-hundred step square where a step corresponds to the smallest movement that the
spaceship can execute.
Figure 6.5. Visual cues used to indicate the MA that has to be performed during an active-time.
The absence of the sun and spaceship signaled the outset of a break-time whose EEG-trials were
considered as produced during the performance of MA0.
MA. The end of an active-time and consequently the outset of a break-time was signaled
by the absence of the spaceship and the target in the CRE. The MA corresponding to each
active-time was randomly chosen among the four MAs. Yet, we ensured that each MA was
requested at least 22 times in each training-without-feedback session.
The EEG-trial duration and action period (see Chapter 2, Section 2.4 for the definition
of these parameters) were set to 2000 and 500 milliseconds respectively. Using these values,
nine EEG-trials per active-time and an average of four EEG-trials per break-time were
potentially available. These figures are upper bounds, since an active-time or break-time was
discarded if an artifact was detected in it. The presence of an ocular or muscular artifact was
signaled to the subject by a vertical or horizontal oscillation of the sun, respectively. The
number of EEG-trials per MA that were available at the end of the third
training-without-feedback session is reported in Table 6.2 for each subject.
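The trial counts quoted above follow from simple sliding-window arithmetic. The sketch below assumes that a new EEG-trial becomes available every action period once a first full trial has been recorded; the function names are hypothetical, and the interval durations it derives are inferred from the quoted counts, not stated in this passage.

```python
def trials_in_interval(interval_ms, trial_ms=2000, period_ms=500):
    """Number of complete EEG-trials (overlapping windows of trial_ms,
    shifted by period_ms) that fit in an interval of interval_ms."""
    if interval_ms < trial_ms:
        return 0
    return (interval_ms - trial_ms) // period_ms + 1


def interval_for_trials(n_trials, trial_ms=2000, period_ms=500):
    """Shortest interval (in ms) that yields n_trials complete EEG-trials."""
    return trial_ms + (n_trials - 1) * period_ms


# Nine trials per active-time correspond to a 6000 ms interval under
# these assumptions; four trials correspond to 3500 ms.
active_ms = interval_for_trials(9)
break_ms = interval_for_trials(4)
```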
MA S1 S2 S3 S4 S5 S6
The parameters and the dimension of the corresponding feature vectors (denoted as D) of
the mappings based on linear prediction parametric models, i.e. the autoregressive (denoted
as ψAR ; see Chapter 4, Section 4.5), non-stationary AR (denoted as ψNAR ; see Chapter 4,
Section 4.6), and multivariate AR (denoted as ψMVAR ; see Chapter 4, Section 4.7) are
reported in Table 6.4. Notice that we chose the same orders Qm and Um for each channel.
In fact, such values were chosen as the maximum of the values given by the MDL (see
Section 4.5) criterion for each channel.
The optimal mapping for the recognition of a given MA was chosen by considering the
recognition error associated with each possible mapping. We recall that an EEG-trial is
considered as wrongly recognized if its membership is wrongly decided by the membership
function (see Chapter 5, Section 5.5). Thus, among the recognition errors associated with
each mapping, the optimal mapping was the one associated with the lowest recognition
error.
Table 6.4. Parameters of the mappings based on linear prediction models for each subject
Mapping S1 S2 S3 S4 S5 S6
To obtain the recognition error associated with a given mapping and MA (which we call
targeted MA), we proceeded as follows. Sixty percent of the EEG-trials (positive trials)
available for the targeted MA and the same number of EEG-trials (negative trials) randomly
chosen among the other MAs and MA0 were used to build the recognition models1 using the
algorithm explained in Chapter 5, Section 5.3. The rest of the positive EEG-trials and an
equal number of EEG-trials (which were not used to build the recognition models) randomly
chosen among the other MAs were used to determine the recognition error. To improve
the quality of the recognition error estimate, the cross-validation procedure described in
Chapter 5, Section 5.5.1 was used. Figure 6.6 shows the recognition errors for each mapping,
MA, and subject. The recognition errors are reported in fractions, i.e. a recognition error
equal to 0.5 implies that 50% of the tested EEG-trials were wrongly recognized.
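The splitting procedure described above can be sketched as follows. This is a simplified illustration, not the thesis implementation: it draws training and test negatives from a single pool (the text distinguishes the pools slightly), uses a fixed random seed for reproducibility, and the name `split_trials` is hypothetical.

```python
import random


def split_trials(positive, negative, train_fraction=0.6, seed=0):
    """Split trials into balanced training and test sets.

    Sixty percent of the positive trials and an equal number of randomly
    chosen negative trials form the training set. The remaining positive
    trials and an equal number of unused negative trials form the test set.
    """
    rng = random.Random(seed)
    pos = list(positive)
    neg = list(negative)
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_train = round(train_fraction * len(pos))
    n_test = len(pos) - n_train
    if len(neg) < n_train + n_test:
        raise ValueError("not enough negative trials for a balanced split")

    train = (pos[:n_train], neg[:n_train])
    # Test negatives are taken from trials not used for training.
    test = (pos[n_train:], neg[n_train:n_train + n_test])
    return train, test


# Hypothetical trial identifiers standing in for feature vectors.
train_set, test_set = split_trials(range(100), range(1000, 1300))
```

Repeating the split with different seeds and averaging the resulting errors gives a rough analogue of the cross-validation step mentioned in the text.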
Table 6.5 shows the optimal mapping and the associated recognition error for each MA
and subject. From the results reported in this table, it appears that the optimal mapping
choice for each MA is subject dependent. It is worth noticing that for each subject a
dominant mapping can be distinguished. In particular, for subjects S1 and S3 the ψNAR
and ψY mappings, respectively, are the optimal ones for each MA.
Figures 6.7 to 6.12 depict the experimental distribution (computed using the optimal
mappings reported in Table 6.5) of the normalized membership (see Chapter 5, Section 5.2),
associated with each MA for subjects S1 to S6 respectively. We recall that a positive EEG-
trial is correctly recognized when its normalized membership is larger than one. On the
other hand, a negative EEG-trial is correctly recognized when its normalized membership
is smaller than minus one.
It is worth mentioning that positive and negative EEG-trials are relative to the targeted
MA. Thus, an EEG-trial generated during the performance of MA1 is a positive trial with
respect to MA1 and a negative one with respect to any other MA.
1 As a matter of fact, the recognition models are composed of the membership parameters, namely the
offset, threshold, and membership function (see Chapter 5 for details).
[Plots for Figure 6.6 omitted: recognition error (0 to 0.5) versus mapping (ψP, ψC, ψAR, ψNAR, ψY, ψMVAR).]
Figure 6.6. Recognition errors for each mapping, MA, and subject. The values are reported
in fractions, i.e. a recognition error equal to 0.5 implies that 50% of the tested EEG-trials were
wrongly recognized. The numerical values presented in this figure are reported in Appendix B,
Section B.1. For a given MA, the mapping providing the smallest recognition error is chosen as
the optimal mapping for this MA. For instance, in the case of subject S1, the non-stationary AR
mapping constitutes the optimal one for each MA.
The optimal mappings and the corresponding recognition errors for each MA, and subject are
reported in Table 6.5.
Table 6.5. Choice of the optimal mapping for each MA and subject. The numbers in parentheses
are the associated recognition errors.
MA    S1            S2            S3          S4             S5          S6
MA2   ψNAR (0.125)  ψNAR (0.239)  ψY (0.306)  ψNAR (0.196)   ψC (0.157)  ψNAR (0.172)
MA4   ψNAR (0.133)  ψNAR (0.287)  ψY (0.214)  ψMVAR (0.294)  ψC (0.295)  ψY (0.065)
[Plots for Figure 6.7 omitted: histograms of the normalized membership for MA1 to MA4 (positive and negative trials).]
Figure 6.7. Subject S1: Distribution of the normalized memberships corresponding to positive and
negative EEG-trials for each MA.
[Plots for Figure 6.8 omitted: histograms of the normalized membership for MA1 to MA4 (positive and negative trials).]
Figure 6.8. Subject S2: Distribution of the normalized memberships corresponding to positive and
negative EEG-trials for each MA.
[Plots for Figure 6.9 omitted: histograms of the normalized membership for MA1 to MA4 (positive and negative trials).]
Figure 6.9. Subject S3: Distribution of the normalized memberships corresponding to positive and
negative EEG-trials for each MA.
[Plots for Figure 6.10 omitted: histograms of the normalized membership for MA1 to MA4 (positive and negative trials).]
Figure 6.10. Subject S4: Distribution of the normalized memberships corresponding to positive and
negative EEG-trials for each MA.
[Plots for Figure 6.11 omitted: histograms of the normalized membership for MA1 to MA4 (positive and negative trials).]
Figure 6.11. Subject S5: Distribution of the normalized memberships corresponding to positive and
negative EEG-trials for each MA.
[Plots for Figure 6.12 omitted: histograms of the normalized membership for MA1 to MA4 (positive and negative trials).]
Figure 6.12. Subject S6: Distribution of the normalized memberships corresponding to positive and
negative EEG-trials for each MA.
6.4. Action rules 121
where uk is the smallest value of ζk for positive trials, vk corresponds to the intersection
between the positive and negative trials distribution of ζk (see Fig. 6.13), and Mstp is the
maximum number of steps that the spaceship is allowed to move in a single action. In our
experiments Mstp was equal to 8.
The slope of kk was set so as to have kk = 1 for ζk = vk . The number of steps that the
spaceship moved in the direction corresponding to MAk was equal to the nearest integer
function of kk , namely nint(kk ).
For each EEG-trial, four strengths: k1 , . . . k4 were computed. The spaceship moved
by the corresponding number of steps in each direction. Notice that in a single action the
spaceship could simultaneously move in several directions. This possibility was allowed
during the positioning tests (Section 6.5). However, in training-with-feedback sessions only
the strength associated with the trained MA was considered.
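As an illustration of the action rule, the sketch below gives one possible reading of the description above: the strength is zero up to the smallest membership observed for positive trials (u), grows linearly so that it equals one at the intersection point (v), and saturates at the maximum number of steps. The defining equation is not reproduced in this passage, so the piecewise-linear form, the requirement v > u, and the function names are assumptions.

```python
import math


def action_strength(zeta, u, v, max_steps=8):
    """Assumed piecewise-linear action strength.

    Zero for zeta <= u, equal to one at zeta == v, and clipped at
    max_steps (8 in the experiments described in the text).
    """
    if v <= u:
        raise ValueError("this sketch assumes v > u")
    slope = 1.0 / (v - u)
    return min(max(slope * (zeta - u), 0.0), float(max_steps))


def steps(zeta, u, v, max_steps=8):
    """Number of steps: nearest integer to the action strength."""
    return math.floor(action_strength(zeta, u, v, max_steps) + 0.5)


# Hypothetical values of u and v for one MA.
u_k, v_k = -1.0, 1.0
moves = [steps(z, u_k, v_k) for z in (-2.0, 1.0, 3.0, 100.0)]
```

With these hypothetical values, memberships of -2, 1, 3, and 100 give 0, 1, 2, and 8 steps. In a positioning test the same computation would be applied to each of the four memberships of an EEG-trial, one per direction.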
Figure 6.13. Action strength (kk ) associated with MAk. The number of steps that the spaceship
moved in the direction corresponding to MAk was equal to the nearest integer function of kk .
The action strength depended on the experimental distribution of ζk . In particular, the value of the
smallest ζk for positive trials (denoted as uk ) and the value of ζk corresponding to the intersection
of the positive and negative trials distribution of ζk (denoted as vk ) determined the shape of kk .
The action strength was limited by the maximum number of steps (Mstp ) that the spaceship was
allowed to move in a single action.
The second part consisted in assessing the controlling skills acquired during the session
by running two positioning tests. In each run, the positions of the target and the spaceship
were randomized in such a way that at least fifty steps were necessary for the spaceship
to reach the target. It is important to mention that the horizontal and vertical distances
between the target and spaceship were both equal to twenty-five steps. As two bits are
necessary to encode a single step in a given direction (up, down, left, or right), the number
of bits that are needed to encode any optimal trajectory between the initial position of the
spaceship and the target is equal to one-hundred bits. By measuring the average time a
subject took to reach the target we can experimentally measure the bit-rate reached in the
session.
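The arithmetic behind the one-hundred-bit figure and the experimental bit rate can be written out as a short sketch. The function names are hypothetical, and the 240-second example time is illustrative and not a measured value.

```python
def trajectory_bits(dx_steps=25, dy_steps=25, bits_per_step=2):
    """Bits needed to encode an optimal trajectory: one step per unit of
    horizontal or vertical distance, two bits per step (four directions)."""
    return (abs(dx_steps) + abs(dy_steps)) * bits_per_step


def experimental_bit_rate(mean_time_s, bits=100):
    """Experimental bit rate in bits per minute, given the average time
    (in seconds) needed to reach the target."""
    return bits * 60.0 / mean_time_s


bits_needed = trajectory_bits()           # 50 steps at 2 bits each
example_rate = experimental_bit_rate(240.0)  # illustrative 4-minute average
```

Under these definitions, reaching the target in four minutes on average corresponds to 25 bits per minute.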
As in training-without-feedback sessions, the EEG-trials with artifacts were discarded.
In Appendix B, Section B.2 we report the number of EEG-trials that were available after
artifact detection for each MA and subject.
The recognition errors (for each MA) corresponding to each session were determined
using the recognition models updated at the end of the previous session. This makes sense
since feedback was provided using such models. The recognition error of the first training-
with-feedback session was determined using the recognition models built at the end of the
training-without-feedback sessions.
The evolution through the sessions of the recognition errors associated with each MA
is reported in the four upper graphs in Figs. 6.15 to 6.20 for subjects S1 to S6 respectively.
The values represented in these curves are reported in Appendix B, Section B.2.
The recognition errors are represented with their two components, namely the false
negative (FN) and false positive (FP) fractions. The FN is the fraction of positive EEG-trials
whose normalized membership was smaller than one, and the FP is the fraction
of negative EEG-trials whose normalized membership was larger than minus one. The
recognition error is related to FN and FP by means of the following equation.
\[
\text{Recognition error} = \frac{\mathrm{FN}}{1 + N_n/N_p} + \frac{\mathrm{FP}}{1 + N_p/N_n} \qquad (6.2)
\]
where Nn and Np are the numbers of negative and positive EEG-trials, respectively, that were
used to compute the recognition error.
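The relation in Eq. (6.2) amounts to weighting FN and FP by the proportions of positive and negative trials, so the recognition error equals the overall fraction of wrongly recognized trials. A minimal sketch, with hypothetical function names and toy memberships, follows; how a membership exactly equal to one or minus one is counted is an assumption here.

```python
def fn_fp_fractions(pos_memberships, neg_memberships):
    """False negative and false positive fractions from normalized memberships.

    A positive trial is counted as wrong when its membership is below one;
    a negative trial is counted as wrong when its membership is above minus one.
    """
    fn = sum(1 for z in pos_memberships if z < 1.0) / len(pos_memberships)
    fp = sum(1 for z in neg_memberships if z > -1.0) / len(neg_memberships)
    return fn, fp


def recognition_error(fn, fp, n_pos, n_neg):
    """Recognition error as in Eq. (6.2)."""
    return fn / (1.0 + n_neg / n_pos) + fp / (1.0 + n_pos / n_neg)


# Toy memberships: one wrong trial out of four in each group.
pos = [1.5, 0.5, 2.0, 1.2]
neg = [-1.5, -0.5, -2.0, -3.0]
fn, fp = fn_fp_fractions(pos, neg)
err = recognition_error(fn, fp, len(pos), len(neg))
```

In this toy case two of the eight trials are wrongly recognized, and the formula returns 0.25.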
To globally evaluate a session, we computed the theoretical bit rate (in bits per minute)
by adapting the formula given in Chapter 2, Section 2.10 as follows.
\[
\text{Bit rate} = \frac{60}{T_{act}} \left[ \log_2 N_{MA} + (1 - p_e)\log_2(1 - p_e) + p_e \log_2 \frac{p_e}{N_{MA} - 1} \right] \qquad (6.3)
\]
where pe is the mean recognition error over the MAs, NMA is the number of MAs (equal
to four), and Tact is the action period in seconds (equal to 0.5 seconds). The theoretical
bit rates for each session are represented on the lower left graph in Figs. 6.15 to 6.20 for
subjects S1 to S6 respectively.
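Equation (6.3) can be evaluated directly, as in the sketch below. The function name is hypothetical, and treating pe = 0 as the limiting case (where the two entropy terms vanish) is an implementation choice made here to avoid taking the logarithm of zero.

```python
import math


def theoretical_bit_rate(p_e, n_ma=4, t_act=0.5):
    """Theoretical bit rate in bits per minute, following Eq. (6.3).

    p_e   : mean recognition error over the MAs (0 <= p_e < 1)
    n_ma  : number of mental activities
    t_act : action period in seconds
    """
    if not 0.0 <= p_e < 1.0:
        raise ValueError("p_e must lie in [0, 1)")
    bits_per_action = math.log2(n_ma)
    if p_e > 0.0:
        bits_per_action += (1.0 - p_e) * math.log2(1.0 - p_e)
        bits_per_action += p_e * math.log2(p_e / (n_ma - 1))
    return 60.0 / t_act * bits_per_action


perfect = theoretical_bit_rate(0.0)    # upper limit: 2 bits every 0.5 s
chance = theoretical_bit_rate(0.75)    # chance level for four MAs
typical = theoretical_bit_rate(0.10)   # illustrative error value
```

With four MAs and an action period of 0.5 seconds the rate is bounded by 240 bits per minute and drops to zero at chance level (pe = 0.75). An illustrative mean error of 0.1 gives roughly 165 bits per minute.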
Since the bit rate computed by means of (6.3) considers EEG-trials that are free of
artifacts, it constitutes an overoptimistic estimate of the bit rate that can be achieved
during actual BCI operation. Therefore, two positioning tests were carried out at the end of
each session. As mentioned before, an experimental estimate of the bit rate can be obtained
by dividing one hundred (i.e. the number of bits needed to encode an optimal trajectory)
by the average time spent in reaching the target. The lower right graph in Figs. 6.15 to 6.20
depicts the bit rates experimentally estimated for subjects S1 to S6 respectively. Missing
values correspond to those positioning tests in which the target could not be reached. These
situations arose because the subjects were free to interrupt the experiment at any time.
The times spent in reaching the target for each positioning test, session, and subject are
reported in Appendix B, Section B.2.
[Plots for Figure 6.15 omitted: error evolution panels (FN, FP, and error versus session) and
theoretical and experimental bit rate panels (bits/min versus session).]
Figure 6.15. Subject S1. Top four graphs: Evolution of the recognition errors, associated with MA1
to MA4, throughout training-with-feedback sessions. The recognition errors are reported along with
their two components, namely the false negative (FN) and false positive (FP) fractions. Bottom left:
Theoretical bit rate. Bottom right: Experimental bit rate. Missing values correspond to positioning
tests in which the target was not reached.
Numerical values presented in these graphs are reported in Appendix B, Section B.2.
6.5. Training with feedback 125
[Plots for Figure 6.16 omitted: error evolution panels (FN, FP, and error versus session) and
theoretical and experimental bit rate panels (bits/min versus session).]
Figure 6.16. Subject S2. Top four graphs: Evolution of the recognition errors, associated with MA1
to MA4, throughout training-with-feedback sessions. The recognition errors are reported along with
their two components, namely the false negative (FN) and false positive (FP) fractions. Bottom left:
Theoretical bit rate. Bottom right: Experimental bit rate.
Numerical values presented in these graphs are reported in Appendix B, Section B.2.
[Plots for Figure 6.17 omitted: error evolution panels (FN, FP, and error versus session) and
theoretical and experimental bit rate panels (bits/min versus session).]
Figure 6.17. Subject S3. Top four graphs: Evolution of the recognition errors, associated with MA1
to MA4, throughout training-with-feedback sessions. The recognition errors are reported along with
their two components, namely the false negative (FN) and false positive (FP) fractions. Bottom left:
Theoretical bit rate. Bottom right: Experimental bit rate. Missing values correspond to positioning
tests in which the target was not reached.
Numerical values presented in these graphs are reported in Appendix B, Section B.2.
[Plots for Figure 6.18 omitted: error evolution panels (FN, FP, and error versus session) and
theoretical and experimental bit rate panels (bits/min versus session).]
Figure 6.18. Subject S4. Top four graphs: Evolution of the recognition errors, associated with MA1
to MA4, throughout training-with-feedback sessions. The recognition errors are reported along with
their two components, namely the false negative (FN) and false positive (FP) fractions. Bottom left:
Theoretical bit rate. Bottom right: Experimental bit rate.
Numerical values presented in these graphs are reported in Appendix B, Section B.2.
[Plots for Figure 6.19 omitted: error evolution panels (FN, FP, and error versus session) and
theoretical and experimental bit rate panels (bits/min versus session).]
Figure 6.19. Subject S5. Top four graphs: Evolution of the recognition errors, associated with MA1
to MA4, throughout training-with-feedback sessions. The recognition errors are reported along with
their two components, namely the false negative (FN) and false positive (FP) fractions. Bottom left:
Theoretical bit rate. Bottom right: Experimental bit rate.
Numerical values presented in these graphs are reported in Appendix B, Section B.2.
[Plots for Figure 6.20 omitted: error evolution panels (FN, FP, and error versus session) and
theoretical and experimental bit rate panels (bits/min versus session).]
Figure 6.20. Subject S6. Top four graphs: Evolution of the recognition errors, associated with MA1
to MA4, throughout training-with-feedback sessions. The recognition errors are reported along with
their two components, namely the false negative (FN) and false positive (FP) fractions. Bottom left:
Theoretical bit rate. Bottom right: Experimental bit rate. Missing values correspond to positioning
tests in which the target was not reached.
Numerical values presented in these graphs are reported in Appendix B, Section B.2.
6.5.1 Discussion
The evolution of the recognition errors throughout the training-with-feedback sessions for
each MA and subject (see Figs. 6.15 to 6.20) shows a clear downwards trend. Figure 6.21
depicts the relative decrease of the recognition error between two consecutive sessions for
each MA, and subject. The relative decrease of the recognition error associated with MAk,
between sessions i + 1 and i ∈ {1, . . . , 5} was obtained by subtracting the recognition
error associated with MAk corresponding to session i + 1 from that of session i. Negative
values of the relative decrease indicate that the recognition error increased with respect to
that of the previous session. Only a few negative values appear in Fig. 6.21. Notice that
increases in the recognition error never affected the whole set of MAs. This explains why
the theoretical bit rate, which can be thought of as an aggregate measure of the recognition
errors associated with each MA, exhibits an upwards trend.
It is worth mentioning that whereas the false negative and false positive fractions do
not necessarily exhibit a strict downwards trend (this is especially true for subject S1, see
Fig. 6.15), the updating of the recognition models ensures that the recognition errors do not
increase, or at least not to the same extent. This sort of automatic control clearly appears
in Fig. 6.15, in which the increases in the false negative fraction curves are countered by
corresponding decreases in the false positive fraction curves.
Figure 6.22 shows the relative increase of the theoretical (top) and experimental (bottom)
bit rate over sessions for each subject. The relative increase of the theoretical (experimental)
bit rate between sessions i + 1 and i was obtained by subtracting the theoretical (experimental)
bit rate of session i from that of session i + 1. The missing values in
the experimental bit rates were replaced by the values corresponding to previous sessions,
i.e. we assume that the controlling skills were maintained, since the subjects themselves
decided to interrupt the positioning tests. This hypothesis of retained controlling skills seems
reasonable, as the subsequent experimental bit rates always increase.
The theoretical bit rates for each subject almost always increase; the exception is
the relative increase of the theoretical bit rate between sessions two and one for subject S1,
which exhibits a slight decrease. This is confirmed by Fig. 6.21, in which the relative error
decrease between sessions two and one for subject S1 takes negative values for three out of
the four MAs. Thus, the global theoretical evaluation of the sessions indicates that the
information transfer (measured by the bit rate), and consequently BCI operation, improved
over sessions.
A similar upwards trend is observed for the experimental bit rate. Since missing values
were replaced by those corresponding to previous sessions, null increases correspond to
those sessions. Otherwise, the relative increase of the experimental bit rate is always positive,
meaning that when subjects carried out the positioning tests they always improved on their
previous performance.
At the end of the sixth session subjects reached experimental bit rates of 26, 22, 21,
27, 35, and 19 bits per minute respectively. Thus, an average of 25 bits per minute was
achieved. This result situates our work among the most outstanding in the BCI research
community (see Table 2.1). It is worth mentioning that while the theoretical bit rate gives
much higher values, the experimental bit rate presents the advantage of being measured in
the framework of a real application, and thus constitutes a closer approximation to the real
information transfer rate.
[Plots for Figure 6.21 omitted: one panel per subject (S1 to S6) showing the relative error decrease
throughout sessions for MA1 to MA4 versus session indexes (2-1, 3-2, 4-3, 5-4, 6-5).]
Figure 6.21. Relative error decrease over training-with-feedback sessions for each MA and subject.
The relative decrease of the recognition error associated with MAk, between sessions i + 1 and
i ∈ {1, . . . , 5} was obtained by subtracting the recognition error of MAk corresponding to session
i + 1 from that of session i. Negative values of the relative decrease indicate that the recognition
error increased with respect to that of the previous session.
6.6 Summary
In this chapter we applied the artifact detection, feature extraction, and pattern recognition
algorithms developed in previous chapters to the training of six subjects throughout nine
training sessions in the framework of an asynchronous 2D positioning application. Four
mental activities were used to move an object on the screen in four possible directions,
namely left, right, up, and down.
The first three sessions served to build the initial recognition models through the choice
of the optimal mappings for each MA and subject. Moreover, the action rules which
determined the BCI operation were set with respect to the distribution of the normalized
memberships associated with each MA.
The next six sessions were used to adjust the BCI recognition models and improve user
controlling skills by means of feedback. The training was scheduled in such a way
that those MAs that had large recognition errors were trained more often.
In addition, in each of the last six sessions positioning tests were carried out to evaluate
the subject controlling skills in a real application. The bit rate was estimated both
theoretically and experimentally. Both estimates exhibited clear upwards trends throughout the
sessions.
[Plots for Figure 6.22 omitted: panels "Theoretical bit rate increase" and "Experimental bit rate
increase" (bits per minute versus session indexes 2-1, 3-2, 4-3, 5-4, 6-5) for subjects S1 to S6.]
Figure 6.22. Theoretical (top) and experimental (bottom) bit rate increase over training-with-feedback
sessions for each subject. The relative increase of the theoretical (experimental) bit rate between
sessions i + 1 and i was obtained by subtracting the theoretical (experimental) bit rate of
session i from that of session i + 1. The missing values in the experimental bit rates were replaced
by the values corresponding to previous sessions, i.e. we assume that the controlling skills were
maintained.
Conclusions 7
In this chapter we review the most important issues and contributions presented in this
thesis. Then we discuss possible extensions and continuations to the presented work.
The objectives of this thesis were to:
• Design and develop an asynchronous operant conditioning based BCI system which
implements three adaptation levels, namely initial adaptation to the subject’s signal
characteristics, continuous adjustment of the BCI to maintain subject’s controlling
skills and reduce the impact of possible EEG changes, and subject adaptation through
feedback.
• Ensure that the BCI is not controlled by other types of signals such as ocular and
muscular artifacts.
• An asynchronous operant conditioning BCI that operates with four mental activities
in the framework of a 2D object positioning application was developed. Such BCI
produces actions each half second based on the analysis of the last two second long
EEG segment (EEG trial).
134 Chapter 7. Conclusions
during a calibration procedure that took place before each experimental session. The
BCI did not attempt to generate an action from an EEG trial with an artifact in
it. Instead, it generated special actions to notify the subject which type of artifact
had been detected.
• Several types of feature extraction methods were considered to characterize the EEG
trials produced during each controlling MA. From a general framework that con-
sidered the generalized interaction between the univariate signals composing EEG,
different feature extraction methods (or mappings) were derived by assuming certain
hypotheses on the nature of EEG. For a given MA, the mapping associated with
the lowest recognition error was chosen. Thus, the BCI presented in this thesis used
multiple types of feature vectors to operate.
• Recognition of MAs from feature vectors was done through the use of kernel based
learning methods. We have developed an efficient theoretical framework which permits
the dynamic updating of the recognition model parameters as new training data
become available.
• Definition of action rules that adapt the BCI operation mode to the subject perfor-
mance. As the recognition models are dynamic these rules change accordingly.
• The algorithms and methods mentioned above were applied to the training of six
subjects who participated in nine training sessions. The controlling skills acquired
by subjects were measured using a theoretical and an experimental measure of the bit
rate. Over the sessions, both measures increased for each subject. At the end of
the ninth session, averages over subjects of 126 and 25 bits per minute, respectively,
were achieved. This result places our research among the best-performing systems
reported in the BCI community.
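To give a feeling for the orders of magnitude involved, the following minimal sketch computes a bit rate for a BCI with four mental activities that produces one action every half second. It assumes the definition of information per selection that is widely used in the BCI literature (equiprobable classes, errors spread uniformly over the wrong classes); the theoretical and experimental measures actually used in this thesis are defined in Chapter 6 and may differ. The function names are hypothetical.

```python
import math


def bits_per_action(n_classes: int, accuracy: float) -> float:
    """Information per selection in bits (assumed definition, see lead-in)."""
    if n_classes < 2 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("need n_classes >= 2 and 0 <= accuracy <= 1")
    bits = math.log2(n_classes)
    if accuracy > 0.0:  # skip the term whose limit is zero when accuracy == 0
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:  # skip the term whose limit is zero when accuracy == 1
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    return bits


def bits_per_minute(n_classes: int, accuracy: float, action_period_s: float) -> float:
    """Bit rate when one action is produced every action_period_s seconds."""
    return bits_per_action(n_classes, accuracy) * 60.0 / action_period_s


# Upper limit for four classes and one action every 0.5 s (perfect recognition).
print(bits_per_minute(4, 1.0, 0.5))
```

Under these assumptions the bit rate is bounded by 240 bits per minute, and it drops to zero when the recognition accuracy equals the chance level of 25%.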
7.2 Future directions
• In this thesis we have considered a hierarchical model for the recognition of MAs,
i.e. feature vector extraction and classification were done independently. A possible
way to consider the feature extraction and recognition problems simultaneously would
consist in applying the Bayesian framework, which is able to select those features that
lower the recognition error.
• Generative models such as hidden Markov models (HMMs) were considered in the
framework of synchronous BCI operation only. A generalization to asynchronous
operation can be made through ergodic HMMs. The training of such models could be
done in two phases. In the first phase, general states, i.e. states characterizing
all controlling MAs, can be identified. In the second phase, the transitions that
specifically characterize each MA can serve as the recognition model for that MA. The
advantage of this approach resides in the fact that through the study of statistical
properties of each state, valuable physiological insights into the nature of the MAs
can be obtained.
• In this thesis we have considered four MAs that were chosen in accordance with current
BCI studies which in turn made their choice based on hemispheric brain specialization
studies. In addition to the selection of the optimal feature spaces in which these MAs
can be recognized, it can be important to select the MAs as well. In this way, subjects
could select the MAs through which they are best able to operate the BCI.
• The improvements obtained by using more adequate signal processing and machine
learning algorithms aim at achieving large communication bandwidth as measured
by the information transfer rate. To some extent, the most promising avenues for
improvement will be determined by the particular application. Indeed, while higher
information transfer rate is clearly desirable, the design of intelligent applications can
handle much of the communication details. In this way, the subject can focus on
communicating goals rather than on the details of control.
• BCI research should adhere to standards for designing studies and for assessing and
comparing their results, both in the laboratory and in actual applications. In this
way, direct comparisons among different BCI designs will be facilitated.
• The degrees of freedom required for adaptive automation of cognitive tasks, pros-
thetics, and complex robotics may lie beyond the range of current BCI signals and
methods. However, BCIs can be used in combination with other human-computer
interface devices. In particular, through the study of the mutual influence (through
statistical measurements such as mutual information) between signals coming from
other input devices and EEG, one can determine the extent to which EEG can enrich
human-computer interaction.
• The position and number of electrodes can be optimized in accordance with physi-
ological considerations and evaluation criteria. Feature selection algorithms can be
used to rank the electrodes following their discriminative power among the MAs used
to operate the BCI. Yet, a minimum number of electrodes should be maintained in
order to cope with possible EEG changes.
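The mutual information mentioned above as a measure of the influence between EEG and other input devices can be estimated from paired observations. The sketch below is a plug-in (histogram) estimate for discrete sequences; it assumes that the continuous signals have already been quantized into a small number of levels, a step that is not shown. The function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete observations."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Two identical binary sequences share one bit; two unrelated ones share none.
print(mutual_information_bits([0, 1, 0, 1], [0, 1, 0, 1]))
print(mutual_information_bits([0, 0, 1, 1], [0, 1, 0, 1]))
```

With short recordings this estimator is biased upwards, so in practice a correction or a comparison against shuffled data would be needed.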
Appendix A
A.1 Membership boundary induced by the Gaussian kernel
In Chapter 5 we described the membership parameters in the functional space H. In
particular, we saw that the separating boundary between the images, under the map φ,
of the feature vectors belonging and not belonging to the target set is a hyperplane.
In this section we study the shape of the separating boundary in the feature vector
space X.
If the map φ uses a Gaussian kernel function, we know that a small value of the Gaussian
kernel parameter leads to a small fraction of training errors (FTE) (see Proposition 5.4). On
the other hand, in real applications the joint distribution of feature vectors and labels
associated with the training set does not necessarily reflect the real joint distribution.
In addition, training errors are possible (this is particularly true in the BCI framework).
Consequently, a very small FTE is neither required nor desirable, since the membership
parameters can then over-fit the training data and perform poorly in determining the
membership of unseen feature vectors.
In order to have a small FTE, one can intuitively understand that intricate separating
boundaries in X are needed. As a matter of fact, it can be shown that there is a connection
between the minimization of the regularized risk and the "complexity" of the decision
boundary [147]. Complexity in this context means that the separating boundary is highly
irregular and intricate.
To illustrate the connection between the complexity of the separating boundary in X
and the Gaussian kernel parameter we report the solutions obtained in the framework of a
2D toy problem.
In Fig. A.1 we depict the separating boundaries (between the dots and crosses) associated
with different values of the Gaussian kernel parameter normalized to the minimum
Euclidean distance in the training set (see Eq. 5.61). As can be seen, the smaller σr, the
more complex the decision boundary and the smaller the FTE.
As discussed in Proposition 5.4, the fraction of support vectors (FSV) decreases as
σ increases. In Fig. A.2 we report the distribution of the absolute value of the expansion
coefficients as a function of σr. For σr sufficiently small, the expansion coefficients are all
equal to ν/L, where L is the number of training elements and ν = 1/2 (see the proof of
Proposition 5.4), and consequently the FSV is equal to one.
In Section 5.3.5 we mentioned that the membership of a new feature vector is completely
determined by the support vectors because their associated expansion coefficients are
non-zero. In fact, the Gaussian kernel parameter determines the area of influence
of a support vector. For a small σ, the area of influence is small and a large
number of support vectors is required to define the separating boundary, which is highly
irregular. On the other hand, a large σ allows a support vector to have a strong influence
over a larger area, thus reducing the number of support vectors needed to define the
separating boundary.
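The notion of area of influence can be made concrete with a short numerical sketch. It assumes the Gaussian kernel has the common form K(x, y) = exp(−‖x − y‖² / (2σ²)) and that σr is the kernel parameter divided by the minimum Euclidean distance in the training set; the exact parametrization used in Chapter 5 may differ. The four training points and the function names are hypothetical.

```python
import math
from itertools import combinations


def min_pairwise_distance(points):
    """Minimum Euclidean distance between two distinct training points."""
    return min(math.dist(p, q) for p, q in combinations(points, 2))


def gaussian_kernel(x, y, sigma):
    """Assumed kernel form: exp(-||x - y||^2 / (2 sigma^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))


def mean_off_diagonal_kernel(points, sigma_r):
    """Average kernel value between distinct points for a normalized sigma."""
    sigma = sigma_r * min_pairwise_distance(points)
    values = [gaussian_kernel(p, q, sigma) for p, q in combinations(points, 2)]
    return sum(values) / len(values)


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for sigma_r in (0.1, 1.0, 20.0):
    print(sigma_r, mean_off_diagonal_kernel(points, sigma_r))
```

For a small σr the kernel between distinct points is practically zero, so each training point only influences its immediate neighbourhood; for a large σr the value approaches one and a few points can describe the whole boundary.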
In Fig. A.3 we report the evolution of the FSV, the FTE and the estimate of the
generalization error via cross-validation (GE) (see Section 5.5.1) for growing σ. As σ
increases, the FTE increases and the FSV decreases. However, the GE exhibits a minimum
for σr near eight. This value constitutes a good compromise between generalization and
training errors. The optimal choice of σ is determined according to the procedure detailed
in Section 5.5.2.
Figure A.1. Separating boundary (to discriminate between dots and crosses) for growing values
of the Gaussian kernel parameter normalized to the minimum Euclidean distance in the training
set (i.e. σr = σ/∆min). The dark region encloses the dots. As σr increases more training errors are
allowed and the separating boundary becomes more regular.
(Panels: σr = 3.96, 5.58, 7.87, 11.09, 15.64, 22.05; both axes range from −1 to 1.)

Figure A.2. Distribution of the absolute value of the expansion coefficients for growing values
of σr. As σr increases more expansion coefficients become equal to zero. For large values of σr
the distribution of the expansion coefficients becomes bimodal, i.e. concentrated in zero and 1/L (the
maximum allowed value).
(Panels: σr = 3.96, 5.58, 7.87, 11.09, 15.64, 22.05; horizontal axes range from 0 to 4 × 10^−3.)

A.2 Computing the radius of the smallest sphere containing the training data

\[ \min R^2 \tag{A.1} \]
constrained to:
\[ \| C - \phi(x_l) \|_{\mathcal{H}}^2 \le R^2, \qquad l = 1, \dots, L \tag{A.2} \]
\[ \Lambda_P = R^2 - \sum_{l=1}^{L} \lambda_l \left( R^2 - \| C - \phi(x_l) \|_{\mathcal{H}}^2 \right) \tag{A.3} \]
Computing the derivatives of ΛP with respect to R and C and setting them to zero
leads to
\[ \partial_R \Lambda_P = 0 \;\Rightarrow\; \sum_{l=1}^{L} \lambda_l = 1 \tag{A.4} \]
\[ \partial_C \Lambda_P = 0 \;\Rightarrow\; C = \sum_{l=1}^{L} \lambda_l \phi(x_l) \tag{A.5} \]
By replacing (A.4) and (A.5) in (A.3) we obtain the dual Lagrangian that should be
maximized with respect to λ1, . . . , λL. Then,
\[ R^2 = \max_{\lambda_1, \dots, \lambda_L} \; \sum_{l=1}^{L} \lambda_l - \sum_{l_1, l_2} \lambda_{l_1} \lambda_{l_2} K_\sigma(x_{l_1}, x_{l_2}) \tag{A.6} \]
subject to
\[ \sum_{l=1}^{L} \lambda_l = 1 \tag{A.7} \]
This convex quadratic problem [23] can be readily solved using standard quadratic
programming techniques [168].
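As an illustration of the dual problem (A.6)-(A.7), the following sketch maximizes the dual objective with a simple Frank-Wolfe iteration with exact line search instead of a general quadratic programming solver; this is a swapped-in technique chosen only because it fits in a few lines. It assumes that the multipliers are non-negative (as is standard for inequality constraints), so that the feasible set is the probability simplex, and that the Gaussian kernel has the form exp(−‖x − y‖² / (2σ²)). All function names are hypothetical.

```python
import math


def gaussian_gram(points, sigma):
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) (assumed form)."""
    return [[math.exp(-math.dist(p, q) ** 2 / (2.0 * sigma ** 2)) for q in points]
            for p in points]


def dual_objective(K, lam):
    """sum_l lam_l K_ll - sum_{l1,l2} lam_l1 lam_l2 K_l1l2 (K_ll = 1 for Gaussian)."""
    n = len(K)
    linear = sum(lam[i] * K[i][i] for i in range(n))
    quadratic = sum(lam[i] * lam[j] * K[i][j] for i in range(n) for j in range(n))
    return linear - quadratic


def smallest_sphere_radius_sq(K, max_iter=2000, tol=1e-12):
    """Approximate R^2 and the multipliers by Frank-Wolfe on the simplex."""
    n = len(K)
    lam = [1.0 / n] * n  # feasible start: uniform weights
    for _ in range(max_iter):
        K_lam = [sum(K[i][j] * lam[j] for j in range(n)) for i in range(n)]
        quad = sum(lam[i] * K_lam[i] for i in range(n))
        grad = [K[i][i] - 2.0 * K_lam[i] for i in range(n)]
        best = max(range(n), key=lambda i: grad[i])  # best simplex vertex
        # Directional derivative along d = e_best - lam (Frank-Wolfe gap).
        gap = grad[best] - sum(grad[i] * lam[i] for i in range(n))
        curvature = K[best][best] - 2.0 * K_lam[best] + quad  # d^T K d
        if gap <= tol or curvature <= 0.0:
            break
        step = min(1.0, gap / (2.0 * curvature))  # exact line search, clipped
        lam = [(1.0 - step) * w for w in lam]
        lam[best] += step
    return dual_objective(K, lam), lam


K = gaussian_gram([(0.0,), (0.5,), (1.0,)], sigma=1.0)
radius_sq, lam = smallest_sphere_radius_sq(K)
print(radius_sq, lam)
```

The convergence of this iteration is only sublinear, so for a large training set a dedicated quadratic programming routine, as suggested in the text, remains preferable.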
Figure A.3. Evolution of the fraction of support vectors (FSV), the fraction of training errors
(FTE) and the cross-validation estimate of the generalization error (GE) as a function of σr. While
the FSV and the FTE exhibit a monotonic behavior as a function of σr, the GE has a minimum
which corresponds to the tradeoff between generalization and the FTE.
(Curves: FSV, FTE and GE; horizontal axis σr from 2 to 22, vertical axis from 0 to 1.)

A.3 Computing the derivative of the theoretical bound B with respect to σ

A.3.1 Computing $\left. \frac{\partial R^2}{\partial \sigma} \right|_{\vec{\alpha}\ \text{fixed}}$
This derivative can be computed directly from (A.6) using the following lemma [26]:
Let u∗ ∈ Υ be the vector where the maximum in G(ζ) is attained. If this maximum is
unique then
\[ \frac{\partial G(\zeta)}{\partial \zeta} = (u^*)^t \frac{\partial v_\zeta}{\partial \zeta} - \frac{1}{2} (u^*)^t \frac{\partial A_\zeta}{\partial \zeta} u^* \]
In other words, it is possible to differentiate G with respect to ζ just as if u∗ did not depend
on ζ.
Thus, from (A.6) we have:
\[ \left. \frac{\partial R^2}{\partial \sigma} \right|_{\vec{\alpha}\ \text{fixed}} = - \sum_{l_1, l_2} \lambda_{l_1} \lambda_{l_2} \, \partial_\sigma K_\sigma(x_{l_1}, x_{l_2}) \tag{A.11} \]
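As a worked instance of (A.11), assume that the Gaussian kernel has the common form given in the first line below (the parametrization defined in Chapter 5 may differ, in which case only the factor in front of the kernel changes). The kernel derivative and the resulting expression then read:

```latex
\begin{align*}
K_\sigma(x, x') &= \exp\!\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)
  && \text{(assumed parametrization)} \\
\partial_\sigma K_\sigma(x, x') &= \frac{\|x - x'\|^2}{\sigma^3} \, K_\sigma(x, x') \\
\left. \frac{\partial R^2}{\partial \sigma} \right|_{\vec{\alpha}\ \text{fixed}}
  &= - \frac{1}{\sigma^3} \sum_{l_1, l_2} \lambda_{l_1} \lambda_{l_2}
     \, \|x_{l_1} - x_{l_2}\|^2 \, K_\sigma(x_{l_1}, x_{l_2})
\end{align*}
```

Under this assumption, and for non-negative multipliers, every term of the sum is non-negative, so the derivative is non-positive: the radius of the smallest enclosing sphere does not grow when σ increases.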
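A.3.2 Computing $\left. \frac{\partial \|w\|_{\mathcal{H}}^2}{\partial \sigma} \right|_{\vec{\alpha}\ \text{fixed}}$
A.3.3 Computing $\left. \frac{\partial \rho}{\partial \sigma} \right|_{\vec{\alpha}\ \text{fixed}}$
A.3.4 Computing $\frac{\partial \vec{\alpha}}{\partial \sigma}$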
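We define the set $\mathcal{L} = \{ l \mid 0 < \tilde{\alpha}_l < \frac{1}{L} \}$ and the vector $\vec{\alpha}_{\mathcal{L}} = [\tilde{\alpha}_{l \in \mathcal{L}}, b]^t$ (i.e. the vector
composed of those α̃'s corresponding to the φ(xl) that are on the margins and the membership
threshold b). Since $\tilde{\alpha}_{l \notin \mathcal{L}}$ is either 0 or 1/L, it is clear that:
\[ \frac{\partial \tilde{\alpha}_{l \notin \mathcal{L}}}{\partial \sigma} = 0. \]
Using the results shown in Fig. 5.7 and (5.40) we have:
\[ \underbrace{\begin{pmatrix} K_Y & Y_{\mathcal{L}} \\ Y_{\mathcal{L}}^t & 0 \end{pmatrix}}_{\mathbf{K}} \vec{\alpha}_{\mathcal{L}} = \rho \underbrace{\begin{pmatrix} \mathbf{1}_{|\mathcal{L}|} \\ 0 \end{pmatrix}}_{U} \tag{A.14} \]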
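where $\mathbf{1}_{|\mathcal{L}|}$ is the |L| × 1 matrix with unitary elements, $K_Y$ is a |L| × |L| matrix with elements
$K_{l,m \in \mathcal{L}} = y_l y_m K_\sigma(x_l, x_m)$ and $Y_{\mathcal{L}} = (y_{l \in \mathcal{L}})$ (i.e. the |L| × 1 matrix of labels corresponding
to the φ(xl) that are on the margins).
Since the matrix K is always invertible, we have:
\begin{align}
\frac{\partial \vec{\alpha}_{\mathcal{L}}}{\partial \sigma} &= \frac{\partial \left( \mathbf{K}^{-1} \rho U \right)}{\partial \sigma} \tag{A.15} \\
 &= -\rho \, \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \sigma} \mathbf{K}^{-1} U + \frac{\partial \rho}{\partial \sigma} \mathbf{K}^{-1} U \tag{A.16}
\end{align}
The derivative $\frac{\partial \mathbf{K}^{-1}}{\partial \sigma}$ is computed using:
\begin{align*}
 & \mathbf{K}^{-1} \mathbf{K} = I \\
\Rightarrow\; & \left( \partial_\sigma \mathbf{K}^{-1} \right) \mathbf{K} + \mathbf{K}^{-1} \, \partial_\sigma \mathbf{K} = 0 \\
\Rightarrow\; & \frac{\partial \mathbf{K}^{-1}}{\partial \sigma} = -\mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \sigma} \mathbf{K}^{-1}
\end{align*}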
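The identity for the derivative of the inverse matrix can be checked numerically. The sketch below compares it with a central finite difference on a toy example: for brevity it uses the 2 × 2 Gram matrix of two points at unit distance rather than the bordered matrix K of (A.14), and it assumes the Gaussian kernel form exp(−d² / (2σ²)). The identity itself holds for any invertible matrix that depends smoothly on σ. All names are hypothetical.

```python
import math


def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(p, q):
    """Product of two 2 x 2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def gram(sigma, dist_sq=1.0):
    """Toy Gram matrix of two points whose squared distance is dist_sq."""
    k = math.exp(-dist_sq / (2.0 * sigma ** 2))
    return [[1.0, k], [k, 1.0]]


def dgram(sigma, dist_sq=1.0):
    """Derivative of the toy Gram matrix with respect to sigma."""
    dk = math.exp(-dist_sq / (2.0 * sigma ** 2)) * dist_sq / sigma ** 3
    return [[0.0, dk], [dk, 0.0]]


def dinv_analytic(sigma):
    """-K^{-1} (dK/dsigma) K^{-1}, the identity derived in the text."""
    k_inv = inv2(gram(sigma))
    prod = matmul2(matmul2(k_inv, dgram(sigma)), k_inv)
    return [[-v for v in row] for row in prod]


def dinv_numeric(sigma, h=1e-6):
    """Central finite-difference approximation of dK^{-1}/dsigma."""
    hi, lo = inv2(gram(sigma + h)), inv2(gram(sigma - h))
    return [[(hi[i][j] - lo[i][j]) / (2.0 * h) for j in range(2)] for i in range(2)]


print(dinv_analytic(1.0))
print(dinv_numeric(1.0))
```

Both printed matrices should agree up to the finite-difference error, which gives a simple sanity check before the identity is used inside (A.16).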
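\[ \frac{\partial \|w\|_{\mathcal{H}}^2}{\partial \tilde{\alpha}_{l \in \mathcal{L}}} = \sum_{m \neq l \,|\, m \in \mathcal{L}} y_l y_m \tilde{\alpha}_m K_\sigma(x_l, x_m) + 2 \tilde{\alpha}_l \tag{A.17} \]
\[ \frac{\partial \|w\|_{\mathcal{H}}^2}{\partial b} = 0 \tag{A.18} \]
A.3.6 Computing $\frac{\partial \rho}{\partial \vec{\alpha}_{\mathcal{L}}}$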
Appendix B
Table B.1.
Table B.2.
Table B.3.
Table B.4.
B.1 Training without feedback sessions
Table B.5.
Table B.6.
Session 1 2 3 4 5 6
MA2 90 110 70 90 95 85
MAs
MA3 85 75 75 80 85 100
Table B.7.
Session 1 2 3 4 5 6
Table B.8.
a FN and FP stand for false negative and false positive fractions, respectively. See Chapter 6,
Section 6.5 for details.
B.2 Training with feedback sessions
Session 1 2 3 4 5 6
Table B.9. Time spent in reaching the target and experimental bit rate.
Session 1 2 3 4 5 6
Table B.10.
Session 1 2 3 4 5 6
Table B.11.
Session 1 2 3 4 5 6
Exp. bit rate [bits/min] 18.07 18.78 19.23 19.97 21.24 22.26
Table B.12. Time spent in reaching the target and experimental bit rate.
Session 1 2 3 4 5 6
Table B.13.
Session 1 2 3 4 5 6
Table B.14.
Session 1 2 3 4 5 6
Table B.15. Time spent in reaching the target and experimental bit rate.
Session 1 2 3 4 5 6
Table B.16.
Session 1 2 3 4 5 6
Table B.17.
Session 1 2 3 4 5 6
Exp. bit rate [bits/min] 18.43 21.13 21.58 23.17 25.16 26.79
Table B.18. Time spent in reaching the target and experimental bit rate.
Session 1 2 3 4 5 6
Table B.19.
Session 1 2 3 4 5 6
Table B.20.
Session 1 2 3 4 5 6
Exp. bit rate [bits/min] 27.78 28.64 29.70 32.88 33.71 35.50
Table B.21. Time spent in reaching the target and experimental bit rate.
Session 1 2 3 4 5 6
MA4 48 60 42 48 60 42
Table B.22.
Session 1 2 3 4 5 6
Table B.23.
Session 1 2 3 4 5 6
Table B.24. Time spent in reaching the target and experimental bit rate.
Bibliography
[2] H. Akaike. Fitting autoregressions for prediction. Annals of the Institute of Statistical
Mathematics, 21:243–247, 1969.
[3] H. Akaike. A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19(6):716–723, 1974.
[4] M. Akay. Time Frequency and Wavelets in Biomedical Signal Processing. IEEE Press
Series on Biomedical Engineering, 1998.
[6] C.W. Anderson, E.A. Stolz, and S. Shamsunder. Multivariate autoregressive models
for classification of spontaneous electroencephalographic signals during mental tasks.
IEEE Transactions on Biomedical Engineering, 45:277–286, 1998.
[8] A.B. Barreto, S.D. Scargle, and M. Adjouadi. A practical EMG-based human-computer
interface for users with motor disabilities. Journal of Rehabilitation Research and
Development, 37(1):53–63, 2000.
[9] A.R. Barron, J. Rissanen, and B. Yu. The MDL principle in modeling and coding.
IEEE Transactions on Information Theory - Special issue commemorating 50 years
of information theory, 44:2743–2760, 1998.
[10] P.J. Bartlett, B. Schölkopf, D. Schuurmans, and A.J. Smola, editors. Advances in
Large-Margin Classifiers (Neural Information Processing). MIT Press, 2000.
[11] J.D. Bayliss. A Flexible Brain-Computer Interface. PhD thesis, Department of Com-
puter Science University of Rochester, 2001.
[12] J.D. Bayliss. Use of the Evoked Potential P3 Component for Control in a Virtual
Apartment. IEEE Transactions on Neural Systems and Rehabilitation Engineering,
11(2):113–116, June 2003.
[14] J.S. Bendat and G.G. Piersol. Random Data Analysis and Measurement Procedures.
Wiley-Interscience, 2000.
[15] H. Berger. Über das Elektrenkephalogramm des Menschen. Arch. Psychiatr. Nervenkr.,
87:527–570, 1929.
[19] G.E. Birch, S.G. Mason, and J.F. Borisoff. Current trends in brain-computer interface
research at the Neil Squire foundation. IEEE Transactions on Neural Systems and
Rehabilitation Engineering, 11(2):123–126, 2003.
[20] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,
1995.
[21] B. Blankertz, G. Curio, and K.-R. Müller. Advances in Neural Information Processing
Systems (NIPS 01), volume 14, chapter Classifying single trial EEG: Towards brain-
computer interfacing. MIT Press, 2002.
[22] P.J. Brockwell and R.A. Davis. Time Series: Theory and Methods. Springer-Verlag,
second edition, 1996.
[23] C.J.C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition.
Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[24] C.J.C. Burges. Uniqueness of the SVM Solution. In Proceedings of the Twelfth
Conference on Neural Information Processing Systems. MIT Press, 1999.
[25] G.L. Calhoun, G.R. McMillan, D.F. Ingle, and M.S. Middendorf. EEG-based control:
Neurologic mechanisms of steady-state self-regulation. Technical Report AL/CF-TR-
1997-0047, Wright-Patterson Air Force Base, 1997.
[27] Z. Chen and S. Haykin. On different facets of regularization theory. Neural Compu-
tation, 14(12):2791–2846, 2002.
[28] M. Cheng, X. Gao, S. Gao, and D. Xu. Design and Implementation of a Brain-
Computer Interface With High Transfer Rates. IEEE Transactions on Biomedical
Engineering, 49(10):1181–1186, 2002.
[29] A. Choppin. EEG-Based Human Interface for Disabled Individuals: Emotion Ex-
pression with Neural Networks. Master’s thesis, Tokyo Institute of Technology -
Department of Information Processing, 2000.
[31] R.J. Croft and R.J. Barry. Removal of ocular artifact from the EEG: a review. Clinical
Neurophysiology, 30:5–19, 2000.
[32] N.E. Crone, D.L. Miglioretti, B. Gordon, J.M. Sieracki, M.T. Wilson, S. Uematsu,
and R.P. Lesser. Functional mapping of human sensorimotor cortex with electro-
corticographic spectral analysis. I. Alpha and beta event-related desynchronization.
Brain, 121:2271–2299, 1998.
[33] F.H. Lopes da Silva. Neural mechanisms underlying brain waves: from neural mem-
branes to networks. Electroencephalography and Clinical Neurophysiology, 79:81–93,
1991.
[36] A. Delorme and S. Makeig. EEG Changes Accompanying Learned Regulation of 12-hz
EEG Activity. IEEE Transactions on Neural Systems and Rehabilitation Engineering,
11(2):133–137, June 2003.
[39] E. Donchin and D.B. Smith. The contingent negative variation and the late positive
wave of the average evoked potential. Electroencephalography and Clinical Neurophys-
iology, 29:201–203, 1970.
[40] E. Donchin, K.M. Spencer, and R.S. Wijesinghe. The mental prosthesis: Assessing the
speed of a P300-based brain-computer interface. IEEE Transactions on Rehabilitation
Engineering, 8:174–179, 2000.
[41] G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller. Advances in Neural Inf.
Proc. Systems (NIPS 02), volume 15, chapter Combining features for BCI. MIT
Press, 2003.
[42] J. Durbin. The Fitting of Time-series Models. Review of the International Institute
of Statistics, 28:233–243, 1960.
[43] P.J. Durka. Time-frequency analyses of EEG. PhD thesis, Institute of Experimental
Physics, Department of Physics, Warsaw University, August 1996.
[44] J.R. Evans and A. Abarbanel. Introduction to Quantitative EEG and Neurofeedback.
Academic Press, 1999.
[45] E.V. Evarts. Pyramidal tract activity associated with a conditioned hand movement
in the monkey. Journal of Neurophysiology, (29):293–301, 1966.
[46] L.A. Farwell and E. Donchin. Talking off the top of your head: toward a mental prosthesis
utilizing event-related brain potentials. Electroencephalography and Clinical
Neurophysiology, 70:510–523, 1988.
[47] E.E. Fetz and D.V. Finocchio. Correlations between activity of motor cortex cells
and arm muscles during operantly conditioned response patterns. Experimental Brain
Research, 3(23):217–240, 1975.
[48] F. Findji, P. Catani, and C. Liard. Topographical distribution of delta rhythms dur-
ing sleep: Evolution with age. Electroencephalography and Clinical Neurophysiology,
51(6):659–665, 1981.
[50] W.J. Freeman. Dynamics of sensory and cognitive processing by the brain, chapter
Nonlinear neural dynamics in olfaction as a model for cognition, pages 19–28. Springer
Verlag, 1988.
[51] W.J. Freeman. Induced rhythms in the brain, chapter Predictions on neocortical
dynamics derived from studies in paleocortex, pages 183–199. Springer Verlag, 1992.
[52] J. Freudiger, G.N. Garcia, T. Koenig, and T. Ebrahimi. Brain states analysis for
direct brain-computer communication. Technical report, Swiss Federal Institute of
Technology EPFL, July 2003.
[54] D. Galin and R.F. Ornstein. Lateral specialization of cognitive mode: An EEG study.
Psychophysiology, 9:412–418, 1972.
[55] D. Galin and R.F. Ornstein. Human Behavior and Brain Function, chapter Hemi-
spheric Specialization and the Duality of Consciousness, pages 3–23. Thomas Books,
1975.
[57] G.N. Garcia and T. Ebrahimi. Time-Frequency-Space Kernel for Single EEG-Trial
Classification. In Proceedings of the NORSIG conference, 2002.
[58] G.N. Garcia, T. Ebrahimi, and J.-M. Vesin. Classification of EEG signals in the
ambiguity domain for brain computer interface applications. In Proceedings of the
IEEE International Conference on Digital Signal Processing (DSP), volume 1, pages
301–305, July 2002.
[59] G.N. Garcia, T. Ebrahimi, and J.-M. Vesin. Correlative exploration of EEG signals
for direct brain-computer communication. In Proceedings of the IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP), volume 5, pages
816–819, 2003.
[60] G.N. Garcia, T. Ebrahimi, and J.-M. Vesin. Joint Time-Frequency-Space Classifi-
cation of EEG in a Brain-Computer Interface Application. EURASIP Journal on
Applied Signal Processing, 2003(7):713–729, 2003.
[61] G.N. Garcia, T. Ebrahimi, and J.-M. Vesin. Support vector EEG classification in
the Fourier and time-frequency correlation domains. In Proceedings of the First IEEE
EMBS Conference on Neural Engineering, pages 591–594, March 2003.
[62] G.N. Garcia, T. Ebrahimi, J.-M. Vesin, and A. Villca. Direct Brain-Computer Com-
munication with User Rewarding Mechanism. In Proceedings of the IEEE Interna-
tional Symposium in Information Theory (ISIT), pages 221–221, July 2003.
[63] G.N. Garcia, U. Hoffmann, T. Ebrahimi, and J.-M. Vesin. Direct Brain-Computer
Communication through EEG Signals. To appear in IEEE EMBS Book Series on
Neural Engineering, 2004.
[64] D. Garrett, D.A. Peterson, C.W. Anderson, and M.H. Thaut. Comparison of Linear,
Nonlinear, and Feature Selection Methods for EEG Signal Classification. IEEE
Transactions on Neural Systems and Rehabilitation Engineering, 11(2):141–144, June
2003.
[66] A.A. Glover, M.C. Onofrj, M.F. Ghilardi, and I. Bodis-Wollner. P300-like potentials
in the normal monkey using classical conditioning and the auditory oddball paradigm.
Electroencephalography and Clinical Neurophysiology, 65:231–235, 1986.
[67] R.M. Golden. Digital filter synthesis by sampled-data transformation. IEEE Transactions
on Audio and Electroacoustics, AU-16:321–329, 1968.
[68] I.I. Goncharova, D.J. McFarland, T.M. Vaughan, and J.R. Wolpaw. EEG-based
brain-computer interface (BCI) communication: scalp topography of EMG contamination.
Soc Neurosci Abstr, 26:1229, 2000.
[71] S. Gupta and H. Singh. Preprocessing EEG signals for direct human-system interface.
In IEEE International Joint Symposia on Intelligence and Systems (IJSIS), pages 32–
37, November 1996.
[72] C. Harris, X. Hong, and Q. Gan. Adaptive Modelling Estimation and Fusion from
Data. Springer-Verlag, 2002.
[74] K. Hirano, S. Nishimura, and S.K. Mitra. Design of Digital Notch Filters. IEEE
Transactions on Circuits and Systems, 22(7):964–970, 1974.
[75] J.E. Huggins, S.P. Levine, S.L. BeMent, R.K. Kushwaha, L.A. Schuh, M.M. Rohde,
and D.A. Ross. Detection of event related potentials for development of a direct brain
interface. Journal of Clinical Neurophysiology, 16:448–455, 1999.
[76] D.R. Humphrey. Representation of movements and muscles within the primate
precentral motor cortex: historical and current perspectives. Federation Proceedings,
12(45):2687–2699, 1986.
[79] H.H. Jasper. The ten-twenty electrode system of the international federation. Elec-
troencephalography and Clinical Neurophysiology, 10(1):371–375, 1958.
[80] H.H. Jasper and W. Penfield. Electrocorticograms in man: effect of the voluntary
movement upon the electrical activity of the precentral gyrus. Arch. Psychiat. Z.
Neurol., 183:163–174, 1949.
[84] J. Kamiya. Conditioned discrimination of the EEG alpha rhythm in humans. Paper
presented at the Western Psychological Association, 1962.
[85] S.M. Kay. Modern Spectral Estimation: Theory and Application. Prentice-Hall, 1988.
[86] Z.A. Keirn and J.I. Aunon. Man-Machine Communications Through Brain-Wave
Processing. IEEE Engineering in Medicine and Biology Magazine, 9(1):55–57, 1990.
[87] S. Kelly, D. Burke, P. de Chazal, and R. Reilly. Parametric Models and Spectral
Analysis for Classification in Brain-Computer Interfaces. In Proceedings of the IEEE
International Conference on Digital Signal Processing, 2002.
[88] P.R. Kennedy. The cone electrode: a long-term electrode that records from neurites
grown onto its recording surface. Journal of Neuroscience Methods, (29):181–193,
1989.
[89] P.R. Kennedy and K.D. Adams. A decision tree for brain-computer interface devices.
IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11(2):148–150,
2003.
[90] P.R. Kennedy and R.A. Bakay. Restoration of neural output from a paralyzed patient
by a direct brain connection. NeuroReport, (9):1707–1711, 1998.
[91] P.R. Kennedy, R.A.E. Bakay, M.M. Moore, K. Adams, and J. Goldwaithe. Direct
control of a computer from the human central nervous system. IEEE Transactions
on Rehabilitation Engineering, (8):198–202, 2000.
[92] G.S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal
of Mathematical Analysis and Applications, (33):82–95, 1971.
[93] J. Kivinen, A.J. Smola, and R.C. Williamson. Online Learning with Kernels. Avail-
able at http://citeseer.nj.nec.com/kivinen02online.html, 2002.
[94] T. Koenig, K. Kochi, and D. Lehmann. Event-related electric microstates of the brain
differ between words with visual and abstract meaning. Electroencephalography and
Clinical Neurophysiology, 106(6):535–546, 1998.
[95] H.W. Kuhn and A.W. Tucker. Nonlinear programming. In Proceedings of the 2nd
Berkeley Symposium on Mathematical Statistics and Probability, pages 481–492,
1951.
[96] J.P. Lachaux, E. Rodriguez, J. Martinerie, and F.J. Varela. Measuring Phase Syn-
chrony in Brain Signals. Human Brain Mapping, 8:194–208, 1999.
[97] N. Levinson. The Wiener RMS (Root Mean Square) Error Criterion in Filter Design
and Prediction. Journal of Mathematical Physics, 25:261–278, 1947.
[99] S. Makeig, T.-P. Jung, A.J. Bell, D. Ghahremani, and T.J. Sejnowski. Blind Sep-
aration of Auditory Event-related Brain Responses into Independent Components.
Proceedings of the National Academy of Sciences of the United States of America,
94:10979–10984, 1997.
[100] J. Makhoul. Linear Prediction: A tutorial review. Proceedings of the IEEE, 63:561–
580, 1975.
[101] S.G. Mason and G.E. Birch. A Brain-Controlled Switch for Asynchronous Control
Applications. IEEE Transactions on Biomedical Engineering, 47(10):1297–1307, 2000.
[102] D.J. McFarland, L.M. McCane, S.V. David, and J.R. Wolpaw. Spatial filter selection
for EEG-based communication. Electroencephalography and Clinical Neurophysiology,
103:386–394, 1997.
[103] D.J. McFarland, L.M. McCane, and J.R. Wolpaw. EEG-Based Communication and
Control: Short-Term Role of Feedback. IEEE Transactions on Rehabilitation Engineering,
6(1):7–11, 1998.
[104] D.J. McFarland, W.A. Sarnacki, and J.R. Wolpaw. Brain-computer interface (BCI)
operation: optimizing information transfer rates. Biological Psychology, 63:237–251,
2003.
[105] M.S. Middendorf, G.R. McMillan, G.L. Calhoun, and K.S. Jones. Brain-Computer
Interfaces Based on the Steady-State Visual-Evoked Response. IEEE Transactions
on Rehabilitation Engineering, 8:211–214, 2000.
[106] J.d.R. Millan. A Local Neural Classifier for the Recognition of EEG Patterns Asso-
ciated to Mental Tasks. IEEE Transactions on Neural Networks, 13:678–686, 2002.
[107] J.d.R. Millan and J. Mourino. Asynchronous BCI and local neural classifiers: an
overview of the adaptive brain interface project. IEEE Transactions on Neural Sys-
tems and Rehabilitation Engineering, 11(2):159–161, 2003.
[111] F. Mormann, K. Lehnertz, P. David, and C.E. Elger. Mean Phase Coherence as
a Measure for Phase Synchronization and its Application to the EEG of Epilepsy
Patients. Physica D, 144:358–369, 2000.
[112] K.-R. Müller, C.W. Anderson, and G.E. Birch. Linear and Nonlinear Methods for
Brain-Computer Interfaces. IEEE Transactions on Neural Systems and Rehabilitation
Engineering, 11(2):165–169, 2003.
[114] J. Muthuswamy and N.V. Thakor. Spectral analysis methods for neurological signals.
Journal of Neuroscience Methods, 83:1–14, 1998.
[116] A.R. Nikolaev and A.P. Anokhin. EEG frequency ranges during perception and mental
rotation of two- and three-dimensional objects. Neuroscience and Behavioral Physiology,
6(28):670–677, 1998.
[117] P.L. Nunez, R.B. Silberstein, Z. Shi, M.R. Carpenter, R. Srinivasan, D.M.
Tucker, S.M. Doran, P.J. Cadusch, and R.S. Wijesinghe. EEG coherency II:
experimental comparisons of multiple measures. Electroencephalography and Clinical
Neurophysiology, 110:469–486, 1999.
[118] P.L. Nunez, R. Srinivasan, A.F. Westdorp, R.S. Wijesinghe, D.M. Tucker, R.B.
Silberstein, and P.J. Cadusch. EEG coherency I: statistics, reference electrode,
volume conduction, Laplacians, cortical imaging, and interpretation at multiple scales.
Electroencephalography and Clinical Neurophysiology, 103:499–515, 1997.
[121] D.A. Overton and C. Shagass. Distribution of eye movement and eye blink potentials
over the scalp. Electroencephalography and Clinical Neurophysiology, 27:546, 1969.
[122] R.D. Pascual-Marqui, C.M. Michel, and D. Lehmann. Segmentation of brain electrical
activity into microstates: model estimation and validation. IEEE Transactions on
Biomedical Engineering, 42(7):658–665, 1995.
[123] W.D. Penny, S.J. Roberts, E.A. Curran, and M.J. Stokes. EEG-Based Commu-
nication: A Pattern Recognition Approach. IEEE Transactions on Rehabilitation
Engineering, 8(2):214–215, 2000.
[124] J. Perelmouter and N. Birbaumer. A Binary Spelling Interface with Random Errors.
IEEE Transactions on Rehabilitation Engineering, 8(2):227–232, 2000.
[126] G. Pfurtscheller and C. Neuper. Motor Imagery and Direct Brain-Computer
Communication. Proceedings of the IEEE, 89:1123–1134, 2001.
[130] D.A. Pierre. Optimization Theory with Applications. Dover Pubns, 1987.
[132] J.A. Pineda, B.Z. Allison, and A. Vankov. The Effects of Self-Movement, Observation,
and Imagination on Rhythms and Readiness Potentials (RPs): Toward a Brain-Computer
Interface (BCI). IEEE Transactions on Neural Systems and Rehabilitation
Engineering, 8(2):219–222, 2000.
[134] M.B. Priestley. Spectral Analysis and Time Series. Academic Press, 1981.
[135] J.G. Proakis and D. Manolakis. Digital Signal Processing: Principles, Algorithms and
Applications. Prentice Hall, 1995.
[136] P. Stoica and R. Moses. Introduction to Spectral Analysis. Prentice Hall, 1988.
[144] V.S. Rotenberg and V.V. Arshavsky. Right and left brain hemispheres activation
in the representatives of two different cultures. Homeostasis in Health & Disease,
38(2):49–57, 1997.
[147] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[149] M.D. Serruya, N.G. Hatsopoulos, L. Paninski, M.R. Fellows, and J.P. Donoghue.
Instant neural control of a movement signal. Nature, 416:141–142, 2002.
[152] W. Singer. Synchronization of cortical activity and its putative role in information
processing and learning. Annual Review of Physiology, 55:349–374, 1993.
[155] S. Sutton, M. Braren, J. Zubin, and E.R. John. Evoked correlates of stimulus uncer-
tainty. Science, 150:1187–1188, 1965.
[157] D.M.J. Tax. One-class classification. PhD thesis, Technische Universiteit Delft, 2001.
[158] J.J. Tecce, J. Gips, C.P. Olivieri, L.J. Pok, and M.R. Consiglio. Eye movement control
of computer functions. International Journal of Psychophysiology, 29:319–325, 1998.
[159] S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system.
Nature, pages 520–522, 1996.
[161] L.J. Trejo, K.R. Wheeler, C.C. Jorgensen, R. Rosipal, S.T. Clanton, B. Matthews,
A.D. Hibbs, R. Matthews, and M. Krupka. Multimodal Neuroelectric Interface De-
velopment. IEEE Transactions on Neural Systems and Rehabilitation Engineering,
11:199–204, 2003.
[162] D.M. Tumey, P.E. Morton, D.F. Ingle, C.W. Downey, and J.H. Schnurer. Neural
Network Classification of EEG using Chaotic Preprocessing and Phase Space Recon-
struction. In Proceedings of the Seventeenth IEEE Annual Northeast Bioengineering
Conference, 1991.
[163] W.R. Uttal. The War Between Mentalism and Behaviorism: On the Accessibility of
Mental Processes. NJ: Erlbaum, 1999.
[164] M. van de Velde, G. van Erp, and P. J. M. Cluitmans. Detection of muscle artefact in
the normal human awake EEG. Electroencephalography and Clinical Neurophysiology,
107(2):149–158, April 1998.
[165] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[167] T.M. Vaughan, W.J. Heetderks, L.J. Trejo, and W.Z. Rymer. Brain-Computer
Interface Technology: A Review of the Second International Meeting. IEEE Transactions
on Neural Systems and Rehabilitation Engineering, 11(2):94–109, June 2003.
[170] G. Walker. On Periodicity in Series of Related Terms. Proceedings of the Royal Society
(London) A, 131:518–532, 1931.
[171] W.G. Walter, R. Cooper, V.J. Aldridge, W.C. McCallum, and A.L. Winter.
Contingent negative variation: an electric sign of sensorimotor association and expectancy
in the human brain. Nature, 203:380–384, 1964.
[172] P.D. Welch. The Use of Fast Fourier Transform for the Estimation of Power Spectra:
A Method Based on Time Averaging Over Short, Modified Periodograms. IEEE
Transactions on Audio and Electroacoustics, AU-15:70–73, 1967.
[173] P. Whittle. On the fitting of multivariate autoregressions, and the approximate
canonical factorization of a spectral density matrix. Biometrika, 50:129–134, 1963.
[175] J.R. Wolpaw, N. Birbaumer, D.J. McFarland, G. Pfurtscheller, and T.M. Vaughan.
Brain-computer interfaces for communication and control. Clinical Neurophysiology,
113:767–791, 2002.
[176] J.R. Wolpaw, D.J. McFarland, and T.M. Vaughan. Brain-computer interface research
at the Wadsworth Center. IEEE Transactions on Neural Systems and Rehabilitation
Engineering, 8:222–226, 2000.
[177] J.R. Wolpaw, H. Ramoser, D.J. McFarland, and G. Pfurtscheller. EEG-based
communication: improved accuracy by response verification. IEEE Transactions on
Rehabilitation Engineering, 6(3):326–333, 1998.
[179] A.R. Wyler and K.J. Burchiel. Factors influencing accuracy of operant conditioning
of tract neurons in monkey. Brain Research, 152:418–421, 1978.
[180] A.R. Wyler, K.J. Burchiel, and S.A. Robbins. Operant control of precentral neurons
in monkeys: evidence against open loop control. Brain Research, 171:29–39, 1979.
[182] G.U. Yule. On a Method for Investigating Periodicities in Disturbed Series with
Special Reference to Wolfer’s Sunspot Numbers. Philosophical Transactions of the
Royal Society of London Series A-Mathematical and Physical Sciences, 226:267–298,
1927.