ELM Tutorial
A Brief Tutorial
Outline
http://en.wikipedia.org/wiki/Perceptron

Perceptron and AI Winter
Three Waves of Machine Learning
• 1950s-1980s: Warm up
  – Features: computers not powerful, no efficient algorithms, not enough data
  – Situation: Chinese people already had this good dream since the inception of computers and called computers "Electronic Brains (电脑)"
• 1980s-2010: Research driven
  – Features: computers powerful enough, efficient algorithms developed, but not enough data in many cases
  – Situation: driven more by researchers than by industries
• 2010 onward
  – Features: computers everywhere and very powerful, powerful and smart computing, smart sensors/devices, huge data coming, efficient algorithms under way
  – Situation: whether you admit it or not, we have to rely on machine learning from now on
Rethink Artificial Intelligence and Machine Learning
• Rosenblatt's perceptron proposed in 1950s
• AI Winter (1970s)
• Neural networks reviving: almost all Deep Learning (CNN, BP, etc) techniques proposed in 1980s
• SVM proposed in 1990s
• ELMs born in 2004
• Deep Learning reviving in 2004 due to high performance of computing
• ELMs' direct biological evidence found in 2012
Necessary Conditions of Machine Learning Era
• Rich dynamic data
• Powerful computing environment
• Efficient learning algorithms
[Figure: SLFN with L hidden nodes $(a_i, b_i)$, $i = 1, \cdots, L$, and input $x$]
Output of RBF hidden nodes: $G(a_i, b_i, x) = g(b_i \|x - a_i\|)$
Feedforward Neural Networks
• Mathematical Model
– Approximation capability [Leshno 1993, Park and Sandberg 1991]: Any continuous target function $f(x)$ can be approximated by SLFNs with adjustable hidden nodes. In other words, given any small positive value $\epsilon$, for SLFNs with a large enough number of hidden nodes ($L$) we have $\|f_L(x) - f(x)\| < \epsilon$.
– Classification capability [Huang, et al 2000]: As long as SLFNs can approximate any continuous target function $f(x)$, such SLFNs can differentiate any disjoint regions.
M. Leshno, et al., “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function,” Neural
Networks, vol. 6, pp. 861-867, 1993.
J. Park and I. W. Sandberg, “Universal approximation using radial-basis-function networks,” Neural Computation, vol. 3, pp. 246-257,
1991.
G.-B. Huang, et al, “Classification ability of single hidden layer feedforward neural networks,” IEEE Trans. Neural Networks, vol. 11,
no. 3, pp. 799–801, May 2000.
Feedforward Neural Networks
• Learning Issue
– Conventional theories only resolve the existence issue; they do not tackle the learning issue at all.
– In real applications, the target function $f$ is usually unknown. One wishes that the unknown $f$ could be approximated by SLFNs appropriately.
Feedforward Neural Networks
• Learning Methods
– Many learning methods, mainly based on gradient-descent / iterative approaches, have been developed over the past three decades.
• Back-Propagation (BP) [Rumelhart 1986] and its variants are the most popular.
– Least-squares (LS) solution for the RBF network, with a single impact factor for all hidden nodes [Broomhead and Lowe 1988]
– QuickNet [White 1988] and the random vector functional link network (RVFL) [Igelnik and Pao 1995]
– Support vector machines and their variants [Cortes and Vapnik 1995]
– Deep learning: dates back to the 1960s, with a resurgence in the mid-2000s [wiki 2015]
Support Vector Machine – an
Alternative Solution of SLFN
SVM optimization formula:
Minimize: $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$
subject to: $t_i\,(w\cdot\phi(x_i) + b) \ge 1 - \xi_i,\ \forall i$
            $\xi_i \ge 0,\ \forall i$

The decision function of SVM and LS-SVM is:
$f(x) = \mathrm{sign}\Big(\sum_{s=1}^{N_s}\alpha_s t_s K(x, x_s) + b\Big)$
Feedforward Neural Networks
Research in Neural Networks Stuck …?
Conventional Learning Methods | Biological Learning
Very sensitive to network size | Stable in a wide range (tens to thousands of neurons in each module)
Difficult for parallel implementation | Parallel implementation
Difficult for hardware implementation | "Biological" implementation
Very sensitive to user-specified parameters | Free of user-specified parameters
Different network types for different types of applications | One module possibly for several types of applications
Time consuming in each learning point | Fast in micro learning point
Research in Neural Networks Stuck …?
• Reasons
– Based on the conventional existence theories:
• Since hidden nodes are important and critical, we need to find some
way to adjust hidden nodes.
• Learning focuses on hidden nodes.
• Learning is tremendously inefficient.
– Intensive research: many departments/groups in almost every university/research institution have devoted huge manpower to looking for so-called "appropriate" (actually still very basic) learning methods over the past 30 years.
• Question
– Is a free lunch really impossible?
– The answer is "seemingly far away, actually close at hand and right under our nose" ("远在天边, 近在眼前")
Fundamental Problems to Be Resolved
by Extreme Learning Machines (ELM)
• Do we really need so many different types of learning
algorithms for so many different types of networks?
– different types of SLFNs
• sigmoid networks
• RBF networks
• polynomial networks
• complex (domain) networks
• Fourier series
• wavelet networks, etc
– multi-layer architectures
• Do we really need to tune a wide range of hidden neuron types, including biological neurons (whose modeling may even be unknown), in learning?
Extreme Learning Machines (ELM)
[Figure: ELM network with d input nodes $x_j$, L random hidden nodes $(a_i, b_i)$, and problem-based optimization constraints at the output for feature learning, clustering, regression and classification]

The hidden layer output function (hidden layer mapping, ELM feature space):
$h(x) = [G(a_1, b_1, x), \cdots, G(a_L, b_L, x)]$
The output functions of hidden nodes can be, but are not limited to, sigmoid, RBF, Fourier series, wavelets, etc.

– ELM theory not only proves the existence of the networks but also provides learning solutions.
– All these hidden node parameters can be randomly generated without training data.
– That is, for any continuous target function $f$ and any randomly generated sequence $\{(a_i, b_i)\}_{i=1}^{L}$, $\lim_{L\to\infty}\|f(x) - f_L(x)\| = \lim_{L\to\infty}\big\|f(x) - \sum_{i=1}^{L}\beta_i\,G(a_i, b_i, x)\big\| = 0$ holds with probability one if $\beta_i$ is chosen to minimize $\|f - f_i\|,\ \forall i$. [Huang, et al 2006]

G.-B. Huang, et al., "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.
G.-B. Huang and L. Chen, "Convex incremental extreme learning machine," Neurocomputing, vol. 70, pp. 3056-3062, 2007.
O. Barak, et al., "The importance of mixed selectivity in complex cognitive tasks," Nature, vol. 497, pp. 585-590, 2013.
M. Rigotti, et al., "The sparseness of mixed selectivity neurons controls the generalization-discrimination trade-off," Journal of Neuroscience, vol. 33, no. 9, pp. 3844-3856, 2013.
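To make the hidden-layer mapping concrete, here is a minimal NumPy sketch of an ELM feature map $h(x)$ built from randomly generated hidden nodes; the specific sigmoid and Gaussian node forms, the uniform sampling range, and all names are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

def random_hidden_layer(d, L, node_type="sigmoid", rng=None):
    """Randomly generate L hidden nodes (a_i, b_i) without seeing any training data."""
    rng = np.random.default_rng(rng)
    A = rng.uniform(-1.0, 1.0, size=(L, d))   # a_i: input weights / RBF centres (assumed range)
    b = rng.uniform(-1.0, 1.0, size=L)        # b_i: biases / impact factors

    def h(X):
        X = np.atleast_2d(X)
        if node_type == "sigmoid":            # G(a, b, x) = g(a . x + b)
            return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))
        elif node_type == "rbf":              # Gaussian node, an illustrative choice
            sq = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
            return np.exp(-np.abs(b) * sq)
        raise ValueError(node_type)
    return h

# h(x) = [G(a_1, b_1, x), ..., G(a_L, b_L, x)] is the ELM feature space
h = random_hidden_layer(d=3, L=20, node_type="sigmoid", rng=0)
H = h(np.random.randn(5, 3))   # 5 samples mapped into the 20-dimensional ELM feature space
print(H.shape)                 # (5, 20)
```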
Extreme Learning Machines (ELM)
• Essence of ELM
– The hidden layer need not be tuned.
• "Randomness" is just one of ELM's implementations, but not all of it.
– It satisfies both ridge regression theory [Hoerl and Kennard 1970] and neural network generalization theory [Bartlett 1998].
– It fills the gap and builds bridges among neural networks, SVM, random projection, Fourier series, matrix theories, linear systems, etc.
Basic ELM – an L2 Norm Solution
• Salient Features
– “Simple Math is Enough.” ELM is a simple tuning-free three-step
algorithm.
– The learning speed of ELM is extremely fast.
– Unlike conventional existence theories, the hidden node parameters
are not only independent of the training data but also of each other.
Although hidden nodes are important and critical, they need not
be tuned.
– Unlike conventional learning methods which MUST see the
training data before generating the hidden node parameters, ELM
could generate the hidden node parameters before seeing the training
data.
– Homogeneous architectures for compression, feature learning, clustering, regression and classification.
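As an illustration of the three-step recipe above, here is a minimal NumPy sketch of basic ELM training and prediction (random hidden layer, then a single least-squares / pseudo-inverse solve for the output weights); the function names, the sigmoid node type and the toy data are assumptions for illustration, not the exact setup of the slides.

```python
import numpy as np

def elm_train(X, T, L, rng=None):
    """Basic ELM: 1) random hidden nodes, 2) hidden-layer output matrix H, 3) beta = pinv(H) T."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(L, d))          # step 1: random input weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)                #         and biases b_i (never tuned)
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))          # step 2: hidden-layer output matrix (N x L)
    beta = np.linalg.pinv(H) @ T                      # step 3: least-squares output weights (L x m)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))
    return H @ beta

# toy regression: learn y = sin(x) on [-3, 3]
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X)
A, b, beta = elm_train(X, T, L=50, rng=0)
print(np.mean((elm_predict(X, A, b, beta) - T) ** 2))  # small training MSE
```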
Extreme Learning Machines (ELM)
Minimize: $\|\beta\|^2$ and $C\sum_{i=1}^{N}\|\xi_i\|^2$
subject to: $h(x_i)\,\beta = t_i^{T} - \xi_i^{T},\ \forall i$
Extreme Learning Machines (ELM)
$\beta = H^T\Big(\frac{I}{C} + HH^T\Big)^{-1}T$ and $f(x) = h(x)\,\beta = h(x)\,H^T\Big(\frac{I}{C} + HH^T\Big)^{-1}T$

– Kernel based (if $h(x)$ is unknown):
$f(x) = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^{T} \Big(\frac{I}{C} + \Omega_{ELM}\Big)^{-1} T$,
where $\Omega_{ELM\,i,j} = K(x_i, x_j) = h(x_i)\cdot h(x_j)$
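The closed-form solutions above translate directly into a few lines of NumPy; this is a hedged sketch of the regularized solution $\beta = H^T(I/C + HH^T)^{-1}T$ and of the kernel variant with $\Omega_{ELM\,i,j} = K(x_i, x_j)$, assuming a Gaussian kernel purely for illustration.

```python
import numpy as np

def regularized_elm_output_weights(H, T, C=1.0):
    """beta = H^T (I/C + H H^T)^(-1) T  (convenient when N <= L)."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)

def kernel_elm(X_train, T, X_test, C=1.0, gamma=1.0):
    """Kernel ELM: f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + Omega)^(-1) T, Gaussian K assumed."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * sq)
    Omega = gram(X_train, X_train)                      # Omega_ELM[i, j] = K(x_i, x_j)
    alpha = np.linalg.solve(np.eye(len(X_train)) / C + Omega, T)
    return gram(X_test, X_train) @ alpha                # predictions for X_test

X = np.linspace(-3, 3, 100).reshape(-1, 1)
T = np.sin(X)
print(np.mean((kernel_elm(X, T, X, C=100.0, gamma=2.0) - T) ** 2))  # small training error
```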
Image Super-Resolution by ELM
From top to bottom: super-resolution at 2x and 4x. State-of-the-art methods: iterative curve based interpolation (ICBI), kernel regression based method (KR), compressive sensing based sparse representation method (SR). [An and Bhanu 2012]
Automatic Object Recognition
Situation of the wind measuring towers in Spain and within the eight wind farms. Wind speed prediction in tower 6 of the
considered wind farm in Spain obtained by the ELM network (prediction using data from 7 towers). (a) Best prediction
obtained and (b) worst prediction obtained. [Saavedra-Moreno, et al, 2013]
Electricity Price Forecasting
Average results of market clearing prices (MCP) forecast by ELM in winter: trading in the Australian national electricity market (NEM) is based on a 30-min trading interval. Generators submit their offers every 5 min each day. The dispatch price is determined every 5 min, and 6 dispatch prices are averaged every half-hour to determine the regional MCPs. In order to assist the decision-making process for generators, a total of 48 MCPs need to be predicted at the same time for the coming trading day. [Chen, et al, 2012]
Remote Control of a Robotic Hand
• Offline classification of eight wrist motions using linear support vector machines with little training time (under 10 minutes).
• This study shows a human could control the remote robot hand in real time using his or her sEMG signals, with less than 50 seconds of recorded training data, with ELM. [Lee, et al 2011]
Human Action Recognition
[Minhas, et al 2012]
3D Shape Segmentation and Labelling
[Xie, et al 2014]
Constraints of BP and SVM Theory
Essential Considerations of ELM
• High Accuracy
• Least User Intervention
• Real-Time Learning (in seconds, milliseconds, even microseconds)
ELM for Threshold Networks
ELM for Complex Networks
• Circular functions: $\tan(z)$, $\sin(z)$
• Hyperbolic functions: $\tanh(z)$, $\sinh(z)$

Compared with ESN, ELM reduces the error rate by a factor of 1000 or more.
Why SVM / LS-SVM Are
Suboptimal
Optimization Constraints of ELM and LS-SVM
ELM (equality constraints):
Minimize: $L_{P_{ELM}} = \frac{1}{2}\|\beta\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2$
subject to: $h(x_i)\,\beta = t_i - \xi_i,\ \forall i$
– The corresponding dual optimization problem:
Minimize: $L_{D_{ELM}} = \frac{1}{2}\|\beta\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2 - \sum_{i=1}^{N}\alpha_i\,\big(h(x_i)\,\beta - t_i + \xi_i\big)$
subject to: $\frac{\partial L_{D_{ELM}}}{\partial \beta} = 0,\ \frac{\partial L_{D_{ELM}}}{\partial \xi_i} = 0,\ \frac{\partial L_{D_{ELM}}}{\partial \alpha_i} = 0,\ \forall i$
Optimization Constraints of ELM and LS-SVM
LS-SVM:
Minimize: $L_{P_{LS\text{-}SVM}} = \frac{1}{2}\|w\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2$
subject to: $t_i\,(w\cdot\phi(x_i) + b) = 1 - \xi_i,\ \forall i$
In LS-SVM the optimal $\alpha_i$ are found from one hyperplane $\sum_{i=1}^{N}\alpha_i t_i = 0$.
– The corresponding dual optimization problem:
Minimize: $L_{D_{LS\text{-}SVM}} = \frac{1}{2}\|w\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2 - \sum_{i=1}^{N}\alpha_i\,\big(t_i\,(w\cdot\phi(x_i) + b) - 1 + \xi_i\big)$
subject to: $\sum_{i=1}^{N}\alpha_i t_i = 0$, $\alpha_i = C\,\xi_i$, $t_i\,(w\cdot\phi(x_i) + b) - 1 + \xi_i = 0,\ \forall i$
Optimization Constraints of ELM and SVM
ELM (inequality constraints):
Minimize: $\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{N}\xi_i$
subject to: $t_i\,h(x_i)\,\beta \ge 1 - \xi_i,\ \forall i$
            $\xi_i \ge 0,\ \forall i$
– The corresponding dual optimization problem:
Minimize: $\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j\,\alpha_i\alpha_j\, h(x_i)\cdot h(x_j) - \sum_{i=1}^{N}\alpha_i$
subject to: $0 \le \alpha_i \le C,\ \forall i$
Optimization Constraints of ELM and SVM
SVM:
Minimize: $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$
subject to: $t_i\,(w\cdot\phi(x_i) + b) \ge 1 - \xi_i,\ \forall i$
            $\xi_i \ge 0,\ \forall i$
In SVM the optimal $\alpha_i$ are found from one hyperplane $\sum_{i=1}^{N}\alpha_i t_i = 0$.
– The corresponding dual optimization problem:
Minimize: $\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j\,\alpha_i\alpha_j\, \phi(x_i)\cdot\phi(x_j) - \sum_{i=1}^{N}\alpha_i$
subject to: $0 \le \alpha_i \le C,\ \forall i$
            $\sum_{i=1}^{N}\alpha_i t_i = 0$
Optimization Constraints of ELM and
SVM
[Figure: feasible regions of ELM's inequality-constraint variant [Huang, et al 2010] (the cube $[0, C]^N$) and of SVM (a hyperplane within the cube)]
ELM (based on inequality constraint conditions) and SVM have the same dual optimization objective functions, but in ELM the optimal $\alpha_i$ are found in the entire cube $[0, C]^N$, while in SVM the optimal $\alpha_i$ are found on one hyperplane $\sum_{i=1}^{N}\alpha_i t_i = 0$ within the cube $[0, C]^N$. SVM always provides a suboptimal solution, and so does LS-SVM.
SVM’s Suboptimal Solutions
• Reasons
– SVM's historical role is irreplaceable! Without SVM and Vapnik, computational intelligence might not have been so successful and the history of computational intelligence would have been re-written! However ...
– SVM always searches for the optimal solution in the hyperplane $\sum_{i=1}^{N}\alpha_i t_i = 0$ within the cube $[0, C]^N$ of the SVM feature space.
– SVMs may apply similar application-oriented constraints to irrelevant applications and search similar hyperplanes in the feature space if their target labels are similar. Irrelevant applications may become relevant in SVM solutions.
[Huang, et al 2010]
SVM’s Suboptimal Solutions
• Reasons
– SVM is too "generous" on the feature mappings and kernels, almost condition-free except for Mercer's conditions.
1) As the feature mappings and kernels need not satisfy the universal approximation condition, the bias $b$ must be present.
2) As $b$ exists, contradictions are caused.
3) LS-SVM inherits such "generosity" from the conventional SVM.
SVM’s Suboptimal Solutions
[Figure: SVM separating hyperplanes $w\cdot\phi(x) + b = 0$ and $w\cdot\phi(x) + b = +1$]
As SVM was originally proposed for classification, universal approximation capability was not considered in the first place. Actually the feature mappings $\phi(x)$ are unknown and may not satisfy the universal approximation condition, so the bias $b$ must be present to absorb the system error. ELM was originally proposed for regression; the feature mappings $h(x)$ are known and universal approximation capability was considered in the first place. In ELM the system error tends to be zero and $b$ should not be present.
SVM’s Suboptimal Solutions
• Maximum margin?
– Maximum margin is good for binary classification cases. However, if one only considers maximum margin, it is hard to imagine what "maximum margin" means in multi-class / regression problems.
– Over-emphasizing "maximum margin" locked SVM research into binary classification and made it difficult to find a direct solution for multi-class applications.
– "Maximum margin" is just a special case of ridge regression theory, linear system stability, and neural network generalization performance theory in binary applications.
• ELM integrates ridge regression theory, linear system stability, and neural network generalization performance theory for regression and multiclass applications; "maximum margin" is just a special case in ELM's binary applications.
SVM’s Suboptimal Solutions
G.-B. Huang, et al., “Extreme learning machine for regression and multiclass classification”, IEEE Transactions on Systems, Man
and Cybernetics - Part B, vol. 42, no. 2, pp. 513-529, 2012.
G.-B. Huang, “An Insight into Extreme Learning Machines: Random Neurons, Random Features and Kernels”, Cognitive
Computation, 2014.
ELM and SVM
(a) SVM: d input nodes → hidden layers → binary output. Unknown features in each layer, black box, layer-wise information is lost.
(b) ELM: d input nodes → ELM feature space → ELM feature space → m output nodes. Layer-wise features are learned, white box.
Relationship and Difference Between
ELM and SVM/LS-SVM
ELM vs QuickNet / RVFL
[Figure: QuickNet/RVFL network with direct input-output links and "enhanced patterns" (specific ELM feature mapping such as sigmoid nodes and RBF nodes) vs. ELM network with problem-based optimization constraints]

QuickNet (1989, not patented) / RVFL (1994, patented) | ELM (not patented)
Mainly on sigmoid and RBF nodes, not applicable to kernel learning | Proved for general cases: any piecewise continuous nodes. ELM theories extended to biological neurons whose mathematical formula is even unknown
Not feasible for multi-layer RVFL, losing learning in auto-encoder and feature learning. RVFL and PCA/random projection are different | Efficient for multi-layer ELM, auto-encoder, and feature learning; PCA and random projection are specific cases of ELM when linear neurons are used
If ELM's optimization is used in QuickNet (1988) / RVFL and Schmidt (1992), a suboptimal solution tends to be achieved | Regularization of output weights, ridge regression theory, neural network generalization performance theory (maximal margin in binary-class cases); SVM and LS-SVM provide suboptimal solutions
Hidden layer output matrix: [H_ELM for sigmoid or RBF, X], X: N x d | Hidden layer output matrix: H_ELM for almost any nonlinear piecewise continuous neurons
 | Homogeneous architectures for compression, feature learning, clustering, regression and classification
Relationship and Difference Between
ELM and QuickNet/RVFL, Duin’s Work
G.-B. Huang, "What are Extreme Learning Machines? Filling the Gap between Frank Rosenblatt's Dream and John von Neumann's Puzzle," Cognitive Computation, vol. 7, pp. 263-278, 2015.
Part II
Hierarchical ELM
- Layer-wise learning
- but learning without iteratively tuning hidden neurons
- output weights analytically calculated by closed-form solutions in many applications

Multi-Layer ELM
d Input Nodes → ELM Feature Space → ELM Feature Space → m Output Nodes
Different from Deep Learning, all the hidden neurons in ELM as a whole are not required to be iteratively tuned.
ELM as Auto-Encoder (ELM-AE)
ELM as Auto-Encoder (ELM-AE)
ELM-AE vs. singular value decomposition. (a) The output weights of ELM-AE and (b) rank 20
SVD basis shows the feature representation of each number (0–9) in the MNIST dataset.
ELM as Auto-Encoder (ELM-AE)
ELM-AE based multi-layer ELM (ML-ELM): different from Deep Learning, no iteration is required in tuning the entire multi-layer feedforward network.
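A minimal sketch of the ELM-AE / ML-ELM idea in NumPy: each ELM-AE layer uses random hidden nodes and learns output weights β that reconstruct its own input, and βᵀ is then reused to transform the data for the next layer; the sigmoid choice, the ridge parameter, and the omission of the orthogonalization step used in the cited work are simplifying assumptions for illustration.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def elm_ae_layer(X, L, C=1e3, rng=None):
    """One ELM auto-encoder layer: random hidden nodes, output weights beta learned so that
    H beta reconstructs X itself; beta.T is then used to transform the data for the next layer."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((X.shape[1], L))             # random input weights (orthogonalization omitted)
    b = rng.standard_normal(L)
    H = sigmoid(X @ A + b)                                # N x L hidden-layer outputs
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ X)   # ridge solution, targets = inputs
    return sigmoid(X @ beta.T)                            # transformed representation (N x L)

def ml_elm(X, T, layer_sizes=(64, 32), C=1e3, rng=0):
    """ML-ELM sketch: stack ELM-AE layers, then one ordinary ELM output layer; no iterative tuning."""
    Z = X
    for i, L in enumerate(layer_sizes):
        Z = elm_ae_layer(Z, L, C=C, rng=rng + i)
    beta_out = np.linalg.solve(np.eye(Z.shape[1]) / C + Z.T @ Z, Z.T @ T)
    return Z @ beta_out                                    # outputs for the training data
```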
ELM vs Deep Learning
L. L. C. Kasun, et al, “Representational Learning with Extreme Learning Machine for Big Data,” IEEE Intelligent Systems, vol. 28,
no. 6, pp. 31-34, 2013.
J. Tang, et al., "Extreme Learning Machine for Multilayer Perceptron," IEEE Transactions on Neural Networks and Learning Systems, 2015 (in press).
Human Action Recognition
Methods      | ELM Based | Tensor canonical correlation | Tangent bundles on special manifolds
Accuracy (%) | 99.4      | 85                           | 93.4
[Deng, et al 2015]
Target Tracking
[Pipeline: Frame (n) → Sampling → Feature Extraction (Multilayer Encoding) → Online Sequential Updating (updating OS-ELM)]
J. Xiong, et al., "Extreme Learning Machine for Multilayer Perceptron," IEEE Transactions on Neural Networks and Learning Systems, 2015.
Target Tracking
Comparison of tracking location error using H-ELM, CT, and SDA on different data sets. (a) David Indoor. (b) Trellis
Car Detection
Methods      | ELM Based | Contour based learning | SDA
Accuracy (%) | 95.5      | 92.8                   | 93.3
Time         | 46.78 s   |                        | 3262.30 s
[Deng, et al 2014]
ELM vs Deep Learning
Learning Methods | Testing Accuracy | Training Time
ELM-AE           | 86.45            | 602 s
ELM Theory on Local Receptive Fields
and Super Nodes
[Figure: local receptive field with random input weight vector a_k and pooling size]
NORB dataset (3D object recognition) accuracies: DBNs 93.5%, DBMs 92.8%, SVMs 88.4%
Training time on NORB data: DBN 13 h, ELM 0.1 h
G.-B. Huang, et al., "Local Receptive Fields Based Extreme Learning Machine," IEEE Computational Intelligence Magazine, vol. 10, no. 2, pp. 18-29, 2015.
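As a rough, hedged illustration of the local-receptive-field idea (random convolutional filters, square-root pooling, then a single analytically computed output layer), the NumPy sketch below uses assumed filter and pooling sizes and is not the exact configuration of the cited paper.

```python
import numpy as np

def elm_lrf_features(images, n_filters=8, k=5, pool=3, rng=0):
    """Random local receptive fields: convolve with random k x k filters, then square-root pooling.
    `images` is assumed to be a list of same-sized 2-D grayscale arrays."""
    rng = np.random.default_rng(rng)
    filters = rng.standard_normal((n_filters, k, k))      # random, never-tuned filters
    feats = []
    for img in images:
        H, W = img.shape
        maps = []
        for f in filters:
            conv = np.zeros((H - k + 1, W - k + 1))        # 'valid' convolution with a random filter
            for i in range(conv.shape[0]):
                for j in range(conv.shape[1]):
                    conv[i, j] = np.sum(img[i:i + k, j:j + k] * f)
            ph, pw = conv.shape[0] // pool, conv.shape[1] // pool
            pooled = np.zeros((ph, pw))                    # square-root pooling over pool x pool blocks
            for i in range(ph):
                for j in range(pw):
                    block = conv[i * pool:(i + 1) * pool, j * pool:(j + 1) * pool]
                    pooled[i, j] = np.sqrt(np.sum(block ** 2))
            maps.append(pooled.ravel())
        feats.append(np.concatenate(maps))
    return np.stack(feats)                                 # N x (n_filters * ph * pw) feature matrix

def elm_lrf_train(images, T, C=1.0, **kw):
    H = elm_lrf_features(images, **kw)                     # random, untuned feature layer
    beta = H.T @ np.linalg.solve(np.eye(H.shape[0]) / C + H @ H.T, T)  # output weights only
    return beta
```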
ELM vs Deep Learning
Z. Bai, et al., "Generic Object Recognition with Local Receptive Fields Based Extreme Learning Machine," 2015 INNS Conference on Big Data, San Francisco, August 8-10, 2015.
ELM Slices
Traffic Sign Recognition (DNN + ELM)
ELM, SVM and Deep Learning
(a) SVM: d input nodes → hidden layers → binary output; unknown features in each layer.
(b) ELM: d input nodes → ELM feature space → ELM feature space → m output nodes; different from Deep Learning, all the hidden neurons in ELM as a whole are not required to be iteratively tuned.
[Figure: ELM subnetwork with d input nodes x_j, L hidden nodes, and problem-based optimization constraints for feature learning, clustering, regression and classification]
ELM and Deep Learning
• Compression
• Feature Learning
• Clustering
• Regression
• Classification
ELM Filling Gaps …
[Timeline figure:]
• Rosenblatt Perceptron (1958), Baum (1988), QuickNet (1989), Schmidt, et al (1992), RVFL (1994)
• Biological learning (?)
• Feature space methods: PCA (1901), SVM (1995), Random Projection (1998), LS-SVM (1999), PSVM (2001)
ELM Filling Gaps …
Before ELM theory, for these methods (Rosenblatt Perceptron (1958), Baum (1988), QuickNet (1989), Schmidt, et al (1992), RVFL (1994)):
1) Universal approximation capability was not proved for the fully random hidden nodes case.
2) Separation capability was not proved.
3) Optimization constraints were not used.
4) Dimensionality of hidden maps is usually lower than the number of training data.
ELM: 1) + random features; 2) remove the bias in the output nodes, which is contradictory to biological systems.
Towards Biological Learning, Cognition
and Reasoning?
Biological Learning | ELMs
Stable in a wide range (tens to thousands of neurons in each module) | Stable in a wide range (tens to thousands of neurons in each module)
Parallel implementation | Easy in parallel implementation
"Biological" implementation | Much easier in hardware implementation
Free of user-specified parameters | Least human intervention
One module possibly for several types of applications | One network type for different applications
Fast in micro learning point | Fast in micro learning point
Natural in online sequential learning | Easy in online sequential learning
Fast speed and high accuracy | Fast speed and high accuracy
Brains are built before applications | "Brains (devised by ELM)" can be generated before applications are present
Biological Learning vs Computers
• 60 Years Later …
• Answered by ELM Learning Theory [Huang, et al 2006, 2007, 2008]
– "As long as the output functions of hidden neurons are nonlinear piecewise continuous, and even if their shapes and modeling are unknown, (biological) neural networks with random hidden neurons attain both universal approximation and classification capabilities, and changes in a finite number of hidden neurons and their related connections do not affect the overall performance of the networks." [Huang 2014]
Biological Learning vs Computers
Things + ELMs → Intelligent Things (e.g. intelligent engines, intelligent devices, intelligent sensors, intelligent cameras, etc)

Society of Intelligent Things
Three Stages of Intelligent Things
1. Internet of Things: smart materials, smart sensors
2. Internet of Intelligent Things: intelligent things with ELMs
3. Society of Intelligent Things: Internet disappearing? From living thing intelligence to machine intelligence?
Human Intelligence vs Machine
Intelligence
Human
Intelligence
Machine
Intelligence
Part III
ELM Theories, Incremental/Sequential ELM
(ELM Web Portal: www.extreme-learning-machines.org)

Outline
1 ELM Theories
2 Incremental ELM
ELM Theory

$$\left\| f(x) - \sum_{n=1}^{L} \beta_n g_n \right\| < \epsilon \qquad (2)$$

holds with probability one if $\beta_n = \dfrac{\langle e_{n-1}, g_n\rangle}{\|g_n\|^2}$, $g_n = G(a_n, b_n, x)$, $n = 1, \cdots, L$.

Figure 2: Feedforward network architecture: any type of nonlinear piecewise continuous $G(a_i, b_i, x)$.

M. Leshno, et al., "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Networks, vol. 6, pp. 861-867, 1993.
J. Park and I. W. Sandberg, "Universal approximation using radial-basis-function networks," Neural Computation, vol. 3, pp. 246-257, 1991.
G.-B. Huang, et al., "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.
G.-B. Huang and L. Chen, "Convex incremental extreme learning machine," Neurocomputing, vol. 70, pp. 3056-3062, 2007.
G.-B. Huang, et al., "Incremental extreme learning machine with fully complex hidden nodes," Neurocomputing, vol. 71, pp. 576-583, 2008.
I-ELM

Given a training set $\aleph = \{(x_i, t_i)\,|\,x_i \in \mathbf{R}^n, t_i \in \mathbf{R}^m, i = 1, \cdots, N\}$, hidden node output function $G(a, b, x)$, maximum node number $L_{max}$ and expected learning accuracy $\epsilon$:

1 Initialization: let $L = 0$ and residual error $E = t$, where $t = [t_1, \cdots, t_N]^T$.
2 Learning step:
   while $L < L_{max}$ and $\|E\| > \epsilon$
     - Increase the number of hidden nodes $L$ by 1: $L = L + 1$.
     - Assign random hidden node parameters $(a_L, b_L)$ for the new hidden node $L$.
     - Calculate the output weight $\beta_L$ for the new hidden node: $\beta_L = \dfrac{E \cdot H_L^T}{H_L \cdot H_L^T} \approx \dfrac{\langle e_{L-1}, g_L\rangle}{\|g_L\|^2}$.
     - Calculate the residual error after adding the new hidden node $L$: $E = E - \beta_L \cdot H_L$.
   endwhile

where $H_L = [h(1), \cdots, h(N)]^T$ is the activation vector of the new node $L$ for all the $N$ training samples and $E = [e(1), \cdots, e(N)]^T$ is the residual vector; $E \cdot H_L^T \approx \langle e_{L-1}, g_L\rangle$ and $H_L \cdot H_L^T \approx \|g_L\|^2$.
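A minimal NumPy sketch of the I-ELM loop above for single-output regression; the sigmoid node type, sampling ranges and stopping values are illustrative assumptions.

```python
import numpy as np

def i_elm(X, t, L_max=200, eps=1e-3, rng=0):
    """I-ELM: add random sigmoid nodes one at a time; beta_L = E.H_L / (H_L.H_L)."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    nodes, betas = [], []
    E = t.astype(float).copy()                 # residual error, initially the targets
    L = 0
    while L < L_max and np.linalg.norm(E) > eps:
        L += 1
        a = rng.uniform(-1, 1, d)              # random hidden node parameters (a_L, b_L)
        b = rng.uniform(-1, 1)
        H_L = 1.0 / (1.0 + np.exp(-(X @ a + b)))   # activation vector of the new node (length N)
        beta_L = (E @ H_L) / (H_L @ H_L)        # output weight of the new node
        E = E - beta_L * H_L                    # update residual
        nodes.append((a, b))
        betas.append(beta_L)
    return nodes, np.array(betas)

# quick check on a toy 1-D regression
X = np.linspace(-3, 3, 300).reshape(-1, 1)
t = np.sin(2 * X[:, 0])
nodes, betas = i_elm(X, t)
f = sum(beta / (1.0 + np.exp(-(X @ a + b))) for (a, b), beta in zip(nodes, betas))
print(len(betas), np.linalg.norm(t - f))       # residual norm shrinks as nodes are added
```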
Table 3: Training time (seconds) and network complexity comparison of different algorithms.
Table 5: Performance comparison (training time in seconds) of I-ELM (with 500 random sigmoid hidden nodes), stochastic gradient descent BP (SGBP), and SVR.

Table 6: Performance comparison between the approximated threshold network (λ = 10) trained by stochastic gradient descent BP (SGBP) and the true threshold networks trained by I-ELM with 500 threshold nodes: $g(x) = -1_{\{x<0\}} + 1_{\{x\ge 0\}}$.
G.-B. Huang and L. Chen, "Enhanced random search based incremental extreme learning machine," Neurocomputing, vol. 71, pp. 3460-3468, 2008.

EI-ELM Algorithm
Same as I-ELM, except that at each step k candidate random hidden nodes are generated; for each candidate i the output weight $\beta_{(i)}$ and the resulting residual $E^{(i)}$ are computed, and then:
   + Let $i^* = \arg\min_{1\le i\le k}\|E^{(i)}\|$. Set $E = E^{(i^*)}$, $a_L = a_{(i^*)}$, $b_L = b_{(i^*)}$, and $\beta_L = \beta_{(i^*)}$.
endwhile
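A hedged sketch of the EI-ELM modification to the I-ELM loop above: at each step k candidate random nodes are generated and only the one yielding the smallest residual is kept; k, the node type and all names are illustrative.

```python
import numpy as np

def ei_elm_step(X, E, k=10, rng=None):
    """One EI-ELM step: pick the best of k candidate random sigmoid nodes (smallest residual)."""
    rng = np.random.default_rng(rng)
    best = None
    for _ in range(k):
        a = rng.uniform(-1, 1, X.shape[1])
        b = rng.uniform(-1, 1)
        H = 1.0 / (1.0 + np.exp(-(X @ a + b)))       # candidate activation vector
        beta = (E @ H) / (H @ H)                     # candidate output weight
        E_new = E - beta * H                         # candidate residual E^(i)
        if best is None or np.linalg.norm(E_new) < np.linalg.norm(best[3]):
            best = (a, b, beta, E_new)               # keep i* = argmin ||E^(i)||
    return best   # (a_L, b_L, beta_L, updated residual E)
```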
Figure 7: Testing RMSE performance comparison between EI-ELM and I-ELM (with sigmoid hidden nodes) for the Abalone case.

Figure 8: Testing RMSE updating progress with new hidden nodes added and different numbers of selection trials k in the Airplane case.
Natural Learning
1 The training observations are sequentially (one-by-one or chunk-by-chunk with varying or fixed chunk length) presented to the learning algorithm/system.
2 At any time, only the newly arrived single or chunk of observations (instead of the entire past data) are seen and learned.
3 A single or a chunk of training observations is discarded as soon as the learning procedure for that particular (single or chunk of) observation(s) is completed.
4 The learning algorithm/system has no prior knowledge as to how many training observations will be presented.

G.-B. Huang, et al., "A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation," IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 57-67, 2005.
N.-Y. Liang, et al., "A fast and accurate on-line sequential learning algorithm for feedforward networks," IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411-1423, 2006.
RAN Based
1 RAN, MRAN, GAP-RBF, GGAP-RBF
2 At any time, only the newly arrived single observation is seen and learned
3 They do not handle chunks of training observations
4 Many control parameters need to be fixed by human. Very laborious! Very tedious!
5 Training time is usually huge!

BP Based
1 Stochastic gradient BP (SGBP)
2 It may handle chunks of training observations
Minimize: $\left\| \begin{bmatrix} H_0 \\ H_1 \end{bmatrix}\beta - \begin{bmatrix} T_0 \\ T_1 \end{bmatrix} \right\| \qquad (4)$

$\beta^{(1)} = K_1^{-1}\begin{bmatrix} H_0 \\ H_1 \end{bmatrix}^T\begin{bmatrix} T_0 \\ T_1 \end{bmatrix} = K_1^{-1}\big(K_1\beta^{(0)} - H_1^T H_1\beta^{(0)} + H_1^T T_1\big) = \beta^{(0)} + K_1^{-1} H_1^T\big(T_1 - H_1\beta^{(0)}\big) \qquad (5)$

where $\beta^{(1)}$ is the output weight for all the data learned so far,

$K_1 = \begin{bmatrix} H_0 \\ H_1 \end{bmatrix}^T\begin{bmatrix} H_0 \\ H_1 \end{bmatrix} = K_0 + H_1^T H_1, \quad K_0 = H_0^T H_0, \quad \beta^{(0)} = K_0^{-1} H_0^T T_0 \qquad (6)$
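A minimal NumPy sketch of the OS-ELM update (5)-(6) above, maintaining β and P = K⁻¹ and updating them chunk by chunk (via the Woodbury identity) without revisiting old data; the sigmoid hidden layer and all names are assumptions for illustration, and the first chunk is assumed to contain at least L samples so that K₀ is invertible.

```python
import numpy as np

class OSELM:
    """Online sequential ELM: initialize from a first chunk, then update beta per chunk."""
    def __init__(self, d, L=40, rng=0):
        rng = np.random.default_rng(rng)
        self.A = rng.uniform(-1, 1, (L, d))           # fixed random hidden nodes
        self.b = rng.uniform(-1, 1, L)
        self.P = None                                 # P = K^(-1)
        self.beta = None

    def _H(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.A.T + self.b)))

    def partial_fit(self, X, T):
        H = self._H(X)
        if self.beta is None:                         # initialization: beta_0 = K_0^(-1) H_0^T T_0
            self.P = np.linalg.inv(H.T @ H)
            self.beta = self.P @ H.T @ T
        else:                                         # recursive update (Woodbury form of eq. (5)-(6))
            PHt = self.P @ H.T
            M = np.linalg.inv(np.eye(H.shape[0]) + H @ PHt)
            self.P = self.P - PHt @ M @ H @ self.P
            self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)
        return self

    def predict(self, X):
        return self._H(X) @ self.beta
```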
Table 9: Comparison between OS-ELM and other sequential algorithms on regression applications.
Table 10: Comparison between OS-ELM and other sequential algorithms on classification applications.

Time-Series Problems
Table 11: Comparison between OS-ELM and other sequential algorithms on the Mackey-Glass time series application.
Table 12: Performance comparison of ELM and OS-ELM on regression applications.
Table 13: Performance comparison of ELM and OS-ELM on classification applications.
Table 14: Performance comparison of ELM and OS-ELM on the Mackey-Glass time series application.
K. Choi, et al., "Incremental face recognition for large-scale social network services," Pattern Recognition, vol. 45.

Figure 10: Example frames from top row: Weizmann dataset, middle row: KTH dataset, and bottom row: UCF sports dataset.

Figure 11: Tracking results using action videos of run, kick, golf and dive (top to bottom) from the UCF Sports dataset.

R. Minhas, et al., "Incremental learning in human action recognition based on Snippets," (in press) IEEE.
Weizmann dataset
Methods OS-ELM Based [32] [14] [36] [11]
Frames 1/1 3/3 6/6 10/10 1/12 1/9 1/1 7/7 10/10 8/8 20/20
Accuracy 65.2 95.0 99.63 99.9 55.0 93.8 93.5 96.6 99.6 97.05 98.68
KTH dataset
Methods OS-ELM Based [25] [33] [43] [14] [36] [12]
Frames 1/1 3/3 6/6 10/10 - - - - 1/1 7/7 20/20
Accuracy 74.4 88.5 92.5 94.4 91.3 90.3 83.9 91.7 88.0 90.9 90.84
Weizmann dataset
Methods OS-ELM Based [2] [32] [14] [36] [41] [30] [11]
Frames 1/1 3/3 6/6 10/10 - - - - - - -
Accuracy 100.0 100.0 100.0 100.0 100.0 72.8 98.8 100.0 97.8 99.44 100.0
KTH dataset
Methods OS-ELM Based [14] [36] [30] [21] [27] [9] [44]
Frames 1/1 3/3 6/6 10/10 - - - - - - -
Accuracy 92.8 93.5 95.7 96.1 91.7 92.7 94.83 95.77 97.0 96.7 95.7
Table 17: Classification comparison against different approaches at sequence-level.
Open Problems
4 Does ELM always have a faster learning speed than LS-SVM if the same kernel is used?