A Generalized Methodology for Data Analysis


Plamen P. Angelov, Fellow, IEEE, Xiaowei Gu, and José C. Príncipe, Fellow, IEEE

Abstract—Based on a critical analysis of data analytics and its foundations, we propose a functional approach to estimate data ensemble properties, which is based entirely on the empirical observations of discrete data samples and the relative proximity of these points in the data space, and is hence named empirical data analysis (EDA). The ensemble functions include the nonparametric square centrality (a measure of closeness used in graph theory) and typicality (an empirically derived quantity which resembles probability). A distinctive feature of the proposed new functional approach to data analysis is that it does not assume randomness or determinism of the empirically observed data, nor independence. The typicality is derived from the discrete data directly, in contrast to the traditional approach, where a continuous probability density function is assumed a priori. The typicality is expressed in a closed analytical form that can be calculated recursively and, thus, is computationally very efficient. The proposed nonparametric estimators of the ensemble properties of the data can also be interpreted as a discrete form of the information potential (known from information theoretic learning as well as the Parzen windows). Therefore, EDA is very suitable for the current move to a data-rich environment, where the understanding of the underlying phenomena behind the available vast amounts of data is often not clear. We also present an extension of EDA for inference. The areas of application of the new EDA methodology are wide because it concerns the very foundation of data analysis. Preliminary tests show its good performance in comparison to traditional techniques.

Index Terms—Data mining and analysis, machine learning, pattern recognition, probability, statistics.

Manuscript received July 13, 2017; accepted September 7, 2017. This work was supported by The Royal Society "Novel Machine Learning Paradigms to address Big Data Streams," under Grant IE141329/2014. This paper was recommended by Associate Editor Y. Zhang. (Corresponding author: Xiaowei Gu.)
P. P. Angelov is with the School of Computing and Communications, Lancaster University, Lancaster LA1 4WA, U.K., and also holds an Honorary Professor title with Technical University, Sofia, Bulgaria (e-mail: [email protected]).
X. Gu is with the School of Computing and Communications, Lancaster University, Lancaster LA1 4WA, U.K. (e-mail: [email protected]).
J. C. Príncipe is with the Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611 USA (e-mail: [email protected]).
This paper has supplementary downloadable multimedia material available at http://ieeexplore.ieee.org provided by the authors. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCYB.2017.2753880

I. INTRODUCTION

Currently, there is a growing demand in machine learning, pattern recognition, statistics, data mining, and a number of related disciplines broadly called data science, for new concepts and methods that are centered on the actual data, the evidence collected from the real world, rather than on theoretical prior assumptions, which need to be further confirmed with the experimental data (e.g., the Gaussian assumption). The core of the statistical approach is the definition of a random variable, i.e., a functional measure from the space of events to the real line, which defines the probability law [1]–[4]. The probability density function (pdf) is, by definition, the derivative of the cumulative distribution function (cdf). It is well known that differentiation can create numerical problems in both practical and theoretical aspects and is a challenge for functions that are not analytically defined or are complex. In reality, we usually do not have independent and identically distributed (iid) events, but we do have correlated, interdependent (albeit in a complex and often unknown manner) data from different experiments, which complicates the procedure.

The appeal of the traditional statistical approach is its solid mathematical foundation and the ability to provide guarantees of performance when data is plenty (N → ∞) and created from the same distribution that was hypothesized in the probability law. The actual data is usually discrete (or discretized), which in traditional probability theory and statistics is modeled as realizations of the random variables, but one does not know a priori their distribution. If the prior data generation hypothesis is verified, good results can be expected; otherwise this opens the door for many failures.

Even in the case that the hypothesized measure meets the realizations, one has to address the difference of working with realizations and random variables, which brings the issue of choosing estimators of the statistical quantities necessary for data analysis. This is not a trivial problem, and it is seldom discussed in data analysis. The simple determination of the probability law (the measure of the random variable) that explains the collected data is a hard problem, as studied in density estimation [1]–[3]. Moreover, if we are interested in statistical inference, for instance, similarity between two random variables using mutual information, the problem gets even harder because different estimators may provide different results [5]. The reason is that very likely the functional properties of the chosen estimator do not preserve all the properties embodied in the statistical quantity. Therefore, they behave differently in the finite (and even in the infinite) sample case.

An alternative approach is to proceed from the realizations to the random variables, which is the reverse direction of the statistical approach. The literature has several excellent examples of this approach in the area of measures of association. For instance, Pearson's correlation coefficient is perfectly well defined in realizations, as well as in random variables. Likewise, Spearman's ρ [6] and Kendall's τ [7] are other examples of measures of association well defined in both the realizations and the random variables. However, the problem with this approach is that the statistical properties of
the measures in the random variables are not directly known, and may not be easily obtained. A good example of the latter is the generalized measure of association, which is well defined in the realizations, but not all of its properties are known in the random variables [8]. Therefore, there are advantages and disadvantages in each approach, but from a practical point of view, the nonparametric approach is very appealing because we can go beyond the framework of statistical reasoning to define new operators and still cross-validate the solutions with the available data using nonparametric hypothesis tests. A good example is least squares versus regression. One can always apply least squares to any data type, deterministic or stochastic. If the data is stochastic the solution is called regression, but the result will be the same, because the autocorrelation function is a property of the data, independent of its type. The difference shows up only in the interpretation of the solution; most importantly, the statistical significance of the result can only be assessed using regression.

A more recent alternative is to approximate the distributions using nonparametric, data-centered functions, such as particle filters [9], entropy-based information theoretic learning (ITL) [5], etc. On the other hand, partially trying to address the same problems, in 1965 Zadeh [10] introduced fuzzy sets theory, which completely departed from objective observations and moved (similarly to the belief-based theory [8] introduced a bit later) to a subjectivist definition of uncertainty. A later strand of fuzzy set theory (the data-driven approach developed mainly in the 1990s) attempted to define the membership functions based on experimental data. It stands in between the probabilistic and fuzzy representations [11]; however, this approach requires an assumption on the type of membership function. An important challenge is the posterior distribution approximation. Approximate inference can be done employing maximum a posteriori criteria, which requires complex optimization schemes involving, for example, the expectation maximization algorithm [1]–[3].

In this paper, we present a systematic methodology of nonparametric estimators recently introduced in [12]–[14] for discrete sets, using ensemble statistical properties of the data derived entirely from the experimental discrete observations, and extend them to continuous spaces. These include the cumulative proximity (q), centrality (C), square centrality (q^{-1}), standardized eccentricity (ε), density (D), as well as typicality (τ), which can be extended to continuous spaces, resembling the information potential obtained from Parzen windows [1]–[4] in ITL [5]. Typicality sums up to 1 (while its continuous version integrates to 1) and is always positive; however, its values are always less than 1, unlike the pdf values that can be greater than 1. Additionally, the typicality is only defined for feasible values of the independent variable, while the pdf can extend to infeasible values, e.g., negative height, distance, weight, absolute temperature, etc., unless specifically constrained [12]–[14]. We further consider discrete local (τ) and global (τ^D) versions. Then, we introduce an automatic procedure for identifying the local modes/maxima of τ^D as well as a procedure for reducing the number of local maxima/modes, and we extend the nonparametric estimators to the continuous domain by introducing the continuous global density, D^G, and typicality, τ^G, which further involves an integral for normalization. Furthermore, we demonstrate that the continuous global typicality does integrate to 1 exactly as the traditional pdf (while being free from the restrictions the latter has). This is a new and significant result which makes the continuous global typicality an alternative to the pdf. This strengthens the ability of the empirical data analysis (EDA) framework to objectively investigate the unknown data pattern behind the data and opens up the framework for inference. The methodology is exemplified with a naïve EDA classifier based on τ^G.

II. THEORETICAL BASIS—DISCRETE SETS

In this section, we start by presenting the EDA foundations in discrete sets [12]–[14] for completeness and further clarity. First, let us consider a real metric space R^K and assume a particular data set or stream {x}_N = {x_1, x_2, ..., x_N} ∈ R^K, with x_i = [x_{i,1}, x_{i,2}, ..., x_{i,K}]^T, i = 1, 2, ..., N, where subscripts denote data samples (for a set) or the time instances when they arrive (for a stream). Within the data set/stream, some data samples may repeat more than once, namely, ∃ x_i = x_j, i ≠ j. The set of sorted unique data samples, denoted by {u}_{L_N} = {u_1, u_2, ..., u_{L_N}} (where {u}_{L_N} ⊆ {x}_N, 1 < L_N ≤ N), and the numbers of occurrence, denoted by {f}_{L_N} = {f_1, f_2, ..., f_{L_N}}, can be determined automatically based on the data. With {u}_{L_N} and {f}_{L_N}, the primary data set/stream {x}_N can be reconstructed. In the remainder of this paper, all the derivations are conducted at the Nth time instance except when specifically declared otherwise. The most obvious choice of R^K is the Euclidean space with the Euclidean distance, but we can also extend the EDA definitions to Hilbert spaces and reproducing kernel Hilbert spaces. We can, moreover, consider different types of distances within these spaces, motivated by the purposes of the analysis, which exploit information available from the source that generated the samples or definitions that are appropriate for data analysis. Within EDA, we introduce:
1) cumulative proximity, q [12]–[14];
2) square centrality, q^{-1};
3) eccentricity, ξ [12]–[14];
4) standardized eccentricity, ε [12]–[14];
5) discrete local density, D [12]–[14];
6) discrete local typicality, τ [14];
7) discrete global typicality, τ^D [14];
8) continuous local density, D^L;
9) continuous global density, D^G;
10) continuous global typicality, τ^G.
The discrete global typicality, τ^D, addresses the global properties of the data and will be introduced in the next section. For inference, the continuous local density (D^L), the continuous global density (D^G), and the continuous global typicality (τ^G) will be described in detail in Section IV.
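As a small, hedged illustration of the setup above (not part of the original paper), the unique samples {u} and their numbers of occurrence {f} can be recovered from a data set/stream with a single library call; the function name below is hypothetical:

```python
import numpy as np

def unique_samples_and_frequencies(X):
    """Minimal sketch: recover the sorted unique samples {u} and their frequencies {f}
    from a data set/stream {x}_N, as assumed by the EDA derivations. numpy.unique sorts
    the rows, which matches the 'sorted unique data samples' of the text."""
    u, f = np.unique(np.atleast_2d(X), axis=0, return_counts=True)
    return u, f

# toy usage: {x}_5 with two unique values
X = np.array([[10.0], [20.0], [10.0], [20.0], [20.0]])
u, f = unique_samples_and_frequencies(X)
print(u.ravel(), f)   # [10. 20.] [2 3]; (u, f) suffice to reconstruct {x}_N up to order
```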
A. Cumulative Proximity and Square Centrality

For every point x_i ∈ {x}_N, i = 1, 2, ..., N, one may want to quantify how close or similar this point is to all other data points from {x}_N. In graph theory, centrality is used to indicate the most important vertices within a graph. A measure of centrality [15], [16] is defined through the sum of distances from a point x_i to all other points

  c_N(x_i) = \frac{1}{\sum_{j=1}^{N} d(x_i, x_j)}; \quad x_i \in \{x\}_N;\ 1 \le i \le N;\ L_N > 1   (1)

where d(x_i, x_j) is the distance/similarity between x_i and x_j, which can be, but is not limited to, Euclidean, Mahalanobis, cosine, etc. Its importance comes from the fact that it provides centrality information about each data sample in a scalar or vector form. We previously defined [12]–[14] the cumulative proximity q_N(x_i) as

  q_N(x_i) = \sum_{j=1}^{N} d^2(x_i, x_j); \quad x_i \in \{x\}_N;\ L_N > 1   (2)

which can be seen as an inverse centrality with a squared distance. The cumulative proximity [12]–[14] is a very important association measure derived empirically from the observed data without making any prior assumptions about their generation model, and it plays a fundamental role in deriving other EDA quantities. The complexity of computing the cumulative proximities of all samples in {x}_N is O(N^2). As a result, the computational complexity of other EDA quantities for {x}_N, which can be derived directly from the cumulative proximity, is O(N). For many types of distance/similarity, i.e., Euclidean distance, Mahalanobis distance, cosine similarity, etc., for which the cumulative proximity can be calculated recursively [14], the complexity of calculating the cumulative proximities of all the samples in {x}_N is reduced to O(N) as well.

In a very similar manner, we can consider the square centrality as the inverse of the cumulative proximity, defined as follows:

  q_N^{-1}(x_i) = \frac{1}{\sum_{j=1}^{N} d^2(x_i, x_j)}; \quad L_N > 1.   (3)

B. Eccentricity

The eccentricity, ξ_N, defined as a normalized cumulative proximity, is another very important association measure derived empirically from the observed data without making any prior assumptions about their generation model [12]–[14]. It quantifies how far data samples are from the mode, which is useful to represent distribution tails and anomalies/outliers. It is derived by normalizing q_N, taking into account all possible data samples. It plays an important role in anomaly detection [14] as well as for the estimation of the typicality, as will be detailed below. The eccentricity (ξ_N) of a particular data sample x_i in the set {x}_N (L_N > 1) is calculated as follows [12]–[14]:

  ξ_N(x_i) = \frac{2 q_N(x_i)}{\sum_{j=1}^{N} q_N(x_j)} = \frac{2 \sum_{l=1}^{N} d^2(x_i, x_l)}{\sum_{j=1}^{N} \sum_{l=1}^{N} d^2(x_j, x_l)}   (4)

where the coefficient 2 is included to normalize the eccentricity between 0 and 1, that is

  0 ≤ ξ_N(x_i) ≤ 1.   (5)

Here, we also introduce the standardized eccentricity, ε, which does not decrease as fast as the eccentricity with the increase of the amount of data, N, and is calculated as follows:

  ε_N(x_i) = N ξ_N(x_i) = \frac{2 q_N(x_i)}{\frac{1}{N} \sum_{j=1}^{N} q_N(x_j)}; \quad L_N > 1.   (6)

Based on the expression of the standardized eccentricity [namely, (6)], one can see that the data samples which are far away from the majority tend to have higher standardized eccentricity values compared with the others. Thus, the standardized eccentricity can serve as an effective measure of the tail of the data distribution without the need of clustering the data in advance. Combining the standardized eccentricity with the well-known Chebyshev inequality [17], which describes the probability that a certain data sample x is more than nσ (σ denotes the standard deviation) away from the mean, we get the EDA version of the Chebyshev inequality as follows [12], [14]:

  P(ε_N(x) ≤ n^2 + 1) ≥ 1 − \frac{1}{n^2}.   (7)

The Chebyshev inequality expressed by the standardized eccentricity provides a more elegant form for anomaly detection. For example, if ε_N(x) > 10, x has exceeded the 3σ limit and can be categorized as an anomaly.

C. Discrete Local Density

The discrete local density is defined as the inverse of the standardized eccentricity and plays an important role in data analysis using EDA (i = 1, 2, ..., N; L_N > 1)

  D_N(x_i) = ε_N^{-1}(x_i) = \frac{\sum_{j=1}^{N} q_N(x_j)}{2 N q_N(x_i)} = \frac{\sum_{j=1}^{N} \sum_{l=1}^{N} d^2(x_j, x_l)}{2 N \sum_{l=1}^{N} d^2(x_i, x_l)}.   (8)

For example, if the Euclidean distance is used, the density can be expressed as (i = 1, 2, ..., N; L_N > 1)

  D_N(x_i) = \frac{1}{1 + \frac{\| x_i − µ_N \|^2}{X_N − µ_N^T µ_N}}   (9)

where µ_N is the mean of {x}_N and X_N is the mean of {x^T x}_N; µ_N and X_N can be updated recursively using [18]

  µ_k = \frac{k−1}{k} µ_{k−1} + \frac{1}{k} x_k, \quad µ_1 = x_1;
  X_k = \frac{k−1}{k} X_{k−1} + \frac{1}{k} x_k^T x_k, \quad X_1 = x_1^T x_1; \quad k = 1, 2, ..., N.

As we can see from (9), the discrete local density itself can be viewed as a univariate Cauchy function, while there is no assumption or any predefined parameter involved in the derivation besides the definition of the distance function (the Euclidean distance is used here).

D. Discrete Local Typicality

The discrete local typicality was first introduced in [13], where it was called unimodal typicality. In this paper, it is redefined as the normalized local density (i = 1, 2, ..., N; L_N > 1)

  τ_N(x_i) = \frac{D_N(x_i)}{\sum_{j=1}^{N} D_N(x_j)} = \frac{q_N^{-1}(x_i)}{\sum_{j=1}^{N} q_N^{-1}(x_j)}.   (10)
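The quantities (2)–(10) can be computed in a few lines for the Euclidean case. The following is a minimal sketch under that assumption (the function name and the toy data are ours, not the authors'); it also checks numerically that the closed form (9) coincides with definition (8):

```python
import numpy as np

def eda_discrete_quantities(X):
    """Batch computation of the discrete EDA quantities of (2)-(10) for a data matrix
    X (N x K) with the Euclidean distance; requires at least two distinct samples."""
    N = X.shape[0]
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise d^2
    q = sq_dist.sum(axis=1)          # cumulative proximity, (2)
    q_inv = 1.0 / q                  # square centrality, (3)
    xi = 2.0 * q / q.sum()           # eccentricity, (4); values in [0, 1], sum = 2
    eps = N * xi                     # standardized eccentricity, (6)
    D = 1.0 / eps                    # discrete local density, (8)
    tau = D / D.sum()                # discrete local typicality, (10); sums to 1
    return q, q_inv, xi, eps, D, tau

X = np.array([[0.0], [1.0], [1.5], [10.0]])          # toy 1-D data with one outlier
q, q_inv, xi, eps, D, tau = eda_discrete_quantities(X)

# cross-check of the Cauchy form (9) against (8), and of the recursive mean/scalar-product
# updates below (9) against the batch means
mu, Xbar = X.mean(axis=0), np.mean(np.sum(X * X, axis=1))
D_cauchy = 1.0 / (1.0 + np.sum((X - mu) ** 2, axis=1) / (Xbar - mu @ mu))
assert np.allclose(D, D_cauchy)
print(tau, tau.sum())   # typicality sums to 1; the outlier at 10.0 has the largest eps
```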
The discrete local typicality resembles the traditional unimodal probability mass function (pmf), but it is automatically defined on the data support, unlike the pmf, which may have nonzero values for infeasible values of the random variable unless specifically constrained. The discrete local density resembles the membership function of a fuzzy set, having a value of 1 for x = µ, while the discrete local typicality resembles a pmf, with the sum of the N values of τ_N being equal to 1 and the values of both D and τ lying in the interval [0, 1].

As an example, the square centrality, standardized eccentricity, discrete local density, and typicality of the real climate dataset (wind chill and wind gust) measured in Manchester, U.K. for the period 2010–2015 [19] are presented in Fig. 1 in the supplementary material. In these examples, the Euclidean distance is used.

III. THEORETICAL BASIS: DISCRETE GLOBAL TYPICALITY

In this section, we will consider the more realistic case when data distributions are multimodal. Traditionally, this requires identifying local peaks/modes by clustering, expectation maximization, optimization, etc. [1]–[3], [20]–[22]. Within EDA, the discrete global typicality (τ^D) is derived automatically from the data with no user input and can quantify multimodality. It is based on the local cumulative proximity, square centrality, eccentricity, and standardized eccentricity. The only requirements to define the discrete global typicality are the raw data and the type of distance metric (which can be any).

A. Discrete Global Typicality

Expressions (9) and (10) provide definitions of local operators that are very appropriate to quantify the peak point (x*) of unimodal discrete functions. Moreover, if the peak coincides with the global mean µ_N (x* = µ_N), then the value of the local density is equal to 1: D_N(µ_N) = 1. A similar property of having a maximum, though with a value < 1, is also valid for the traditional probability by definition and according to the central limit theorem [1]–[3]. In reality, data distributions are usually multimodal [20]–[23]; therefore, the local description should be improved. In order to address this issue, traditional probability theory often involves mixtures of unimodal distributions, which requires estimation of the number of modes, and this is not easy [23]. Within the EDA framework, we provide the discrete global typicality, τ^D, directly from the dataset, which provides multimodal distributions automatically without the need of user decisions and only requires a threshold for robustness against outliers.

The discrete global typicality of a unique data sample is expressed as a combination of the normalized discrete local density weighted by the corresponding frequency of occurrence of this unique data sample (i = 1, 2, ..., L_N; L_N > 1)

  τ_N^D(u_i) = \frac{f_i D_N(u_i)}{\sum_{j=1}^{L_N} f_j D_N(u_j)} = \frac{f_i q_N^{-1}(u_i)}{\sum_{j=1}^{L_N} f_j q_N^{-1}(u_j)}   (11)

where q_N^{-1}(u_i) and D_N(u_i) are the square centrality and the discrete local density of a particular data sample u_i, calculated from {u}_{L_N} only.

This expression is very fundamental because, in fact, it combines information about the repeated data values and the scattering across the data space, and it resembles the well-known membership functions of fuzzy sets. We further explain this link in [24].

Fig. 1. (a) Histogram and (b) discrete global typicality τ^D of the real climate data [19] using Euclidean distance.

One can easily appreciate from Fig. 1 the differences between τ^D and the histogram with a quantization step equal to 5 for both dimensions. Note that the histogram requires the selection of one parameter (the quantization step) per dimension, while none is needed for the discrete global typicality. For large dimensions, this can be a big problem. The size of the grid/axis is a user-specified parameter. The histogram takes only values from the finite set {0; 1; 2; ...; N}, while τ_N^D can take any real value.
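Definition (11) is straightforward to sketch in code. The following is a minimal, hedged illustration (our own function name, using the Euclidean distance and the square-centrality form of (11)); as a sanity check it reproduces the two-value wind-chill example discussed in the next subsection, where 20 observations of 10 °C and 30 observations of 20 °C give τ^D = {0.4, 0.6}:

```python
import numpy as np

def discrete_global_typicality(X):
    """Discrete global typicality tau^D of (11): frequency-weighted, normalized square
    centrality of the unique samples (computed from {u}_{L_N} only). Euclidean sketch."""
    u, f = np.unique(np.atleast_2d(X), axis=0, return_counts=True)
    sq_dist = np.sum((u[:, None, :] - u[None, :, :]) ** 2, axis=-1)
    q_inv = 1.0 / sq_dist.sum(axis=1)     # square centrality on the unique samples, (3)
    w = f * q_inv
    return u, f, w / w.sum()              # tau^D sums to 1 by construction

X = np.array([[10.0]] * 20 + [[20.0]] * 30)   # 20 x 10 C and 30 x 20 C
u, f, tau_d = discrete_global_typicality(X)
print(u.ravel(), tau_d)                        # [10. 20.] [0.4 0.6]
```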
The discrete global typicality has the following properties.
1) It sums up to 1.
2) Its value is within [0, 1].
3) It provides a closed analytic form, (11).
4) There is no requirement for prior assumptions, nor for any user- or problem-specific thresholds and parameters.
5) It is free from some peculiarities of traditional probability theory (its value never exceeds 1 and it does not take nonzero positive values for infeasible values of the variable [14]).
6) It can be recursively calculated for various types of metrics.

When all the data samples in the dataset have different values (f_i = 1, ∀i) and the histogram quantization step parameter is not properly set, the histogram is unable to show any useful information, while the discrete global typicality can still show the mutual distribution information of the dataset, see Fig. 2(a) and (b). This is a major advantage of the discrete global typicality because it is parameter free. Here, the figures are based on the unique data samples of the same climate dataset. As we can see, the data samples which are closer to the mean of the dataset have a higher value of the global typicality, and vice versa.

Fig. 2. Histogram and discrete global typicality for the unique data samples. (a) Histogram with very small quantization. (b) Discrete global typicality, τ^D.

It is also interesting to notice that, for equally distant data, the discrete global typicality, τ_N^D, is exactly the same as the frequentist form of probability. Then (11) reduces to τ_N^D(u_i) = f_i / \sum_{j=1}^{L_N} f_j. Fig. 2 in the supplementary material shows a simple example of the discrete global typicality τ_N^D and the pdf of an artificial climate dataset {x}_50 with only wind chill data, which has two unique data samples, {u}_2 = {10; 20} (°C), while {f}_2 = {20; 30}. Obviously, q_50(u_1) = q_50(u_2) = d^2(u_1, u_2), and τ_50^D(10 °C) = 0.4; τ_50^D(20 °C) = 0.6. Indeed, if 20 times we observe a wind chill of 10 °C and 30 times a wind chill of 20 °C, the likelihood for a wind chill of 10 °C will be 40% and for a wind chill of 20 °C will be 60%, respectively.

The discrete global typicality τ_100^D of the outcome of throwing dice 100 times is presented in Fig. 3 in the supplementary material as an additional illustrative example. In this experiment, for the outcome 1 we can use [1; 0; 0; 0; 0; 0]^T, for 2 we can use [0; 1; 0; 0; 0; 0]^T, etc. Let the outcomes of throwing dice 100 times be {f}_6 = {17; 14; 15; 15; 21; 18}; then the values of the discrete global typicality τ^D of the six outcomes are equal to their corresponding frequencies, see Fig. 3 in the supplementary material.

B. Identifying Local Modes of Discrete Global Typicality

In this section, an automatic procedure for identifying all local maxima of the discrete global typicality, τ_N^D, defined in the previous section will be described. It results in the formation of data clouds (samples associated with the local maxima) [18], [25]. Data clouds are shape-free, while clusters are usually hyper-spherical or hyper-ellipsoidal. This data partitioning resembles a Voronoi tessellation [26]. Data clouds are also used in the AnYa type neuro-fuzzy predictors [18], [25], classifiers, and controllers.

The illustrative figures in this section are based on the same climate dataset [19] that was used earlier in Fig. 1, which has two features/attributes: wind chill (°C) and wind gust (mph). In all cases, the Euclidean distance is used, though the principle is valid for any metric.

The proposed Algorithm 1 can be summarized as follows.

Step 1: Identifying the global maximum of the discrete global typicality τ_N^D.
For every unique data sample of the dataset {x}_N, its discrete global typicality τ_N^D(u_i) (i = 1, 2, ..., L_N) can be calculated using (11). The data sample with the highest τ_N^D is selected as the reference data sample in the ranked collection {u*}_{L_N}

  u^{*(1)} = \arg\max_{j=1,2,...,L_N} τ_N^D(u_j)   (12)

where u^{*(1)} is the data sample with the highest value of the discrete global typicality (in fact, the global maximum), and we set u^{*m} ← u^{*(1)}. In case there is more than one maximum, we can start with any one of them.

Step 2: Ranking the discrete global typicality τ_N^D.
Then, we find the unique data sample that is nearest to u^{*m}, denoted by u^{*(2)}, put it into {u*}_{L_N} and, meanwhile, remove it from {u}_{L_N}. u^{*(2)} is set to be the new reference, u^{*m} ← u^{*(2)}. The ranking operation continues by finding the next data sample which is closest to u^{*m}, putting it into {u*}_{L_N}, removing it from {u}_{L_N}, and setting it as the new reference. By applying the ranking operation until {u}_{L_N} becomes empty, we finally get the ranked unique data samples, denoted as {u*}_{L_N} = {u^{*(i)} | i = 1, 2, ..., L_N}, and their corresponding ranked discrete global typicality collection {τ_N^D(u^{*(1)}), τ_N^D(u^{*(2)}), ..., τ_N^D(u^{*(L_N)})}.

Step 3: Identifying all local maxima.
The ranked discrete global typicality is filtered using (13) to detect all local maxima of τ_N^D

  IF τ_N^D(u^{*(j−1)}) < τ_N^D(u^{*(j)}) AND τ_N^D(u^{*(j)}) > τ_N^D(u^{*(j+1)}) THEN u^{*(j)} is a local maximum of τ_N^D.   (13)

We denote the set of the local maxima of τ_N^D (which can be used as a basis for forming data clouds and, further, AnYa type fuzzy rule-based models [18], [25]) as {u**}_{P_N} = {u^{**(j)} | j = 1, 2, ..., P_N}; P_N is the number of identified local maxima and P_N ≤ L_N.

Fig. 3. Identifying local maxima of the discrete global typicality, τ^D. (a) Ranked discrete global typicality τ^D. (b) Local maxima/peaks/modes of τ^D.

The ranked discrete global typicality is depicted in Fig. 3(a); the corresponding local maxima are depicted in Fig. 3(b).

Step 4: Forming data clouds.
Each local maximum, u^{**(i)}, is then set as a prototype of a data cloud. All other data points are assigned to the nearest prototype (local maximum), forming data clouds using

  winning label = \arg\min_{j=1,2,...,P_N} d(x, u^{**(j)}).   (14)

Data clouds can be used to form AnYa models [18], [25]. After all the data samples within {x}_N are assigned to the data clouds, the center (mean) µ_N^j, the standard deviation σ_N^j, and the support S_N^j (j = 1, 2, ..., P_N) per cloud can be calculated.

Step 5: Selecting the main local maxima of the discrete global typicality τ_N^D.
We then calculate τ_N^D at the data cloud centers, denoted by {µ}_{P_N}, using (11) with the corresponding supports as their frequencies. Then, we use the following operation to remove the less prominent local maxima. For each center µ_N^i, we check the condition (i, j = 1, 2, ..., P_N; i ≠ j)

  IF ( \| µ_N^i − µ_N^j \| ≤ 2σ_N^i AND τ_N^D(µ_N^i) < τ_N^D(µ_N^j) ) THEN {µ}_R ← µ_N^i.   (15)

This condition means that if there is another center with higher τ_N^D located within the 2σ_N^i area of µ_N^i, this less prominent center is replaceable. This condition guarantees that the influence areas of neighboring data clouds will not overlap significantly (it is well known that, according to the Chebyshev inequality, for an arbitrary distribution the majority of the data samples (>75%) lie within a 2σ distance from the mean [1]–[3]).

By finding all the centers satisfying the above condition and assigning them to {µ}_R, we get the filtered data cloud centers, denoted by {µ*}_{P*_N} = {µ_N^{*(j)} | j = 1, 2, ..., P*_N; P*_N ≤ P_N}, by excluding {µ}_R from {µ}_{P_N} ({µ*}_{P*_N} ∪ {µ}_R = {µ}_{P_N} and {µ*}_{P*_N} ∩ {µ}_R = Ø), where P*_N is the number of remaining centers. After that, we set {u**}_{P_N} ← {µ*}_{P*_N}, P_N ← P*_N and repeat Steps 4 and 5 until the data cloud centers do not change any more.

Finally, we get the composed result, renamed {µ°}, and use {µ°} as the prototypes to build data clouds using (14). The final data cloud centers for each selection round are presented in the video in the supplementary material, which can also be downloaded from [27].
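Steps 1–3 amount to a nearest-neighbour ranking chain followed by a peak filter. The sketch below is our reading of (12)–(13), not the authors' code; it takes the unique samples and their τ^D values (for instance from the discrete_global_typicality() sketch above) and, as an assumption not stated explicitly in the text, also treats the global maximum itself as a mode:

```python
import numpy as np

def rank_and_find_modes(u, tau_d):
    """Steps 1-3 of the mode-identification procedure (rough sketch): rank the unique
    samples by the nearest-neighbour chain started at the global maximum of tau^D (12),
    then detect the strict local maxima of the ranked sequence (13)."""
    L = len(u)
    order = [int(np.argmax(tau_d))]               # Step 1: global maximum, (12)
    remaining = set(range(L)) - {order[0]}
    while remaining:                              # Step 2: repeatedly take the sample
        last = order[-1]                          # nearest to the current reference
        nxt = min(remaining, key=lambda j: np.sum((u[j] - u[last]) ** 2))
        order.append(nxt)
        remaining.remove(nxt)
    ranked = tau_d[order]
    modes = [order[0]]                            # assumption: the global maximum is a mode
    modes += [order[j] for j in range(1, L - 1)   # Step 3: strict local maxima, (13)
              if ranked[j - 1] < ranked[j] > ranked[j + 1]]
    return order, modes                            # indices into u
```

The prototypes u[modes] found this way would then be passed to the data-cloud formation and 2σ filtering of Steps 4 and 5, which are sketched after Algorithm 1 below.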
The final result is presented in Fig. 4. Compared with Fig. 3(b), in the final round there are only two main modes left, broadly corresponding to the two main seasons in Northern England, and all the details are filtered out.

Fig. 4. Final filtering result (the black "*" denotes the centers of the data clouds; the data samples from different data clouds are plotted with different colors).

Even if f_i = 1, ∀i, the discrete global typicality can still be extracted successfully from the data samples, despite the fact that the result may not be exactly the same because of the changing data structure, see Fig. 4 in the supplementary material, which uses the same real climate dataset as in Fig. 4.

The summary of Algorithm 1 is as follows.

Algorithm 1 Automatic Mode Identification Algorithm
i. Calculate τ_N^D(u_i), i = 1, 2, ..., L_N using equation (11);
ii. Find the unique data sample u^{*(1)} with the global maximum of τ_N^D using equation (12);
iii. Send u^{*(1)} into {u*}_{L_N} and τ_N^D(u^{*(1)}) into {τ_N^D(u*)}_{L_N}, and delete u^{*(1)} from {u}_{L_N};
iv. u^{*m} ← u^{*(1)};
v. While {u}_{L_N} ≠ Ø
  * Find the unique data sample which is nearest to u^{*m};
  * Send the data sample and the corresponding τ_N^D into {u*}_{L_N} and {τ_N^D(u*)}_{L_N};
  * Delete the data sample from {u}_{L_N};
  * Set the latest element in {u*}_{L_N} as u^{*m};
vi. End While
vii. Filter {u*}_{L_N} and {τ_N^D(u*)}_{L_N} using equation (13) and obtain {u**}_{P_N} as centers of data clouds;
viii. While {u**}_{P_N} are not fixed
  * Use {u**}_{P_N} and form the data clouds from {x}_N using equation (14);
  * Obtain the new centers {µ}_{P_N}, standard deviations {σ}_{P_N}, and supports {S}_{P_N} of the data clouds;
  * Calculate τ_N^D(µ_N^j), j = 1, 2, ..., P_N using equation (11);
  * Find {µ}_R satisfying equation (15);
  * Exclude {µ}_R from {µ}_{P_N} and obtain {µ*}_{P*_N};
  * {u**}_{P_N} ← {µ*}_{P*_N};
  * P_N ← P*_N;
ix. End While
x. {µ°} ← {u**}_{P_N};
xi. Build the data clouds with {µ°} using equation (14);
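Continuing the earlier sketch, the loop of lines viii–xi (Steps 4 and 5) can be written as below. This is only our hedged reading of (14)–(15) for the Euclidean case, with the square centrality of (11) computed among the centers and the supports used as frequencies; the function name and the per-cloud scalar spread σ are our own choices:

```python
import numpy as np

def form_and_filter_clouds(X, prototypes, max_rounds=100):
    """Steps 4-5 of Algorithm 1 (rough sketch): assign every sample to its nearest
    prototype (14), drop centers with a stronger neighbour inside their 2-sigma radius
    (15), and repeat until the data cloud centers stop changing."""
    centers = np.asarray(prototypes, dtype=float)
    label = np.zeros(len(X), dtype=int)
    for _ in range(max_rounds):
        # Step 4: nearest-prototype assignment, (14), and per-cloud statistics
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        label = d2.argmin(axis=1)
        mu = np.array([X[label == j].mean(axis=0) for j in range(len(centers))])
        sigma = np.array([np.sqrt(np.mean(np.sum((X[label == j] - mu[j]) ** 2, axis=1)))
                          for j in range(len(centers))])
        support = np.array([(label == j).sum() for j in range(len(centers))], dtype=float)
        if len(centers) < 2:
            centers = mu
            break
        # tau^D at the cloud centers, (11), with supports in the role of frequencies
        q_inv = 1.0 / np.sum((mu[:, None, :] - mu[None, :, :]) ** 2, axis=-1).sum(axis=1)
        tau_c = support * q_inv
        tau_c = tau_c / tau_c.sum()
        # Step 5: 2-sigma filtering of the less prominent centers, (15)
        keep = np.ones(len(mu), dtype=bool)
        for i in range(len(mu)):
            for j in range(len(mu)):
                if i != j and np.linalg.norm(mu[i] - mu[j]) <= 2 * sigma[i] and tau_c[i] < tau_c[j]:
                    keep[i] = False
        new_centers = mu[keep]
        if len(new_centers) == len(centers) and np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, label
```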
C. Properties of EDA Operators

Having introduced the basic EDA operators, we will now outline their properties.
1) They are entirely based on the empirically observed experimental data and their mutual distribution in the data space.
2) They do not require any user- or problem-specific thresholds and parameters to be prespecified.
3) They do not require any model of data generation (random or deterministic), only the type of distance metric used (which, however, can be any).
4) The individual data samples (observations) do not need to be iid; on the contrary, their mutual dependence is taken into account directly through the mutual distances between the data samples.
5) The method does not require an infinite number of observations and can work with just a few exemplars.

Within EDA, we can still consider cross-validation and nonparametric statistical tests based on the realizations of the experimentally observed data, similarly to the significance tests utilized on the random variables assumed in traditional probability theory and statistics. As a conclusion, EDA can be seen as an advanced data analysis framework which can work efficiently with any feasible data and any type of distance or similarity metric.

IV. THEORETICAL BASIS—CONTINUOUS DENSITY AND TYPICALITY

Up to this point, all EDA definitions are useful to describe data sets or data streams made up of a discrete number of observations. However, they cannot be used for inference because they are only defined on points where samples occur (discrete spaces). In this section, we define the continuous local and global density and the continuous global typicality, which can be used for inference on the continuous domain of the variable x. At this stage, we depart from the entirely data-based and assumption-free approach we used so far; however, this is done after we have identified the local modes, formed data clouds around these focal points, and obtained the supports of these data clouds. Therefore, the extension to the continuous domain is inherently local (per data cloud). We assume that the local mode considered as the mean and the support considered as frequency, plus the deviation of the empirical data, do provide the triplet of parameters (µ, X, S). We do recognize that these triplets are conditional on the specific S data samples observed and associated with the particular data cloud, but this will be updated when new data is available. Now, having this triplet of parameters, we first define the continuous local density, D^L, as

  D_N^L(x) = \frac{\sum_{j=1}^{N} q_N(x_j)}{2 N q_N(x)}; \quad L_N > 1.   (16)

Like (9), for the case of the Euclidean distance, the continuous local density, D^L, is simplified to a continuous Cauchy type function over any feasible value of the variable x, with the parameters µ and X extracted from the S available data samples
as described earlier

  D_{N,i}^L(x) = \frac{1}{1 + \frac{\| x − µ_{N,i} \|^2}{σ_{N,i}^2}}; \quad i = 1, 2, ..., C_N;\ L_N > 1   (17)

where σ_{N,i}^2 = X_{N,i} − µ_{N,i}^T µ_{N,i}; µ_{N,i} and X_{N,i} are the mean and the average value of the scalar products of the data samples within the ith data cloud; C_N is the number of data clouds; the subscript N means that the local densities are derived from the N observed data samples. It is obvious that, with more data samples observed, the parameters will change and have to be updated regularly. Note that (17) is defined based on the Euclidean distance. The expression of the continuous local density D^L varies with the type of distance used. Nonetheless, in general, the continuous local density of the data can be expressed in the same form as the discrete local density, but in the continuous space.

The continuous local density D^L is defined on the continuous space for each local maximum, per data cloud. Furthermore, we introduce the continuous global density D^G as a weighted sum of the local densities of the data clouds, with the weight being the support (number of data samples) of the respective data cloud. Finally, we introduce the continuous global typicality τ^G based on D^G. The continuous global density and typicality play a similar role to a mixture of pdfs. However, the questions "how many distributions are in the mixture," "what are their parameters," and "what type of distributions" (see Fig. 5) are all answered from the data directly, free from any user- or problem-specific predefined parameters, prior assumptions, knowledge, or preprocessing techniques, as in the cases of clustering, EM, etc.

Fig. 5. Process of extracting distribution from data in EDA.

A. Continuous Global Density

The continuous global density is a mixture that arises simply from the metric of the space used to measure sample distances and the density of the samples that exist in the space. However, it works for all types of distance/similarity metrics. As we can see from (17), the local density is of Cauchy type when the Euclidean distance is employed; therefore, the simplest procedure is to define the continuous global density as a mixture of Cauchy distributions. The continuous global density enables inference for new samples anywhere in the space.

For any x and any type of distance used, we define the continuous global density in a general form, very much like mixture distributions, as a weighted combination of continuous local densities

  D_N^G(x) = \frac{\sum_{i=1}^{C_N} S_{N,i} D_{N,i}^L(x)}{N}; \quad L_N > 1   (18)

where D_{N,i}^L(x) is the local density of x in the ith data cloud; C_N is the number of data clouds at the Nth time instance; S_{N,i} is the support (number of members) of the ith data cloud based on the available experimental/actual data. For normalization, we impose the condition \sum_{i=1}^{C_N} S_{N,i} = N. The continuous global density D^G is defined nonparametrically from each of the modes of the data (D^L); near the peaks it is a very good approximation of D^L, but it will deviate progressively from it in trough regions. As an example, the global density for the same climate dataset used before [19] is presented in Fig. 6(a).

Compared with the discrete local density introduced in Section II, which is discrete and unimodal by definition, D^G is more effective at detecting the natural multimodal data structure, including abnormal data samples, because only the data samples that are close to the larger data clouds, which can be viewed as the main modes of the data patterns, can have high values of the continuous global density. This feature is clearly depicted by the value of D^G of those data samples located in the space between the two main modes in the figures below, while for the local density, see Fig. 1(c) in the supplementary material, it is exactly the opposite case.
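The mixture (17)–(18) is simple to evaluate once the per-cloud triplets are available. A minimal sketch for the Euclidean case follows (function name and argument layout are our own; mu, sigma2, and support are the per-cloud mean, the spread σ² = X − µᵀµ, and the support obtained from the mode-identification procedure):

```python
import numpy as np

def continuous_global_density(x, mu, sigma2, support):
    """Continuous global density D^G of (17)-(18): a support-weighted mixture of
    per-cloud Cauchy-type local densities, Euclidean case. Minimal sketch only."""
    x = np.atleast_2d(x)                                                  # (M, K) queries
    d2 = np.sum((x[:, None, :] - mu[None, :, :]) ** 2, axis=-1)           # (M, C)
    local = 1.0 / (1.0 + d2 / sigma2[None, :])                            # D^L per cloud, (17)
    # (18) divides by N and imposes sum(S) = N, so dividing by support.sum() is equivalent
    return (local * support[None, :]).sum(axis=1) / support.sum()
```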
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

ANGELOV et al.: GENERALIZED METHODOLOGY FOR DATA ANALYSIS 9

n
(a) (b)

Fig. 6. Continuous (a) global density and (b) global typicality of the real climate dataset [19] using Euclidean type distance.

io
Based on (18)–(20), we introduce the normalized continuous In summary, the proposed continuous global typicality has
local density as follows: the following properties, many of which it shares with the
  discrete global typicality introduced in Section III.
 K+1

rs
2  L  K+1 1) Integrates to 1.
D̄LN,i (x) = K+1 DN,i (x) 2 . (22) 2) Provides a closed analytic form.
π 2 σN,i
K
3) No requirement for prior assumptions as well as any
 user or problem-specific threshold and parameters; these
T
Here, σN,i = XN,i − µN,i µN,i for the Euclidean distance. are derived from the data entirely.
Ve
We can, finally, get the expression of the continuous global 4) Can be recursively calculated for various types of
typicality, τ G in terms of the normalized continuous global metrics.
density as
V. A PPLICATIONS
CN
i=1 SN,i D̄N,i (x)
L A. Examples
τNG (x) = CN ∞ In this section, we will give several examples of the contin-
i=1 SN,i −∞ D̄N,i (x)dx
L
  uous global typicality, τ G of different datasets extracted by
 L  K+1
 K+1
2 
CN
DN,i (x) 2 the proposed automatic mode identification algorithm. The
= SN,i . (23) continuous global typicality of the seeds dataset [28] and
π
K+1
N σN,i
K
l

2
i=1 combined cycle power plant dataset [29], and wine quality
na

For the Euclidean distance, (23) becomes dataset [30] with Euclidean distance is presented in Fig. 8. As
the dimensionality of the original datasets is >2, for a better
 
 K+1 visualization, we use the principal component analysis (PCA)
2 
CN
SN,i
τNG (x) = method [31] to reduce the dimensionality and use the first
K+1   K+1 . (24)
π 2 N i=1 K x−µN,i
2 2 two principal components in the figures as the x- and y-axes.
σN,i 1 + Fig. 5(a) and (b) in the supplementary material present the
2 σN,i
τ G derived from the first 1/3 and the first 2/3 the wine qual-
Fi

The continuous global typicality of the real climate dataset ity dataset. Fig. 5(c) in the supplementary material depicts
with Euclidean distance is presented in Fig. 6(b). the τ G derived by scrambling the order of the data sam-
The comparisons between the continuous global typical- ples. The continuous global typicality τ G of 2-D benchmark
ity (the modes are extracted by the approach introduced in datasets A1, S1, and S2 [32] are also presented in Fig. 6 in
Section III), discrete global typicality, histogram (normalized) the supplementary material.
and traditional pdf are presented in 2-D form for visual clarity If we want more details from the continuous global typ-
in Fig. 7 using the same the real climate dataset [19]. icality, we can also stop the automatic mode identification
As shown in Fig. 7, compared with the traditional pdf using algorithm described in Section III early, i.e., before the final
a Gaussian model, the global typicality derived directly from iteration, and build the continuous global typicality based
the dataset without any prior assumption about the number of on more detailed data partitioning results. The video in the
local modes or type of distribution represents very well the supplementary material referred in Section III-B also depicts
two modes in the data pattern and gives results very close to evolution of the global continuous typicality based on the
what a histogram would give and significnatly better to what results of different iteration times of the proposed mode
a single unimodal distribution would provide. identification algorithm.
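For datasets such as those above, the Euclidean-case expression (24) can be evaluated directly from the per-cloud triplets. The following is a hedged sketch (our own function name; mu, sigma, and support are the per-cloud modes, spreads, and supports produced by the mode-identification step):

```python
import numpy as np
from scipy.special import gammaln

def continuous_global_typicality(x, mu, sigma, support):
    """Continuous global typicality tau^G of (24): a support-weighted mixture of
    multivariate Cauchy components centred on the data cloud modes. Minimal sketch."""
    x = np.atleast_2d(x)
    C, K = mu.shape
    N = support.sum()
    # log of the normalisation constant Gamma((K+1)/2) / pi^((K+1)/2)
    log_norm = gammaln((K + 1) / 2.0) - (K + 1) / 2.0 * np.log(np.pi)
    d2 = np.sum((x[:, None, :] - mu[None, :, :]) ** 2, axis=-1)          # (M, C)
    comp = np.exp(log_norm) / (
        sigma[None, :] ** K
        * (1.0 + d2 / sigma[None, :] ** 2) ** ((K + 1) / 2.0)
    )
    return (support[None, :] * comp).sum(axis=1) / N                      # (24)
```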
Fig. 7. Comparison between the continuous global typicality τ^G, the discrete global typicality τ^D, the histogram (normalized), and the traditional pdf. (a) Wind chill (°C). (b) Wind gust (mph).

Fig. 8. Continuous global typicality of the (a) seeds dataset [28], (b) combined cycle power plant dataset [29], and (c) wine quality dataset [30].

B. Inference Primer

Assume there are three arbitrary noninteger values of wind chill data, x = {−7.5; 2.5; 14.7} (°C), which do not exist in the dataset. We can quickly obtain the corresponding continuous global typicality using (19), {τ^G(x)} = {0.0080, 0.0375, 0.0180}, and the inferences made are presented in Fig. 9. Here, we only consider the two main modes. That means that a wind chill of −7.5 °C is less likely, while a wind chill of 2.5 °C is more likely.

Fig. 9. Continuous global typicality τ^G of wind chill data and simple inferences.

In addition, if we want to know the continuous global typicality of all the values larger than t, we can integrate as follows:

  T(x ≥ t) = 1 − \int_{−∞}^{t} τ_N^G(x) dx.   (25)

For example, when the Euclidean distance is used, and considering here only 1-D data for a simpler derivation, (25) can be rewritten as

  T(x ≥ t) = 1 − \int_{−∞}^{t} \frac{\sum_{i=1}^{C_N} S_{N,i} \bar{D}_{N,i}^L(x)}{N} dx = 1 − \frac{\sum_{i=1}^{C_N} S_{N,i} \left( \frac{1}{π} \arctan\left( \frac{t − µ_{N,i}}{σ_{N,i}} \right) + \frac{1}{2} \right)}{N}.   (26)

Let us continue the example in Fig. 9. If we want to know the global continuous typicality of all the data samples above 20 °C, which is the green area of the figure, we can calculate the value using (26) to yield T(x > 20) = 0.2447. That means that the likelihood of a value being equal to or greater than 20 °C is 24.47%. One can see that the continuous global typicality can serve as a form of probability.
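The 1-D tail expression (26) is just a support-weighted mixture of Cauchy CDFs. A minimal sketch follows; the per-cloud numbers at the bottom are purely illustrative placeholders, not the values fitted to the climate data in the paper:

```python
import numpy as np

def tail_typicality(t, mu, sigma, support):
    """T(x >= t) of (26) for 1-D data: one minus the support-weighted mixture of
    Cauchy CDFs, arctan((t - mu_i) / sigma_i) / pi + 1/2. Minimal sketch only."""
    N = support.sum()
    cdf = np.arctan((t - mu) / sigma) / np.pi + 0.5        # per-cloud Cauchy CDF at t
    return 1.0 - np.sum(support * cdf) / N

# hypothetical two-mode wind-chill model (illustrative numbers only)
mu = np.array([2.0, 12.0]); sigma = np.array([4.0, 5.0]); support = np.array([3000.0, 2000.0])
print(tail_typicality(20.0, mu, sigma, support))            # likelihood of >= 20 C
```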
C. Naïve EDA Classifier

In this section, we borrow the concept of the naïve Bayes classifier [1]–[3] and propose a new version of a naïve EDA classifier. In contrast with the original naïve EDA classifier proposed in [14], which relies for inference on the discrete global typicality and linear interpolation and/or extrapolation, the naïve EDA classifier in this paper uses the continuous global typicality instead, which is based on the local modes of the discrete global typicality identified by the automatic procedure described in Section III-B. This procedure is more effective in reflecting the ensemble features of the distribution of the data samples of the different classes in the data space.

As the proposed approach accommodates various types of distance/similarity metrics, one can use the current knowledge in the area to choose the distance measure providing a reasonable approximation that simplifies the processing. Moreover, one can easily change to other distance measures and compare the results obtained by the classifier with different types of measures. For consistency, in the following numerical examples we use the Euclidean distance.

Let us assume H classes at the Nth time instance, where some classes may have many data clouds. The continuous global typicality per class can be defined as (i = 1, 2, ..., H)

  τ_{N,i}^G(x) = \frac{\sum_{j=1}^{W_i} S_{N,i,j} D_{N,i,j}^L(x)}{\sum_{j=1}^{W_i} S_{N,i,j} \int_{−∞}^{∞} D_{N,i,j}^L(x) dx}   (27)

where W_i is the number of data clouds sharing the ith class label, \sum_{i=1}^{H} W_i = C_N; S_{N,i,j} is the support of the jth data cloud having the ith class label; and D_{N,i,j}^L(x) is the corresponding continuous local density.

For any unlabeled data sample x, its label is decided by the following expression:

  label(x) = \arg\max_{j=1,2,...,H} τ_{N,j}^G(x).   (28)

The 2-D plots (wind chill and wind gust) of the continuous global typicality with the Euclidean distance for the real climate dataset are given in Fig. 7 in the supplementary material.

The performance of the proposed naïve EDA classifier is further tested on the following problems.
1) Banknote authentication dataset [33].
2) Pima dataset [34].
3) Climate dataset [19].
4) Pen-based handwritten digits recognition dataset [35].
5) Madelon dataset [36].
6) Optical handwritten digits recognition dataset [37].
7) Occupancy detection dataset [38].

The proposed naïve EDA classifier is compared, in terms of performance, with an SVM classifier with a Gaussian radial basis function kernel and a naïve Bayes classifier. The details of the datasets used in the classification are given in Section B in the supplementary material.

In the experiments, PCA [31] is applied as a preprocessing step to reduce the dimensionality and balance the variances of the datasets. It has to be stressed that PCA is not a part of the proposed method and is not necessary for simpler problems. For the banknote authentication, pima, and climate datasets, we randomly select 70% of the data for training and use the rest for validation. The performance is evaluated after ten Monte Carlo experiments. For the pen-based digits, Madelon, optical digits, and occupancy detection datasets, we train the classifiers with the training sets and conduct the validation with the testing/validation sets.

TABLE I. Classification Performance—3 Principal Components Considered.

TABLE II. Classification Performance—5 Principal Components Considered.

The overall performance of the three classifiers is tabulated in Table I, where we consider the first three principal components for classification. Considering the first five principal components, the overall results obtained by the classifiers are tabulated in Table II. As shown in Tables I and II, the proposed naïve EDA classifier outperforms the SVM classifier and the naïve Bayes classifier in the majority of the numerical examples; the performance of the proposed naïve EDA classifier is the best. In addition, it is worth noting that the classification conducted by the naïve EDA classifier is totally free from unrealistic assumptions, restrictions, or prior knowledge.
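A compact, hedged sketch of the decision rule (27)–(28) is given below. It assumes that the per-class data cloud triplets (mode, spread, support) have already been obtained with the mode-identification procedure, and it reuses the continuous_global_typicality() sketch given earlier in this section; the class name and method layout are our own, not the authors':

```python
import numpy as np

class NaiveEDAClassifier:
    """Rough sketch of the naive EDA classifier of (27)-(28): per class, the continuous
    global typicality is a support-weighted mixture of that class's data cloud densities,
    and prediction picks the class with the largest per-class typicality at x."""

    def __init__(self):
        self.clouds = {}   # class label -> (mu, sigma, support) arrays for its data clouds

    def add_class(self, label, mu, sigma, support):
        self.clouds[label] = (np.asarray(mu, float), np.asarray(sigma, float),
                              np.asarray(support, float))

    def predict(self, X):
        X = np.atleast_2d(X)
        labels = list(self.clouds)
        scores = np.column_stack([
            continuous_global_typicality(X, *self.clouds[c]) for c in labels])   # (27)
        return np.array(labels)[scores.argmax(axis=1)]                           # (28)
```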
VI. CONCLUSION

In this paper, we propose a new systematic approach to derive ensemble properties of data without any prior assumptions about the data sources, the amount of data, or user- or problem-specific parameters. The EDA framework considers the relative position of the data in a metric space only and extracts from the raw experimental discrete observations a series of measures of their ensemble properties, such as the cumulative proximity (q), centrality (C), square centrality (q^{-1}), standardized eccentricity (ε), density (D), as well as typicality (τ). The local and global versions of the typicality (τ and τ^G) are both considered, originally in discrete form and then in continuous form, approximating the actual data-driven discrete estimators by a mixture of local functions. It was demonstrated that, for the case when the distance metric used is Euclidean, the density (both in its discrete form, which exactly describes the actual data, and in its continuous form, which approximates the density over the entire data space) takes the form of a Cauchy function. Importantly, however, this is not an assumption made a priori, but is driven and parameterized by the data and the selected metric. Furthermore, we propose an autonomous algorithm for identifying all local modes/maxima of the global discrete typicality, τ^D, as well as for filtering out the main local maxima based on the 2σ closeness of each local maximum. Finally, we present a number of numerical examples aiming to verify the methodology and demonstrate its advantages. We introduce a new type of classifier, which we call naïve EDA, for investigating the unknown data pattern behind the large amounts of data in a data-rich environment. In conclusion, the proposed EDA framework and methodology provide an efficient alternative that is entirely based on the experimental data and the evidence. It touches the very foundations of data mining and analysis and, thus, has a wide area of applications, especially in the era of big data and data streams, where handcrafting offline methods and making detailed assumptions is often not an option.

Nonetheless, we have to admit that the bottlenecks of the proposed methodology are the lack of theoretical confidence levels for the analysis and of a theoretical notion of reliability and generalization, which are the inherited limitations of nonparametric approaches.

In this paper, we only provide the preliminary algorithms and results on data partitioning, analysis, inference, and classification. As future work, we will focus on developing more advanced algorithms within the EDA framework for various applications in different areas, including, but not limited to, high frequency trading data processing, the foreign currency trading problem, handwritten digits recognition, remote sensing, etc.

REFERENCES

[1] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY, USA: Springer, 2009.
[2] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.
[3] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Chichester, U.K.: Wiley, 2000.
[4] T. Bayes, "An essay towards solving a problem in the doctrine of chances," Philosoph. Trans. Roy. Soc., vol. 53, pp. 370–418, 1763.
[5] J. C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. New York, NY, USA: Springer, 2010.
[6] C. Spearman, "The proof and measurement of association between two things," Amer. J. Psychol., vol. 15, no. 1, pp. 72–101, 1904.
[7] M. G. Kendall, "A new measure of rank correlation," Biometrika, vol. 30, nos. 1–2, pp. 81–93, 1938.
[8] L. A. Goodman and W. H. Kruskal, "Measures of association for cross classifications," J. Amer. Stat. Assoc., vol. 49, no. 268, pp. 732–764, 1954.
[9] P. Del Moral, "Nonlinear filtering: Interacting particle resolution (filtrage non-linéaire par systèmes de particules en interaction)," Comptes Rendus l'Académie des Sci. I Math., vol. 325, no. 6, pp. 653–658, 1997.
[10] L. A. Zadeh, "Fuzzy sets," Inf. Control, vol. 8, no. 3, pp. 338–353, 1965.
[11] M.-Y. Chen and D. A. Linkens, "Rule-base self-generation and simplification for data-driven fuzzy models," Fuzzy Sets Syst., vol. 142, no. 2, pp. 243–265, 2004.
[12] P. P. Angelov, "Anomaly detection based on eccentricity analysis," in Proc. IEEE Symp. Series Comput. Intell., 2014, pp. 1–8.
[13] P. Angelov, "Outside the box: An alternative data analytics framework," J. Autom. Mobile Robot. Intell. Syst., vol. 8, no. 2, pp. 53–59, 2014.
[14] P. Angelov, X. Gu, and D. Kangin, "Empirical data analytics," Int. J. Intell. Syst., 2017, doi: 10.1002/int.21899.
[15] G. Sabidussi, "The centrality index of a graph," Psychometrika, vol. 31, no. 4, pp. 581–603, 1966.
[16] L. C. Freeman, "Centrality in social networks conceptual clarification," Soc. Netw., vol. 1, no. 3, pp. 215–239, 1979.
[17] J. G. Saw, M. C. K. Yang, and T. C. Mo, "Chebyshev inequality with estimated mean and variance," Amer. Stat., vol. 38, no. 2, pp. 130–132, 1984.
[18] P. Angelov, Autonomous Learning Systems: From Data Streams to Knowledge in Real Time. Chichester, U.K.: Wiley, 2012.
[19] Climate Dataset in Manchester. Accessed: Jul. 10, 2017. [Online]. Available: http://www.worldweatheronline.com
[20] S. Nadarajah and S. Kotz, "Probability integrals of the multivariate t distribution," Can. Appl. Math. Quart., vol. 13, no. 1, pp. 53–84, 2005.
[21] C.-Y. Lee, "Fast simulated annealing with a multivariate Cauchy distribution and the configuration's initial temperature," J. Korean Phys. Soc., vol. 66, no. 10, pp. 1457–1466, 2015.
[22] S. Y. Shatskikh, "Multivariate Cauchy distributions as locally Gaussian distributions," J. Math. Sci., vol. 78, no. 1, pp. 102–108, 1996.
[23] A. Corduneanu and C. M. Bishop, "Variational Bayesian model selection for mixture distributions," in Proc. 8th Int. Conf. Artif. Intell. Stat., 2001, pp. 27–34.
[24] P. P. Angelov and X. Gu, "Empirical fuzzy sets," Int. J. Intell. Syst., 2017, doi: 10.1002/int.21935.
[25] P. Angelov and R. Yager, "A new type of simplified fuzzy rule-based system," Int. J. Gen. Syst., vol. 41, no. 2, pp. 163–185, 2011.
[26] A. Okabe, B. Boots, K. Sugihara, and S. N. Chiu, Spatial Tessellations: Concepts and Applications of Voronoi Diagrams, 2nd ed. Chichester, U.K.: Wiley, 1999.
[27] Supplementary Video. Accessed: Jul. 10, 2017. [Online]. Available: https://www.dropbox.com/s/q34iyc6acrx85ou/Video_TCBpaper.wmv?dl=0
[28] Seeds Dataset. Accessed: Jul. 10, 2017. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/seeds
[29] Combined Cycle Power Plant Dataset. Accessed: Jul. 10, 2017. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
[30] Wine Quality Dataset. Accessed: Jul. 10, 2017. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
[31] I. T. Jolliffe, Principal Component Analysis. New York, NY, USA: Springer, 2002.
[32] Clustering Datasets. Accessed: Jul. 10, 2017. [Online]. Available: http://cs.joensuu.fi/sipu/datasets/
[33] Banknote Authentication Dataset. Accessed: Jul. 10, 2017. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/banknote+authentication
[34] Pima Indians Diabetes Dataset. Accessed: Jul. 10, 2017. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
[35] Pen-Based Recognition of Handwritten Digits Dataset. Accessed: Jul. 10, 2017. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
[36] Madelon Dataset. Accessed: Jul. 10, 2017. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Madelon
[37] Optical Recognition of Handwritten Digits Dataset. Accessed: Jul. 10, 2017. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
[38] Occupancy Detection Dataset. Accessed: Jul. 10, 2017. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Plamen P. Angelov (M'99–SM'04–F'16) received the Ph.D. and D.Sc. degrees from the Bulgarian Academy of Sciences, Sofia, Bulgaria, in 1993 and 2015, respectively.
He is a Chair Professor of intelligent systems with the School of Computing and Communications, Lancaster University, Lancashire, U.K. He is also an Honorary Professor with Technical University, Sofia. He holds a wide portfolio of research projects and leads the Data Science Group at Lancaster University.
Dr. Angelov was a recipient of various awards and is internationally recognized for his pioneering results on online and evolving methodologies and algorithms for knowledge extraction in the form of human-intelligible fuzzy rule-based systems and autonomous machine learning. He is the Editor-in-Chief of the Evolving Systems journal (Springer) and an Associate Editor of the IEEE TRANSACTIONS ON FUZZY SYSTEMS, the IEEE TRANSACTIONS ON CYBERNETICS, and several other journals. He is the Vice President of the International Neural Networks Society, a member of the Board of Governors of the IEEE Systems, Man, and Cybernetics Society, and a Distinguished Lecturer of the IEEE.

Xiaowei Gu received the B.E. and M.E. degrees from Hangzhou Dianzi University, Hangzhou, China. He is currently pursuing the Ph.D. degree in computer science with Lancaster University, Lancashire, U.K.

José C. Príncipe (F'00) received the bachelor's degree in electrical engineering from the University of Porto, Porto, Portugal, and the master's and Ph.D. degrees from the University of Florida, Gainesville, FL, USA.
He is a Distinguished Professor of electrical and computer engineering with the University of Florida, Gainesville, FL, USA, where he is also the Eckis Professor and the Founding Director of the Computational NeuroEngineering Laboratory. His current research interests include advanced signal processing with information theoretic criteria (entropy and mutual information), adaptive models in reproducing kernel Hilbert spaces, and the application of these advanced algorithms in brain–machine interfaces.
Dr. Príncipe was a recipient of the IEEE EMBS Career Award and the IEEE Neural Network Pioneer Award. He is the past Editor-in-Chief of the IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, the past Chair of the Technical Committee on Neural Networks of the IEEE Signal Processing Society, and the past President of the International Neural Network Society. He is a fellow of the International Academy of Medical and Biological Engineering and the American Institute for Medical and Biological Engineering.