Geometrical and topological approaches to Big Data

Václav Snášel (a), Jana Nowaková (a,∗), Fatos Xhafa (b), Leonard Barolli (c)
(a) Department of Computer Science, Faculty of Electrical Engineering and Computer Science, VŠB - Technical University of Ostrava, 17. listopadu 15/2172, 708 33 Ostrava - Poruba, Czech Republic
(b) Department of Computer Science, Technical University of Catalonia, C/Nord, Omega Bld, C/Jordi Girona 1-3, 08034 Barcelona, Spain
(c) Department of Information and Communication Engineering, Faculty of Information Engineering, Fukuoka Institute of Technology (FIT), 3-30-1 Wajiro-higashi, Higashi-ku, Fukuoka 811-0295, Japan

• An overview of the state of the art in geometrical and topological approaches to Big Data.
• Trends in geometrical and topological approaches to Big Data.
• Big Data visualization.
• Discussion of current techniques and future trends to address the applications.

Article history:
Received 6 March 2016
Received in revised form 25 May 2016
Accepted 6 June 2016

Keywords:
Big Data
Industry 4.0
Topological data analysis
Persistent homology
Dimensionality reduction
Big Data visualization

Modern data science uses topological methods to find the structural features of data sets before further
supervised or unsupervised analysis. Geometry and topology are very natural tools for analysing massive
amounts of data since geometry can be regarded as the study of distance functions. Mathematical
formalism, which has been developed for incorporating geometric and topological techniques, deals
with point cloud data sets, i.e. finite sets of points. It then adapts tools from the various branches
of geometry and topology for the study of point cloud data sets. The point clouds are finite samples
taken from a geometric object, perhaps with noise. Topology provides a formal language for qualitative
mathematics, whereas geometry is mainly quantitative. Thus, in topology, we study the relationships of
proximity or nearness, without using distances. A map between topological spaces is called continuous if
it preserves the nearness structures. Geometrical and topological methods are tools allowing us to analyse
highly complex data. These methods create a summary or compressed representation of all of the data
features to help to rapidly uncover particular patterns and relationships in data. The idea of constructing
summaries of entire domains of attributes involves understanding the relationship between topological
and geometric objects constructed from data using various features.
A common thread in various approaches for noise removal, model reduction, feasibility reconstruction,
and blind source separation, is to replace the original data with a lower dimensional approximate
representation obtained via a matrix or multi-directional array factorization or decomposition. Besides
those transformations, a significant challenge of feature summarization or subset selection methods for
Big Data will be considered by focusing on scalable feature selection. Lower dimensional approximate
representation is used for Big Data visualization.
The cross-field between topology and Big Data will bring huge opportunities, as well as challenges,
to Big Data communities. This survey aims at bringing together state-of-the-art research results on
geometrical and topological methods for Big Data.
1. Introduction

Big Data is everywhere as high volumes of varieties of valuable

of social, medical, scientific and engineering data has been driven by our need for fundamental understanding of the processes which produce this data. It is predicted that the volume of produced data could reach 44 zettabytes in 2020 [1]. The enormous volume and complexity of this data propel technological advancements realized as exponential increases in storage capability, processing power, bandwidth capacity and transfer velocity. This is partly because of new experimental methods, and partly because of the increase in the availability of high-powered computing technology. Massive amounts of data (Big Data) are too complex to be managed by traditional processing applications. Nowadays, Big Data includes the huge, complex, and abundant structured and unstructured data that is generated and gathered from several fields and resources. The challenges of managing massive amounts of data include extracting, analysing, visualizing, sharing, storing, transferring and searching such data. Currently, traditional data processing tools and their applications are not capable of managing Big Data. Therefore, there is a critical need to develop effective and efficient Big Data processing techniques. Big Data has five characteristics: volume, velocity, variety, veracity and value [2]. Volume refers to the size of the data for processing and analysis. Velocity relates to the rate of data growth and usage. Variety means the different types and formats of the data used for processing and analysis. Veracity concerns the accuracy of results and analysis of the data. Value is the added value and contribution offered by data processing and analysis.

Modern data science uses so-called topological methods to find the structural features of data sets before further supervised or unsupervised analysis. Geometry and topology are very natural tools for analysing massive amounts of data since geometry can be regarded as the study of distance functions. Besides the heterogeneity of distance functions, another issue is related to distance functions on large finite sets of data. The mathematical formalism which has been developed for incorporating geometric and topological techniques deals with point clouds, i.e. finite sets of points equipped with proximity or nearness or distance functions [3,4]. It then adapts tools from the various branches of geometry and topology for the study of point clouds [5]. The point clouds are finite samples taken from a geometric object, perhaps with noise.

Geometrical and topological methods are tools for analysing highly complex data [3]. These methods create a summary or a compressed representation of all of the data features to help rapidly uncover patterns and relationships in data. The idea of constructing summaries of entire domains of parameter values involves understanding the relationship between geometric objects constructed from data using various parameter values, e.g. [8].

One highly topical problem with Big Data analysis is that the currently used methodology cannot be applied. That methodology creates a model, simulates it, and then assesses whether the original data corresponds to the data obtained from the model (model verification). This process is useful and appropriate for solving classic problems such as physical problems, because the theoretical background of these problems has been researched and understood well enough to be reconstructed in the model. For Big Data processing, the first problem is that we are not able to define a concrete hypothesis about a data feature which could be tested. For this reason, the approach used for classic physical problems cannot be used for Big Data. Therefore, the main aim of the research is not to define a model, but to be able to mine interesting features of Big Data sets accurately and automatically. In many cases, the data to be examined is based on shapes that are not easy to capture using traditional methods [9].

A common thread in various approaches for noise removal, model reduction, feasibility reconstruction, and blind source separation, is to replace the original data with a lower dimensional approximate representation obtained via a matrix or multi-directional array factorization or decomposition. Besides those transformations, a significant challenge of feature summarization or subset selection methods for Big Data will be considered by focusing on scalable feature selection. A lower dimensional approximate representation is used for Big Data visualization, so that the data can be visualized in an understandable form. This approach, dimensionality reduction, can also be understood as a method for feature compression, see Fig. 6.

The paper is organized as follows: in Section 2, a brief introduction to Big Data technologies is given. In Section 3, a brief motivational example is presented. A short mathematical background is introduced in Section 4. This part contains a brief review of topology, metric space, homology and persistent homology theory, manifolds and Morse theory. In Section 5, a brief review of homology and persistent homology theory is introduced. Various applications of geometrical and topological methods are presented in Section 6. Big Data visualization is discussed in Section 7. In this section, we discuss methods to create a summary or compressed representation of all of the data features to help visualize hidden relationships in data. This is followed by a section that describes and introduces new, prospective Big Data challenges. The paper ends with conclusions in Section 9.

2. Big Data technologies during time

The manner in which data is stored, transmitted, analysed and visualized has varied over time; the rise of all fields of human activity is always connected with an increase in technological possibilities, as well as with the political situation, the development of the socio-economic arrangement and industry. In 1936, after Social Security became law, Franklin D. Roosevelt's administration in the USA ordered from IBM the development of a punch card-reading machine to be able to collect data from all Americans and employers. This biggest accounting operation of all time, as it was called at that time, can be considered the first major data project [10–13]. As already mentioned, the political situation always has a big influence on the rise of technology, and the main mover of its development has always been war and money. During World War II, in 1943, the British invented a machine, Colossus, to decipher German codes. The device, which searched for patterns in encrypted messages at a rate of 5000 characters per second, is known as the first data-processing machine [14,10].

Big Data as a term has been one of the biggest trends in recent years, leading to an increase in research, as well as industry and government applications [15–17]. The continued improvements in high-performance computing and high-resolution sensing capabilities have resulted in data of unprecedented size and complexity. Data is deemed a powerful raw material that can impact multidisciplinary research.

2.1. Data storage

We face a wave of data; the amount of data is so big that a lot of information is never looked at by anybody [18]. The next problematic aspect of data is that a big part of it is redundant. For example, one video stored in many existing video formats and resolutions, with subtitles in many languages [19], takes up a lot of space; this is necessary from an informational point of view but generally does not bring anything new. The manner in which data is stored has changed: what was sufficient in 1965, when the US Government decided to found the first data centre to store 175 million sets of fingerprints and 742 million tax returns on magnetic computer tape [20], is nowadays unusable. Traditionally, persistent data is still stored using hard
V. Snášel et al. / Future Generation Computer Systems ( ) – 3

disk drives (HDD) [21], with all their disadvantages, such as bounds on access times and a lifetime limited by mechanical (moving) parts, alongside DRAM (volatile memory) with faster access. The trend is to replace HDDs with solid-state drives (SSD) as a type of non-volatile memory (NVM) [22,23]. Other types of NVM, which are now also on the rise, are phase-change memory (PCM) and memristors. These will be integrated as byte-addressable memory on a memory bus or stacked directly on a chip (3D-stacking) [24]. All existing storage architectures, such as storage area networks (SAN), network-attached storage (NAS) and direct-attached storage (DAS), were ordinarily used before large-scale distributed systems were required and the aforementioned architectures met their limitations [22,23].

2.2. Data transmission

Cloud computing and cloud data storage are, nowadays, very popular. Users do not have time and do not want to maintain data storage and computing hardware, so the easiest way is to send data to the cloud [25]. However, this modern technology also has its limits: communication capacity and security [26,23]. Cloud computing is still considered a hot trend.

2.3. Data processing/analysis

The next question is not where to store data, but how to store it and what platform to use to analyse it. The classical approach to managing structured data is divided into two parts: the first is to store the data set, and the second is a related database for retrieval of stored data. Large-scale structured data set management is often based on data warehouses and data marts, which are both Structured Query Language (SQL) based. SQL is reliable and straightforward, and analytic platforms such as Cloudera Impala and SQLstream run on it [23]. Moreover, recently, the Not Only SQL (NoSQL) database approach is often used in order to avoid using a Relational Database Management System (RDBMS) [27]. The most popular management systems using NoSQL databases are HBase, Apache Cassandra, SimpleDB, Google BigTable, Apache Hadoop, MapReduce, MemcacheDB and Voldemort [23].

The analytic methods of Big Data are still under investigation. To help deal with Big Data, cloud computing, and then granular computing, biological computing systems, and quantum computing are under consideration [23].

3. Motivation examples

The general problem with statistical physics is the following: given a large collection of atoms or molecules, given the interaction laws among the constituents of this collection of particles, and given the laws of dynamic evolution, how can we predict the macroscopic physical properties of matter composed of these atoms or molecules?

The typical feature-based model [28,29] looks for the most extreme examples of a phenomenon and represents the data by these examples, but this model is not appropriate for describing a large system. A solution to this problem in statistical physics is based on feature summarization or subset selection methods. It rests on the fact that, for a very large n, the volume of an n-dimensional figure is concentrated near its surface [30,31].

It is not hard to see that the volume of an n-dimensional ball of radius d should be expressed by the formula $V_n d^n$, where $V_n$ is constant and does not depend on d. For example, the volume of a spherical ring between spheres of radius 1 and $1 - \epsilon$ equals

$$V_n \left[ 1 - (1 - \epsilon)^n \right], \qquad (1)$$

which, for a fixed and arbitrarily small $\epsilon$, but increasing n, approaches $V_n$. A 20-dimensional watermelon with a radius of 20 cm and skin with a thickness of 1 cm is nearly two-thirds skin; with $\epsilon = 1/20$ and $n = 20$, the fraction of the volume in the skin is

$$1 - \left( 1 - \frac{1}{20} \right)^{20} \approx 1 - e^{-1}. \qquad (2)$$

This circumstance plays a significant role in statistical mechanics. Consider, for example, the simplest model of gas in a reservoir consisting of n atoms, which we shall assume are material points with mass 2 (in an appropriate system of units). We represent the instantaneous state of the gas by n three-dimensional vectors $(v_1, \ldots, v_n)$ of the velocities of all atoms in the physical Euclidean space; that is, by a point in the 3n-dimensional coordinate space $\mathbb{R}^{3n}$. The squared length of this vector in $\mathbb{R}^{3n}$ has a direct physical interpretation as the energy of the system (the sum of the kinetic energies of the atoms)

$$E = \sum_{i=1}^{n} |v_i|^2. \qquad (3)$$

For a macroscopic volume of gas under normal conditions, n is of the order of $10^{23}$ (Avogadro's number), so that the state of the gas is described by a point on a sphere of enormous dimension, whose radius is the square root of the energy.

We may conclude that a model of a large system (Big Data) must be based on feature summarization or compression or subset selection methods.

The increasing amount of VoIP, social media, and sensor data [6,7] emphasizes the need for methods to deal with the uncertainty inherent in these data sources. Currently, about 80% of data is uncertain, see Fig. 1. We can face the problem of uncertainty via the application of topological methods. The number of components or holes does not change under small perturbations of the data. This is vital in applications where data is very uncertain.

Fig. 1. Big Data source with depicted data uncertainty [6,7].

4. Mathematical background

In this section, we summarize the theoretical concepts which are necessary for Big Data processing as presented in the rest of the paper.

4.1. Topology

A topological space [32–34] is a set of points along with a topology; that is, a collection of subsets that are referred to as open sets. Intuitively, a set U is open if, starting from any point in U and going in any direction, it is possible to move a little and still stay inside the set. It turns out that the notion of an open set provides a fundamental way to speak about the nearness of points, although without explicitly having a concept of distance defined in the considered topological space. Thus, once a topology has been defined, we are allowed to introduce properties such as continuity,

Fig. 2. Spaces [37] (depicted on the Hercules constellation): (a) space; (b) topological space; (c) metric space.

connectedness, and closeness, which are all based on some notion of nearness.
A topological space is a set $X$ and a set $\tau$ of subsets of $X$ satisfying the following axioms:
• $\emptyset$ and $X$ are in $\tau$,
• if $U_1, U_2, \ldots, U_n$ are in $\tau$, then so is $\bigcap_{i=1}^{n} U_i$,
• if $U_i$, $i \in I$, are in $\tau$, then so is $\bigcup_{i \in I} U_i$.
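For a finite set X, the axioms above can be checked mechanically. The following Python sketch is only an illustration and not part of the cited formalism: the function name is_topology is hypothetical, and the check assumes that X is finite, in which case closure under pairwise intersections and unions already implies closure under finite intersections and arbitrary unions.

```python
from itertools import combinations


def is_topology(X, tau):
    """Return True if tau is a topology on the finite set X.

    Hypothetical helper for illustration; assumes X is finite.
    """
    X = frozenset(X)
    opens = {frozenset(U) for U in tau}
    # Axiom 1: the empty set and X itself are open.
    if frozenset() not in opens or X not in opens:
        return False
    # Every open set must be a subset of X.
    if any(not U <= X for U in opens):
        return False
    # Axioms 2 and 3: on a finite family, pairwise closure is enough.
    for U, V in combinations(opens, 2):
        if (U & V) not in opens or (U | V) not in opens:
            return False
    return True


X = {1, 2, 3}
nested = [set(), {1}, {1, 2}, {1, 2, 3}]      # a chain of open sets
broken = [set(), {1}, {2}, {1, 2, 3}]          # {1} ∪ {2} is missing
print(is_topology(X, nested), is_topology(X, broken))
```

The second family fails the union axiom because {1} and {2} are open while {1, 2} is not.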
A map f between topological spaces is said to be continuous if
the inverse image of every open set is an open set. A homeomor-
phism is a continuous bijection whose inverse is also continuous.
Two topological spaces (X , τX ), (Y , τY ) are said to be homeomor-
phic if there exists a homeomorphism f : X → Y . From the view-
point of topology, homeomorphic spaces are essentially identical.
Properties of topological space which are preserved up to homeo-
morphisms are said to be topological invariants.
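The definitions of continuity and homeomorphism can likewise be tested directly on finite spaces. The sketch below is a minimal illustration under stated assumptions: maps are represented as Python dictionaries defined on all of X, topologies as lists of sets, and the names preimage, is_continuous and is_homeomorphism are hypothetical.

```python
def preimage(f, V, X):
    """Inverse image of the set V under the map f (a dict defined on X)."""
    return frozenset(x for x in X if f[x] in V)


def is_continuous(f, X, tau_X, tau_Y):
    """f is continuous if the inverse image of every open set is open."""
    opens_X = {frozenset(U) for U in tau_X}
    return all(preimage(f, V, X) in opens_X for V in tau_Y)


def is_homeomorphism(f, X, tau_X, Y, tau_Y):
    """A continuous bijection whose inverse is also continuous."""
    values = set(f.values())
    if len(values) != len(X) or values != set(Y):
        return False  # not a bijection from X onto Y
    inverse = {y: x for x, y in f.items()}
    return (is_continuous(f, X, tau_X, tau_Y)
            and is_continuous(inverse, Y, tau_Y, tau_X))


X = {0, 1}
discrete = [set(), {0}, {1}, {0, 1}]
sierpinski = [set(), {1}, {0, 1}]
identity = {0: 0, 1: 1}

# The identity from the discrete space to the Sierpinski space is a
# continuous bijection, but its inverse is not continuous.
print(is_continuous(identity, X, discrete, sierpinski))
print(is_homeomorphism(identity, X, discrete, X, sierpinski))
```

The example shows why the definition requires the inverse to be continuous as well: a continuous bijection alone does not make two spaces topologically identical.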
The notion of a metric is a straightforward generalization of Euclidean distance through the three properties listed below. Given a nonempty set $X$, we say that a mapping $d : X \times X \to \mathbb{R}$ is a metric if it satisfies the following properties:
• for all points x and y, $d(x, y) \geq 0$, and $d(x, y) = 0$ if and only if $x = y$,
• for all points x and y, $d(x, y) = d(y, x)$,
• for all points x, y and z, $d(x, y) + d(y, z) \geq d(x, z)$.
The pair $(X, d)$ is called a metric space. If the metric d is understood from the context, we will often refer to X as being a metric space. A systematic description of metrics has been given by Deza [35,36].

Fig. 2 shows how a point cloud can be equipped with a nearness structure (topological space) and a distance structure (metric space).

4.2. Manifolds

The natural, higher-dimensional analogue of a surface is an n-dimensional manifold, which is a topological space with the same local properties as Euclidean n-space. Because they frequently occur and have applications in many other branches of mathematics, manifolds are certainly one of the most important classes of topological spaces.

A topological manifold is a space M locally homeomorphic to $\mathbb{R}^n$. That is, there is a cover $\{U_\alpha\}$ of M by open sets along with maps $\varphi_\alpha : U_\alpha \to \mathbb{R}^n$ that are homeomorphisms onto their images. Each pair $(U_\alpha, \varphi_\alpha)$ is called a chart, and the collection $A = \{(U_\alpha, \varphi_\alpha)\}$ is called an atlas. Such a local homeomorphism is called a coordinate system on $U_\alpha$ and enables the identification of any point $u \in U_\alpha$ with an n-tuple of $\mathbb{R}^n$. M is an n-dimensional manifold with a boundary if every point has a neighbourhood homeomorphic to an open set of either $\mathbb{R}^n$ or the half-space $\{u = (u_1, \ldots, u_n) \in \mathbb{R}^n \mid u_n \geq 0\}$.

Suppose that $(U_\alpha, \varphi_\alpha)$ and $(U_\beta, \varphi_\beta)$ are two charts for a manifold M such that $U_\alpha \cap U_\beta$ is non-empty. The transition map

$$\tau_{\alpha,\beta} : \varphi_\alpha(U_\alpha \cap U_\beta) \to \varphi_\beta(U_\alpha \cap U_\beta), \qquad (4)$$

is the map defined by

$$\tau_{\alpha,\beta} = \varphi_\beta \circ \varphi_\alpha^{-1}. \qquad (5)$$

Note that since $\varphi_\alpha$ and $\varphi_\beta$ are both homeomorphisms, the transition map $\tau_{\alpha,\beta}$ is also a homeomorphism, see Fig. 3. Depending on the type of the transition functions (e.g., smooth, analytic, piecewise smooth, Lipschitz), the manifold is named accordingly (e.g. smooth manifold, analytic manifold, etc.). A compact manifold is a manifold that is compact as a topological space. A closed manifold is a compact manifold without a boundary. An important property of a manifold concerns orientability. A manifold M is called orientable if there exists an atlas $A = \{(U_i, \varphi_i)\}$ on it such that the Jacobian determinant of every transition function $\tau_{i,j}$ from one chart to another is positive for all intersecting pairs of regions. Manifolds that do not satisfy this property are called non-orientable. We prefer here to skip the technicalities needed to formally define such a notion, referring the reader to [38,39] for further details.

Fig. 3. Charts on a manifold.

4.3. Algebraic topology

The approach adopted by algebraic topology is the translation of topological problems into an algebraic language, to solve them more easily. There are classic resources on algebraic topology [40–43]. These resources are written without high-level formalism.

In persistent homology, we ultimately want to compare topological spaces based on the characteristic holes that they encompass. Because we usually operate with finite point clouds in data analysis, we first need to discretize the space to add the notion

Fig. 4. Čech (lower left) and Rips (lower right) complex built on the fixed set of points (upper left) with the depiction of the formation of the both complexes.

of connectivity. That is done through the creation of simplicial complexes. A p-simplex $\sigma$ is the convex hull of $p + 1$ affinely independent points $x_0, x_1, \ldots, x_p \in \mathbb{R}^d$ [44]. More intuitively, a 0-simplex is a vertex, a 1-simplex is an edge, a 2-simplex is a triangle, a 3-simplex is a tetrahedron, and so forth. A simplicial complex K is a finite set of simplices such that, for $\sigma \in K$, all of its faces are also in K.

The core idea of persistent homology is to analyse how holes appear and disappear as simplicial complexes are created. To do this, a filtration is constructed. An increasing sequence of $\epsilon$ values, i.e., distance values, produces a filtration, such that a simplex enters the sequence no earlier than all its faces.

The Vietoris–Rips complex, Fig. 4, is one of the most popular complexes in persistent homology. For a non-negative real number $\epsilon$, the Vietoris–Rips complex $V(K, \epsilon)$ of a finite point set K at scale $\epsilon$ is defined as follows:

$$V(K, \epsilon) = \{\sigma \subseteq K \mid d(x, y) \leq \epsilon \text{ for all } x, y \in \sigma\}. \qquad (6)$$

For $\epsilon \leq \epsilon'$, we have $V(K, \epsilon) \subseteq V(K, \epsilon')$, so considering the different values of the scale $\epsilon$ yields a filtered simplicial complex. The dimension of the Vietoris–Rips complex is bounded only by the size of K; therefore, in practice, it is necessary to put a limit on the dimension of the simplices that one allows in the construction of the Vietoris–Rips complex.

The Čech complex, Fig. 4, is defined as the set of simplices whose vertices have $\epsilon/2$-ball neighbourhoods with a common point of intersection.

A typical case is the construction of algebraic structures to describe topological properties, which is the core of homology theory, one of the main tools of algebraic topology. In [45], persistent homology is presented as a new approach to the topological simplification of Big Data via measuring the lifetime of internal topological features during a filtration process. This approach was assessed as being exploitable in many scientific and engineering applications. In [46], a broad view is given of the theory of persistence, including its topological and algorithmic aspects, and an elaboration on its connections to quiver theory on the one hand and to data analysis on the other. This book also contains many open problems in topological data analysis.

Another concept related to persistence is the Survival Signature [47]. This concept has become a popular tool for the analysis and assessment of system reliability. Samaniego introduced this topic in [48]. However, signatures are applicable only in systems with a single type of component, as all components have to be characterized as exchangeable random quantities. This work belongs to the theoretical part of research, so its practical usage for real systems is limited, because real systems tend to have components of multiple types. Signatures cannot be used for analysing the reliability of networks, because networks contain at least two different kinds of components: links and nodes.

4.4. Morse theory

In this section, we report some classical results in Morse theory [49], which constitutes the essential mathematical root for Reeb graphs.

Morse theory can be seen as the investigation of the relation between functions defined on a manifold and the shape of the manifold itself. The key feature of Morse theory is that information on the topology of the manifold is derived from the information about the critical points of real functions defined on the manifold. Morse theory is a means of relating the global features of (in the classical setting) a Riemannian manifold M with the local features of critical points of smooth $\mathbb{R}$-valued functions on M. Recall that $h : M \to \mathbb{R}$ is Morse if all critical points of h are non-degenerate, in the sense of having a non-degenerate Hessian matrix of second partial derivatives. Denote by $Cr(h)$ the set of critical points of h. For each $p \in Cr(h)$, the Morse index of p, $\mu(p)$, is defined as the number of negative eigenvalues of the Hessian at p. The theory identifies the points at which level sets of the function undergo topological changes, and it relates these points via a complex. In particular, Morse theory provides the mathematical background underlying several descriptors, such as Reeb graphs, size functions, persistence diagrams and Morse shape descriptors. For a detailed overview of Morse theory, see [50,51].

Let $h : M \to \mathbb{R}$ be a continuous function defined on a domain M. For each scalar value $a \in \mathbb{R}$, the level set $h^{-1}(a) = \{x \in M \mid h(x) = a\}$ may have multiple connected components. The Reeb graph (Fig. 5) of h, denoted by $R_h(M)$, is obtained by continuously identifying every connected component in a level set to a single point. In other words, $R_h(M)$ is the image of a continuous surjective map $\Phi : M \to R_h(M)$, where $\Phi(x) = \Phi(y)$ if, and only if, x and y come from the same connected component of a level set of h. For a detailed overview of Reeb graphs, see [39].
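To make the link between critical points and persistence diagrams concrete, the following Python sketch computes 0-dimensional persistence pairs for the sublevel sets of a function sampled on a line of points. It is a minimal illustration under stated assumptions, not an implementation taken from the cited works: the function name sublevel_persistence is hypothetical, the domain is assumed to be a path graph (each sample is adjacent to its left and right neighbour), and components are merged with a union-find structure using the usual elder rule, in which the younger component dies when two components meet.

```python
def sublevel_persistence(values):
    """0-dimensional persistence pairs (birth, death) of a sampled function.

    Hypothetical helper: vertices of a path graph enter the sublevel set in
    order of increasing value; a component is born at a local minimum and
    dies when it merges with an older component at a local maximum.
    """
    n = len(values)
    parent = [None] * n          # None: vertex not yet in the sublevel set
    birth = {}                   # birth value stored at each vertex

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    pairs = []
    for i in sorted(range(n), key=lambda k: values[k]):
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component with the larger birth value dies.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:     # skip zero-length pairs
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    # The component of the global minimum never dies.
    pairs.append((min(values), float("inf")))
    return sorted(pairs)


samples = [1, 3, 0, 2, 0.5, 4]
print(sublevel_persistence(samples))
```

In this example the three local minima (0, 0.5 and 1) each give birth to a component; the two younger ones die at the local maxima 2 and 3 where they merge with the component of the global minimum.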

Fig. 5. Morse function and Reeb graph on the 2-sphere.

5. Topological data analysis

Geometry is understood and used mainly as quantitative mathematics, while topology, on the other hand, provides a formal language for a qualitative approach. In topology, the relationships of nearness or proximity are studied, but without using distances. A map between topological spaces is called continuous if the nearness structures are retained. In algebra, we study maps that preserve product structures, for example, group homomorphisms between groups. One of the largest areas of growth in pure mathematics this century has been the solution of topological problems by casting them into a simpler form using groups. This theory is called algebraic topology and, like analytical geometry and differential geometry before it, it has considerable interplay with some of the most fundamental ideas in computer science.

Topological data analysis aims to provide additional tools for analysing data sets that appear in engineering and science. The goal is not to replace current techniques, because these techniques still supply an additional and powerful approach for mining intuitive (as well as not-so-intuitive) features in data collections. The proposed approaches focus on the shape of the data and can be applied to data sets of high dimension.

As computational topology has progressed, we are now able to deduce topological invariants from data. The input of these procedures is often in the form of a point cloud, regarded as possibly noisy observations from an unknown lower-dimensional set whose topological features, which could have information potential, were lost during a sampling procedure. Sampling data is a way to obtain sublinear algorithms. Sublinear algorithms are a recent development in theoretical computer science, statistics, and discrete mathematics, which address the mathematical problem of understanding global features of a data set using limited resources. Often enough, to determine important input features, one does not need to actually look at the entire input. The field of sublinear algorithms [16] makes precise the circumstances when this is possible and combines discrete mathematics and algorithmic techniques with a comprehensive set of statistical tools to quantify errors and give trade-offs with sample sizes. The output is a collection of data summaries that are used to estimate the topological features of the data collection. There are software packages for computing topological invariants from data [52–56].

By using homology, the features of a topological space such as an annulus, sphere, torus, complicated surface or manifold can be measured. Homology makes it possible to differentiate spaces from one another by quantifying their connected components, trapped volumes, topological circles, etc. A finite set of data points can be seen as a (noisy) sampling from an underlying topological space. The homology of the data can be measured by connecting proximate data points, varying the scale at which these connections are made, and finding the features of the data that persist across scales. This approach is called persistent homology, and it is considered to be the most useful and helpful method for finding the topological structure of a discrete data set. Persistent homology has found its place in various application areas for its ability to discover the topological structure of data.

Persistent homology for data analysis has been studied by many researchers in mathematics and computer science, e.g. Carlsson [3], Edelsbrunner and Harer [39], Ghrist [57], Oudot [46] and Zomorodian [58,37].

To guide the reader through the range of approaches and frameworks reviewed here, we must first introduce the basic notions of mathematical concepts such as topological space, manifold, map, metric and transformation. We also provide a brief overview of algebraic topology.

How do we find the topological structure of data sets? A technology called persistent homology analysis was proposed to solve this problem [3,5,57]. The topological structure of data sets is now one of the major areas where mathematicians and computer scientists have focused considerable attention. The geometric structure of massive amounts of data, or Big Data, will be critical in data analysis. We predict that there will be many more new findings in theory and practice.

Historically, geometrical and topological techniques have been deployed as independent alternatives in the analysis of a variety of data types. However, the continuing increases in size, dimensionality, number of variables, and uncertainty create new challenges that traditional approaches cannot address. New methods based on geometrical and topological techniques are needed to support the management, analysis and visualization of Big Data [59,3,39,57,58,37].

An essential part of Big Data processing is the need for different types of users to apply visualizations [59–61] to understand a result of Big Data processing. Recently, it became apparent that a large number of the most interesting structures and world phenomena could be described by networks. Developing a theory for very large networks is a significant challenge in Big Data research [62]. Big Data is one of the main science and technology challenges of today.

In mathematical science, homology is a general procedure to associate a sequence of abelian groups or modules to a given topological space and/or manifold [39,63]. The idea of homology dates back to Euler and Riemann, although the homology class was first rigorously defined by Henri Poincaré, who built the foundation of modern algebraic topology. The topological structure of a given manifold can be studied by defining the different dimensional homology groups on the manifold such that the bases of the homology groups are isomorphic to the bases of the corresponding topological spaces. From a computational point of view, we can approximate the given manifold using a triangulated simplicial complex, on which homology groups can be further defined. There exist some methods, such as Delaunay triangulation, which can be used for the triangulation of a manifold or topological space, and there are many triangulation software packages, such as TetGen and CGAL. The Cartesian representation is one of the most important approaches to scientific computing. For this reason, homology analysis based on a cubical complex has been a popular field for researchers in recent years. Kaczynski et al. described homology analysis in the cubical complex very systematically in [64].

6. Application of computational geometry and topology

Persistent homology creates a multiscale representation of topological structures via a scale parameter relevant to topological events [65–68]. In the past decade, persistent homology has
V. Snášel et al. / Future Generation Computer Systems ( ) – 7

been developed as an efficient computational tool for the characterization and analysis of
topological features in large data sets [65,68,69]. In persistent homology analysis, persistent
homology can be maintained continuously, despite the filtration process, over a range of spatial
scales. Compared to traditional computational topology [70–72] and/or computational homology
(which results in truly metric-free or coordinate-free representations), persistent homology by
its nature exhibits one additional dimension: the filtration parameter. This additional parameter
is used to build crucial geometric or quantitative information into the topological invariants,
so that the birth and death of isolated components, cavities, circles, loops, pockets, rings or
voids at all geometric scales can be defined by topological measurements. For the visualization
of topological persistence [73], a barcode representation has been proposed, in which horizontal
line segments, or bars, are used to interpret the persistence of the topological features.

Efficient computational algorithms, such as the pairing algorithm [74,75], Smith normal form
[39,68] and Morse reduction [65,76,77], have been proposed to track topological variations during
the filtration process [78,72]. Some of these persistent homology algorithms have been
implemented in software packages, namely Perseus [53], JavaPlex [55] and Dionysus [56]. In [79],
guidelines are provided for the computation of persistent homology, together with a good
introduction to writing one's own implementation.

In the past few years, persistent homology has been applied to image analysis [80], image
retrieval [81], chaotic dynamics verification [64], sensor networks [82], complex networks [83],
data analysis [3,84–86], computer vision [87,88], shape recognition [89] and computational
biology [90].

Advances in medicine, particularly in genetic engineering, have increased the amount of
genome-wide gene expression data, but the number of pattern recognition methods useful in this
area is still insufficient. Finding interesting and adequate patterns in such large amounts of
noisy data remains a big challenge. In [91], a new approach to pattern detection in gene
expression data is presented.

Recovering or inferring a hidden structure from discrete samples is a basic problem in data
analysis, omnipresent in a wide range of applications. Data are often high-dimensional; to
understand them and find the interesting information hidden in them, it is necessary to
approximate them with a low-dimensional or even one-dimensional space, because many important
aspects of data are often intrinsically low-dimensional. Morse theory and Reeb graphs address a
simple but significant scenario, in which the hidden space has a graph-like geometric structure,
such as the branching filamentary structures formed by blood vessels.

In [92], a straightforward and efficient algorithm is presented to approximate the Reeb graph
R_f(M) of a map f : M → R from point data sampled from a smooth and compact manifold M.

In [93], an overview of the mathematical properties of Reeb graphs is given. In [94], the authors
introduced a framework to extract, as well as to simplify, a one-dimensional skeleton from
unorganized data using the Reeb graph. They applied the proposed algorithm to molecular
simulation. The input is molecular simulation data obtained with the replica-exchange molecular
dynamics method [95]. It contains 250K protein conformations, generated by 20 simulation runs,
each of which produces a trajectory in the protein conformational space. Simulations at low
energy should provide a good sampling of the protein conformational space around the native
structure of this protein.

7. Big Data visualization

The emergence of Big Data has brought about a paradigm shift throughout computer science, in
fields such as computer vision, machine learning, and multimedia analysis. Visual Big Data, which
is specifically about visual information such as images and videos, accounts for a large and
important part of Big Data. Many theories and algorithms have been developed for visual Big Data
in recent years, among which the dimensionality reduction technique [96–98] plays an increasingly
important role in the analysis of visual Big Data. Unfortunately, conventional statistical and
computational tools are often severely inadequate for processing and analysing large-scale,
multi-source and high-dimensional visual Big Data. The combination of dimensionality reduction
and visual Big Data will bring about huge opportunities as well as challenges to these
communities. In recent years, this area has gained much attention, thanks to the development of
nonlinear spectral dimensionality reduction methods, often referred to as manifold learning
algorithms, see [99].

The authors of [100] presented a tool for extracting features of data using selected dimension
reduction techniques. The methods chosen were non-negative matrix factorization, singular value
decomposition, semi-discrete decomposition, a novel neural network-based algorithm for Boolean
factor analysis, and two cluster analysis methods. The so-called bars problem was used as the
benchmark. The authors proposed generating sets of artificial signals as a Boolean sum of a given
number of bars and then analysing them using the selected dimension reduction techniques. From
the results, it was deduced that Boolean factor analysis is the most suitable method for this
kind of data.

Data or information visualization is used to synthesize information and knowledge from massive,
dynamic, ambiguous, uncertain, noisy and often conflicting data. Information visualization is a
broad research area that aims to aid users in exploring, understanding, and analysing data
through progressive, iterative visual exploration [101]. The rise of Big Data has created a need
to develop areas closely connected with it, such as machine learning, computer vision and
multimedia analysis. With the boom in Big Data and deep data analytics, visualization is being
widely used in a variety of data analysis applications [102,103]. Big Data visualization is one
of the most needed fields for Big Data processing. Many theories and algorithms have been
developed, and are still being developed, to help visualize Big Data, because the well-known,
already-developed tools and statistical methods are very often inappropriate for the nature of
large-scale, multi-source and high-dimensional visual Big Data. To understand the knowledge and
relationships that are “hidden in pure data”, it is necessary to understand the relationship
between geometric objects constructed from data using various parameter values. To visualize data
in a more understandable form based on the same principle, the original attributes are aggregated
using various techniques, among which the dimensionality reduction technique [96,97] plays an
increasingly important role. The connection between dimensionality reduction techniques and
visual Big Data will introduce huge opportunities as well as challenges to the community
interested in this area [17].

Dimensionality reduction techniques are based on mapping a high-dimensional space to a
lower-dimensional and, ideally, low-dimensional space (3D, 2D), both to visualize data better
(Fig. 6) and to solve a fundamental problem in a variety of data analysis tasks: finding an
appropriate representation for the given data, see [104]. Dimensionality reduction methods can be
divided into two groups, linear and non-linear methods. Linear methods transform the original
variables into new variables using linear combinations of the original variables. Linear
dimensionality reduction techniques include Principal

Fig. 6. Principle of dimensionality reduction.
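
The principle in Fig. 6 can be illustrated with a minimal sketch of one linear method, PCA. The sketch assumes a small in-memory data set and finds the leading eigenvectors of the sample covariance matrix by power iteration with deflation. That choice is made here only to keep the example self-contained and is not taken from any of the cited works. The function names (mean_center, covariance, top_eigenvector, pca) and the synthetic points are hypothetical.

```python
import math
import random


def mean_center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of mean-centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500):
    """Dominant eigenpair of a symmetric positive semi-definite matrix (power iteration)."""
    d = len(matrix)
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the (unit-length) vector gives the eigenvalue estimate.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v


def pca(rows, k):
    """Project rows onto their first k principal components."""
    centered = mean_center(rows)
    cov = covariance(centered)
    d = len(cov)
    components, variances = [], []
    for _ in range(k):
        lam, v = top_eigenvector(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the direction just found before searching for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return projected, components, variances


# Hypothetical data: noisy 3-D points lying close to the line x = y, z = 0.
rng = random.Random(1)
points = []
for _ in range(200):
    t = rng.uniform(-5, 5)
    points.append([t + rng.gauss(0, 0.05), t + rng.gauss(0, 0.05), rng.gauss(0, 0.05)])

# Reduce from 3 dimensions to 2; variances holds the variance captured per component.
projected, components, variances = pca(points, k=2)
```

In this toy setting the first component should align with the direction of the line and carry almost all of the variance, which is the situation in which a projection to 2D or 3D loses little information. For real data sets, an established numerical library would be a more robust choice than this hand-written iteration.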

Component Analysis (PCA), Linear Discriminant Analysis (LDA), Multi-Dimensional Scaling (MDS),
Canonical Correlations Analysis (CCA), Maximum Autocorrelation Factors (MAF), Slow Feature
Analysis (SFA), Sufficient Dimensionality Reduction (SDR), Locality Preserving Projection (LPP),
Undercomplete Independent Component Analysis (UICA), Probabilistic PCA (PPCA), Factor Analysis
(FA), Linear Regression (LR), or Distance Metric Learning (DML) [105–108].

Non-linear methods include, e.g., Non-linear Manifold Learning Methods (Laplacian Eigenmaps (LE),
Locally Linear Embedding (LLE), Isomap, Hessian Eigenmap, Semi-Definite Embedding (SDE), Manifold
Based Charting, Local Tangent Space Alignment (LTSA), Diffusion maps, Parallel vector Field
Embedding (PFE), Geodesic Distance Function Learning (GDL) and Parallel Field Alignment for cross
media Retrieval (PFAR)), Discriminative Locality Alignment (DLA), or Generalized Eigenvectors for
Multiclass (GEM) [109,105–107,110].

The latest research in the area of decomposition methods has introduced some new approaches that
make it possible to deal with the restrictions and limits of conventional methods. Wang et al.
[111] proposed the Discriminative Generalized Eigendecomposition (DGE) method, based on the idea
that better separation of a multi-dimensional feature could be helpful in finding better
discriminant vectors. DGE can deal with Gaussian and non-Gaussian distributions. In [112], a
combination of LDA and LPP is presented as the RElevant Local Discriminant Analysis (RELDA)
algorithm, which has an analytical form of the globally optimal solution and is also based on
eigendecomposition. Some interesting new variants of LPP are introduced in [113].

8. Big Data challenges

One of the biggest challenges that Big Data research faces today is digitization. Digitization is
the main part of cyber–physical systems (CPS), which introduce the fourth stage of
industrialization, commonly known as Industry 4.0. A strategic initiative called Industrie 4.0
(Industry 4.0) has been proposed and adopted by the German government as a part of the High-Tech
Strategy 2020 Action Plan [114]. Industry 4.0, or the fourth industrial revolution, is a
collective term embracing some contemporary automation, data exchange, and manufacturing
technologies. Similar strategies have also been proposed by other main industrial countries,
e.g., Industrial Internet [115] by the USA and Internet+ [116] by China. Industry 4.0 is also
referred to as Smart Factory, Cyber–Physical Production Systems or Advanced Manufacturing, but
the meaning is mostly the same. It is defined as a collective term for technologies and concepts
of value chain organizations which draw together CPS, the Internet of Things and the Internet of
Services. Smart factory modelling based on virtual design and simulation has emerged as a part of
the mainstream activities geared towards reducing the product design cycle. The smart factory is
characterized by a self-organized multi-agent system [117] supported with Big Data based feedback
and coordination.

Industry 4.0 describes a CPS-oriented production system [118–121] that integrates production
facilities, warehousing systems, logistics, and even social requirements to establish the global
value creation networks [122]. Big data and cloud computing for Industry 4.0 are viewed as data
services that utilize the data generated in Industry 4.0 implementations but are not independent
as Industry 4.0 components [123,124]. For Industry 4.0 and Smart Manufacturing processes,
large-scale data storage, sharing, processing and analysis have become key challenges to computer
science research. Some examples of these include efficient data management, additional
complexity arising from analysis of semi-structured or unstructured data, and time-critical
processing requirements. To resolve these issues and to understand this massive amount of data,
advanced visualization and data exploration techniques are critical [125].

Analytics based on Big Data has emerged only recently in the manufacturing world, where it
optimizes production quality, saves energy, and improves equipment service [126,127]. In an
Industry 4.0 context, the collection and complete evaluation of data from many heterogeneous
sources will become standard to support real-time decision-making.

A well-known example of a Big Data collection in the research area is the DataBase systems and
Logic Programming (DBLP) Computer Science Bibliography, which provides bibliographic information
on major computer science journals and proceedings. Nowadays, more than 3.35 million records,
containing article titles, authors and years of publication, are indexed in DBLP, and since 2011
more than 300 thousand records have been added every year. DBLP is an open-access database and,
owing to its content and size, a very interesting resource for the evolution analysis of
co-author networks; it can be considered an example of a Big Data data set in the non-industrial
world [128,129].

9. Conclusion

The last few years have seen a great increase in the amount of data available to scientists,
engineers, and researchers from many disciplines. Modern data science uses topological methods to
find the structural features of data sets before further supervised or unsupervised analysis. The
size of data at present is huge and continues to increase every day. Data sets with millions of
objects and hundreds, if not thousands, of measurements are now commonplace in areas such as
image analysis, computational finance, bio-informatics, and astrophysics. The variety of data
being generated is also expanding. The velocity of data generation and its growth is increasing
because of the proliferation of IoT, sensors connected to the Internet. This data provides
opportunities that allow businesses across all industries to gain real-time business insights.
We present motivational examples to show