Mining Temporally-Varying Phenomena in Scientific Datasets
R. Machiraju, S. Parthasarathy, J. Wilkins, D. Thompson, B. Gatlin, D. Richie, T. Choy, M. Jiang,
S. Mehta, M. Coatney, S. Barr, K. Hazzard
Department of Computer and Information Sciences, The Ohio State University
Department of Physics, The Ohio State University
Department of Aerospace Engineering, Mississippi State University
HiPTi Corporation
Abstract:
Simulation is enhancing and, in many instances, replacing experimentation as a means
to gain insight into complex physical phenomena. Recent advances in computer hardware and numerical methods have made it possible to simulate physical phenomena at
very fine temporal and spatial resolutions. Unfortunately, given the enormous sizes of
the datasets involved, analyzing datasets produced by these simulations is extremely
challenging. In order to more fully exploit simulation, the analysis of these large
datasets must advance beyond current techniques that are based on interactive visualization.
We outline our vision for one such approach and describe progress on a unified
framework that promises to provide a novel method to explore large simulation data
3.1 Introduction
The physical and engineering sciences increasingly study large, complex ensembles
seeking to understand the underlying phenomena. These studies require analysis of
the data generated by either experiments or computational simulations. In this chapter,
we focus on the latter and provide motivation using applications from two disparate
fields: numerical simulations of fluid flow and molecular dynamics. Computational
fluid dynamics (CFD) seeks to understand flow patterns to enhance, for instance, drug
delivery schemes for pulmonary treatments for asthma. Similarly, molecular dynamics
(MD) seeks to understand the evolution of material defects that affect the properties or
performance of industrial materials. In these data, patterns of interest arise and evolve
over time as a result of the unsteady nature of the phenomenon under consideration.
Scientific discoveries are often best understood visually, from Galle seeing Neptune in 1846 to Binnig and Rohrer seeing atoms on a surface in the twentieth century.
Neither discovery was a surprise, in the sense that previous analysis had already convinced
most scientists of their reality. However, each discovery stimulated future work more dramatically than any analysis might have done.
Unfortunately, the size of simulation datasets significantly challenges our ability
to explore and comprehend the generated data effectively. Analysis via interactive visualization sessions is tantamount to searching for the proverbial needle in a haystack.
Currently, a well-trained individual may need several days or even weeks to analyze the
data generated by an MD simulation and create a list of viable defect structures. Similarly, in the extremely large datasets generated by simulations of complex fluid flows,
locating and tracking relevant features are daunting tasks. In both cases, phenomena
occur on multiple length and time scales. Some features persist sufficiently to have
gross macroscopic effects. Other short-lived transients are precursor events central to
the unsteady (in the temporal domain) behavior of the system. An additional complication is that currently available hardware is not yet powerful enough to provide even
near-real-time visualizations.
Therefore, we believe it is crucial that some degree of automation be incorporated
into the exploration process for large datasets. One such successful approach is described in [Machiraju et al.2001] and is based on a representational scheme that facilitates ranked access to macroscopic features in the dataset. However, other than identifying, denoising, and ranking the features, no attempt is made to extract information
M ACHIRAJU ,
ET AL .
[Figure 3.1. Framework diagram with the following stage labels: Molecular Dynamics; Spatial Partition / Derived Quantities; Time-Domain MRA; Meta-Stability Detection; Transient Detection; Shape/Structure Identification; Event Identification; Feature Tracking; Spatio-Temporal Rule Mining.]
Figure 3.1 illustrates our generalized framework applied to processing of physically based simulation data. We contend that a common framework can compactly store and
analyze data of evolutionary phenomena. We assume that certain locally computable
quantities can detect precursor events. Our approach is novel in its flexibility and applicability across disciplines. The shape-based analysis converts the task of data management and analysis into one of choosing robust shape descriptors and being able to
index features from a catalog. The descriptors will be derived from the application.
In addition to feature detection algorithms, aggregation or segmentation, tracking
and characterization algorithms must be utilized in conjunction with traditional data
mining algorithms to facilitate cataloging detected structures and expediting searches
multidimensional shape space. The descriptors for an MD simulation can include the
number of atoms involved, their orientations, the connectivity between atoms, the trajectory, and history of its evolution. In a CFD simulation, vortices, the type of feature
of importance for respiratory flows, can be characterized by their strength and sense
of rotation as well as obvious geometrical parameters such as position, shape, and
extent. These features can be categorized by notions of similarity. Shape categories
enable synergistic understanding of events and features in the MD and CFD domains.
To compute the similarity between shapes or structures we rely on spatial geometric
hashing [Wolfson & Rigotsos1997] and clustering algorithms [Jain & Dubes1988]. To
categorize the structures we rely on classification algorithms [Quinlan1996] using the
generalized shape descriptors as input to the classifiers. For CFD data we employ a generalized shape descriptor for swirling regions and propose hierarchical shape matching
algorithms.
A third component of feature mining is the correspondence and tracking of features
over time. The generation of new features and destruction of existing features pose
major challenges to effective feature-tracking algorithms. The essential problem is to
determine how the position of a particular feature changes during a given time interval.
In our datasets, this is non-trivial since fissures and fusions of features are extremely
common. Furthermore, the structural descriptors of the same feature may change over
time. Tracking and correspondence complete the construction of the multi-dimensional
shape space for a given application. Relevant related work in feature tracking was
reported in [Samtaney et al.1994, Silver & Wang1997]. Shapes were not considered
therein and the method is, in general, expensive. Similarly, in [Reinders, Post, &
Spoelder1999, Reinders, Jacobson, & Post2000] the skeleton or an approximate medial
axis was computed for vortices. However, this representation is very crisp and does not
allow tangible matching and tracking. In [Thampy2003], a predictive algorithm was
developed that utilized the evolution of selected kinematical and dynamical properties
to enhance confidence in the correspondence algorithm.
Mining for Spatial and Spatio-Temporal Patterns
Over any time interval in a simulation, we need techniques that can identify important
spatial patterns efficiently. Some patterns can be complex and not necessarily sequential. The aim is to derive predictive rules: combinations of features resulting in certain
events (e.g., fusion or fissure). Deriving such rules requires identifying frequently
occurring spatial patterns. Clustering, association [Agrawal et al.1996], sequential pattern analysis [Parthasarathy et al.1999], and spatio-temporal analysis [Vlachos, Kollios, & Gunopulos2002] will be used to determine the important patterns.
Our eventual goal is to correlate information from a shape categorization together with
transition detection mechanisms to help discover novel axioms relating to the evolution
of shapes over time. An example of such an axiom could be that a type-A feature evolves
into a type-B feature through some particular mechanism. Such rules can be found
using event-based sequential and association pattern analysis. Equally important is to
identify those axioms that dominate the particular simulation type. These data mining
algorithms will operate on the shape space constructed in an earlier step and produce
explanations of feature behavior and evolution.
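The kind of rule described above ("a type-A feature evolves into a type-B feature") can be illustrated with a small counting sketch. This is only a simplified stand-in for the association and sequential pattern mining methods cited; the function name `mine_transition_rules`, the representation of each tracked feature as a list of shape-category labels, and the support and confidence thresholds are all assumptions made for illustration.

```python
from collections import Counter

def mine_transition_rules(histories, min_support=2, min_confidence=0.5):
    """Count category changes between consecutive time steps and keep frequent ones.

    histories: one list of shape-category labels per tracked feature (hypothetical input).
    Returns (source, target, support, confidence) tuples, highest confidence first.
    """
    pair_counts = Counter()    # how often source -> target was observed (source != target)
    source_counts = Counter()  # how often source was observed with any successor
    for history in histories:
        for src, dst in zip(history, history[1:]):
            source_counts[src] += 1
            if src != dst:
                pair_counts[(src, dst)] += 1
    rules = []
    for (src, dst), support in pair_counts.items():
        confidence = support / source_counts[src]
        if support >= min_support and confidence >= min_confidence:
            rules.append((src, dst, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))

# Hypothetical category histories for five tracked features.
histories = [
    ["A", "A", "B"],
    ["A", "B", "B"],
    ["A", "B"],
    ["C", "A"],
    ["A", "C"],
]
rules = mine_transition_rules(histories)
for src, dst, support, confidence in rules:
    print(f"type-{src} -> type-{dst}: support={support}, confidence={confidence:.2f}")
```

Here confidence is the fraction of observations of the source category that are followed by the target category, so rules that dominate a simulation type rise to the top of the list.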
the features, namely edges. This suggests that a blend of data- and feature-mining
methods might have the potential to reduce the burdensome chore of finding features
in large datasets.
Point classification techniques
The first feature detection paradigm, which we call point classification, requires several
operations in sequence:
This approach identifies individual points as belonging to a feature and then aggregates them to identify regions that are features. The points are obtained from a tour of
the discrete domain and can be in many cases the grid points of a physical grid (CFD)
or a lattice (MD). The operator used in the detection step and the criteria used in the
classification step embody physically based point-wise characteristics of the feature of
interest. In this context, classification accords membership of a discrete point in the
dataset to a feature.
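A minimal sketch of the detect, classify, and aggregate sequence described above, under several assumptions: the domain is a small 2D grid, the detection operator has already produced one scalar per grid point, classification is a plain threshold, and aggregation groups 4-connected points. The names `classify_points` and `aggregate_regions` are hypothetical and do not come from the framework itself.

```python
from collections import deque

def classify_points(detector_output, threshold):
    """Classification step: a point belongs to the feature if its detector value exceeds threshold."""
    return [[value > threshold for value in row] for row in detector_output]

def aggregate_regions(mask):
    """Aggregation step: group 4-connected feature points into regions with breadth-first search."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                i, j = queue.popleft()
                region.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and mask[ni][nj] and not seen[ni][nj]):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            regions.append(region)
    return regions

# Hypothetical detector values on a 3x4 grid (the detection step is assumed done).
detector_output = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.95],
]
mask = classify_points(detector_output, threshold=0.5)
regions = aggregate_regions(mask)
print(len(regions), [len(region) for region in regions])
```

Because each point is classified in isolation, an isolated high detector value such as the single point in the lower right corner still becomes a region, which is one way false positives can arise with this paradigm.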
Aggregate classification techniques
We can best incorporate the global information needed to define a vortex into our second feature detection paradigm, the aggregate classification approach. Aggregate classification follows a somewhat different sequence of operations:
Figure 3.2: The results of our point classification algorithm applied to a delta wing
dataset. The front and top views respectively are shown. The yellow regions indicate
regions of swirling flow. There exist several regions which are falsely classified.
Figure 3.3: The results of our aggregate classification technique applied to the delta
wing dataset. (left) All candidate core regions are shown. The verified cores are shown
in yellow while the spurious ones are shown in green. (middle) Streamline tracing
around verified cores. (right) The top image shows the verification algorithm at work
through seeding and tracing, while the bottom image illustrates the use of projections and angles to verify vortices.
and it is precisely these flow patterns that we search for in the computational grid. Not
surprisingly, our approach is related to critical point theory. However, critical points
alone are not sufficient to detect a vortex. For each grid point, our algorithm examines
its immediate neighbors to see whether the neighboring velocity vectors point in three
or more direction ranges. The novelty of this method is its relative insensitivity to core
direction. Therefore, very approximate core directions may be used in the detection
step.
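The neighbor test described above can be sketched in two dimensions as follows. This is an illustrative simplification, not the authors' implementation: it assumes velocities already projected onto a plane, divides the circle into four equal direction ranges (the number of ranges is an assumption), and uses the eight immediate grid neighbors. The names `direction_label` and `is_candidate` are hypothetical.

```python
import math

def direction_label(vx, vy, num_ranges=4):
    """Return the index of the angular range that the 2D vector (vx, vy) points into."""
    angle = math.atan2(vy, vx) % (2 * math.pi)
    return int(angle // (2 * math.pi / num_ranges)) % num_ranges

def is_candidate(velocity, i, j, num_ranges=4, min_ranges=3):
    """Flag grid point (i, j) if its neighbors' velocities cover at least min_ranges direction ranges."""
    rows, cols = len(velocity), len(velocity[0])
    labels = set()
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            ni, nj = i + di, j + dj
            if (di, dj) == (0, 0) or not (0 <= ni < rows and 0 <= nj < cols):
                continue
            vx, vy = velocity[ni][nj]
            if vx == 0.0 and vy == 0.0:
                continue  # a zero vector has no direction to label
            labels.add(direction_label(vx, vy, num_ranges))
    return len(labels) >= min_ranges

# Rigid rotation about the center of a 3x3 grid: v = (-y, x) with x = j - 1, y = i - 1.
rotation = [[(float(-(i - 1)), float(j - 1)) for j in range(3)] for i in range(3)]
# Uniform flow in the +x direction.
uniform = [[(1.0, 0.0) for _ in range(3)] for _ in range(3)]

print(is_candidate(rotation, 1, 1))
print(is_candidate(uniform, 1, 1))
```

The test only asks which range each neighboring vector falls into, so small errors in the plane of projection rarely change the outcome, which is consistent with the insensitivity to core direction noted above.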
Our technique segments candidate core regions by aggregating points identified
from the detection phase. We then classify (or verify) these candidate core regions
based on the existence of swirling streamlines surrounding them. (For features that
lack a formal definition, such as the vortex, we must choose the verification criteria
so that they concur with the intuitive understanding of the feature. In this case, verifying whether a candidate core region is a vortex core region requires checking for any
swirling streamlines surrounding it.) Checking for swirling flow in three dimensions
is a nontrivial problem since vortices can bend and twist. The technique we developed
essentially checks to see if the local tangent to the streamline, when projected onto the
plane normal to the local core tangent, spans 2π.
The aggregate nature of this classification step is apparent. Checking for swirling streamlines is a global (or aggregate)
approach to feature classification (or verification) because swirling is measured with
respect to the core region, not just individual points within the core region. Figure 3.3
describes all steps of this paradigm.
Figure 3.4: Black atoms are defect atoms. Top is a structure identified at 1000K.
Bottom is the same structure quenched using first principles. Even though the atoms
in the top structure are displaced due to thermal noise, the same atoms are marked as
defect in both structures.
Figure 3.5: Two separated defects: black atoms are one cluster, grey atoms are a different
cluster.
Figure 3.6: These two structures have a different number of defect atoms marked.
When quenched, however, they are the same structure.
quenched are the same. Additionally, we are still exploring robust and viable shape
descriptors and matching algorithms for MD data.
son2002b] demonstrate the ability to identify regions of swirling flow in complex three-dimensional flow fields.
Consideration of time-varying data introduces additional complexity through the
need for tracking of features. According to [Samtaney et al.1994], five distinct evolutionary events can occur to features in scientific simulations: continuation, creation,
dissipation, bifurcation, and amalgamation. Each of these processes must be accounted
for in the tracking algorithm. The work in [Silver & Wang1997] is applicable for general three-dimensional tracking of features. Other solutions to this problem exploit
hierarchical data structures [Carr, Snoeyink, & Axen2000, Shen, Chiang, & Ma1999].
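The five evolutionary events listed above can be told apart once features in consecutive time steps have been put into correspondence. The sketch below assumes, purely for illustration, that each feature is a set of grid cells and that two features correspond when their cell sets overlap; it is not the algorithm of any of the cited works, and the name `classify_events` is hypothetical.

```python
def classify_events(prev_features, next_features):
    """Label the evolution between two time steps.

    prev_features, next_features: dict mapping a feature id to its set of grid cells.
    Returns a list of (event, previous_ids, next_ids) tuples.
    """
    successors = {p: sorted(n for n, cells in next_features.items() if cells & p_cells)
                  for p, p_cells in prev_features.items()}
    predecessors = {n: sorted(p for p, cells in prev_features.items() if cells & n_cells)
                    for n, n_cells in next_features.items()}
    events = []
    for p, targets in successors.items():
        if not targets:
            events.append(("dissipation", (p,), ()))
        elif len(targets) > 1:
            events.append(("bifurcation", (p,), tuple(targets)))
    for n, sources in predecessors.items():
        if not sources:
            events.append(("creation", (), (n,)))
        elif len(sources) > 1:
            events.append(("amalgamation", tuple(sources), (n,)))
        elif len(successors[sources[0]]) == 1:
            events.append(("continuation", (sources[0],), (n,)))
    return events

# Hypothetical features at two consecutive time steps.
prev_features = {
    "A": {(0, 0), (0, 1)},
    "B": {(5, 5)},
    "C": {(9, 9)},
    "D": {(3, 3), (3, 4)},
    "G": {(8, 8)},
}
next_features = {
    "A1": {(0, 1), (0, 2)},   # overlaps A only
    "E": {(5, 5), (9, 9)},    # overlaps B and C
    "D1": {(3, 3)},           # D splits into D1 and D2
    "D2": {(3, 4)},
    "F": {(7, 0)},            # overlaps nothing
}
events = classify_events(prev_features, next_features)
for event in events:
    print(event)
```

Pure overlap fails when a feature moves farther than its own extent between time steps, which is one reason attribute-based and predictive correspondence schemes are of interest.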
3.6 Summary
The steady increase in computing power available for science and engineering problems challenges our ability to learn new science from the massive data. We have
proposed and are developing a generalized framework that facilitates the analysis of
large-scale simulation data for time-varying, evolutionary phenomena. The key component of our approach is an abstract shape-based description of the relevant features.
This abstract notion of shape allows us to apply more general data mining algorithms
to the extracted features and their characteristics.
Our flexible approach is motivated by two disparate applications: respiratory flow
and material defect simulation. Both drivers raise central issues that the components
of the framework will necessarily address:
Feature mining
It should be noted that both science drivers have commonalities that are exploited by
the techniques listed above.
Preliminary results have been very encouraging. However, more remains to be
done to realize the complete unified framework. A systematic approach to feature mining was conceived to locate both local and global features. Currently, tracking features
in a time-varying dataset is being investigated. Similarly, we are conceiving a comprehensive framework that will allow one to derive appropriate associations between
the occurrence of transitionary events and the change in feature demographics. This
framework will also include environmental parameters such as the underlying geometry. Also of interest is the creation of tools which will control both the feature- and
data-mining exercises. It is our belief that our proposed framework is likely to garner
new insights from massive simulation datasets and allow for a better understanding of
the underlying physical phenomena.
Bibliography
[Agrawal et al.1996] Agrawal, R.; Mannila, H.; Srikant, R.; Toivonen, H.; and
Verkamo, A. I. 1996. Fast discovery of association rules. In et al., U. F., ed.,
Advances in Knowledge Discovery and Data Mining. MIT Press.
[Arai, Takeda, & Kohyama1997] Arai, N.; Takeda, S.; and Kohyama, M. 1997. Self-interstitial clustering in crystalline silicon. Phys. Rev. Lett. 78:4265.
[Banks & Singer1995] Banks, D. C., and Singer, B. A. 1995. A Predictor-Corrector
Technique for Visualizing Unsteady Flow. IEEE Transactions on Visualization and
Computer Graphics 1(2):151–163.
[Berdahl & Thompson1993] Berdahl, C. H., and Thompson, D. S. 1993. Eduction of
Swirling Structure using the Velocity Gradient Tensor. AIAA J. 31(1):97–103.
[Burl et al.1998] Burl, M.; Asker, L.; Smyth, P.; Fayyad, U.; Perona, P.; Aubele, J.;
and Crumpler, L. 1998. Learning to recognize volcanoes on Venus. In Machine
Learning, 165–195.
[Carr, Snoeyink, & Axen2000] Carr, H.; Snoeyink, J.; and Axen, U. 2000. Computing contour trees in all dimensions. In Proc. 11th ACM/SIAM Symp. on Discrete
Algorithms.
[Cowern et al.1999] Cowern, N. E. B.; Mannino, G.; Stolk, P. A.; Roozeboom, F.;
Huizing, H. G. A.; van Berkum, J. G. M.; Cristiano, F.; Claverie, A.; and Jaraiz, M.
1999. Energetics of self-interstitial clusters in Si. Phys. Rev. Lett. 82:4460.
[Dehaspe, Toivonen, & King1998] Dehaspe, L.; Toivonen, H.; and King, R. 1998.
Finding frequent substructures in chemical compounds. In International Conference
on Knowledge Discovery and Data Mining.
[Gatlin et al.1995] Gatlin, B.; Cuicchi, C. E.; Hammersley, J. R.; Olsen, D. E.; Reddy,
R. N.; and Burnside, G. G. 1995. Computational simulation of steady and oscillating
flow in branching tubes. In The 1995 ASME/JSME Fluids Engineering and Laser
Anemometry Conference and Exhibition: Bio-Medical Fluids Engineering, volume
FED-212, 1–8. American Society of Mechanical Engineers. Hilton Head, SC.
[Gatlin et al.1997a] Gatlin, B.; Cuicchi, C. E.; Hammersley, J. R.; Olsen, D. E.;
Reddy, R. N.; and Burnside, G. G. 1997a. Computation of converging and diverging flow through an asymmetric tubular bifurcation. In The 1997 ASME Fluids
[Marusic et al.2001] Marusic, I.; Chandler, G. V.; Interrante, V.; Subbareddy, P. K.;
and Moss, A. 2001. Real Time Feature Extraction For the Analysis of Turbulent
Flows. In et al., R. L. G., ed., Data Mining for Scientific and Engineering Applications, 223–238. Kluwer Academic Publishers.
[Montalenti, Sørensen, & Voter2001] Montalenti, F.; Sørensen, M.; and Voter, A.
2001. Closing the gap between experiment and theory: Crystal growth by temperature accelerated dynamics. Phys. Rev. Lett. 87:126101.
[Parthasarathy & Coatney2002] Parthasarathy, S., and Coatney, M. 2002. Efficient
discovery of common substructures in macromolecules. In IEEE International Conference on Data Mining.
[Parthasarathy et al.1999] Parthasarathy, S.; Zaki, M.; Ogihara, M.; and Dwarkadas,
S. 1999. Incremental and interactive sequence mining. ACM Conference on Information and Knowledge Management (CIKM).
[Portela1997] Portela, L. M. 1997. On the identification and classification of vortices.
Ph.D. Dissertation, Stanford University.
[Quinlan1996] Quinlan, J. R. 1996. Induction of decision trees. Machine Learning
5(1):71–100.
[Reinders, Jacobson, & Post2000] Reinders, F.; Jacobson, M. E. D.; and Post, F. H.
2000. Skeleton Graph Generation for Feature Shape Description. In Joint
Eurographics-IEEE TCVG Symposium on Visualization, 73–82.
[Reinders, Post, & Spoelder1999] Reinders, F.; Post, F. H.; and Spoelder, H. J. W.
1999. Attribute-Based Feature Tracking. In Joint Eurographics-IEEE TCVG Symposium on Visualization, 63–72.
[Richie et al.2002] Richie, D.; Kim, J.; Hazzard, R.; Hazzard, K.; Barr, S.; and
Wilkins, J. 2002. Large-scale molecular dynamics simulations of interstitial defect diffusion in silicon. volume 731, W9.10. Material Research Society.
[Richie, Kim, & Wilkins2001] Richie, D.; Kim, J.; and Wilkins, J. 2001. Applications of real-time multiresolution analysis for molecular dynamics simulations of
infrequent events. volume 677, AA5.1. Material Research Society.
[Samtaney et al.1994] Samtaney, R.; Silver, D.; Zabusky, N.; and Cao, J. 1994. Visualizing Features and Tracking Their Evolution. IEEE Computer 27(7):20–27.
[Shen, Chiang, & Ma1999] Shen, H.-W.; Chiang, L.; and Ma, K.-L. 1999. Time-Varying Volume Rendering Using a Time-Space Partitioning Tree. In Proceedings
of Visualization '99, 371–378.
[Silver & Wang1997] Silver, D., and Wang, X. 1997. Tracking and Visualizing Turbulent 3D Features. IEEE Transactions on Visualization and Computer Graphics
3(2).