Information Technology in Bio- and Medical Informatics: 5th International
Conference, ITBAM 2014, Munich, Germany, September 2, 2014, Proceedings,
1st Edition, Miroslav Bursa
Miroslav Bursa
Sami Khuri
M. Elena Renda (Eds.)

Information Technology
in Bio- and
Medical Informatics
5th International Conference, ITBAM 2014
Munich, Germany, September 2, 2014
Proceedings

LNCS 8649
Lecture Notes in Computer Science 8649
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Volume Editors
Miroslav Bursa
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Technicka 2
166 27 Prague 6, Czech Republic
E-mail: [email protected]
Sami Khuri
San Jose State University
Department of Computer Science
One Washington Square
San Jose, CA 95192-0249, USA
E-mail: [email protected]
M. Elena Renda
Istituto di Informatica e Telematica del CNR
Via G. Moruzzi 1
56124 Pisa, Italy
E-mail: [email protected]

ISSN 0302-9743 e-ISSN 1611-3349
ISBN 978-3-319-10264-1 e-ISBN 978-3-319-10265-8
DOI 10.1007/978-3-319-10265-8
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: 2014945811

LNCS Sublibrary: SL 3 – Information Systems and Application,
incl. Internet/Web and HCI
© Springer International Publishing Switzerland 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication
or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location,
in its current version, and permission for use must always be obtained from Springer. Permissions for use
may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution
under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

Biomedical engineering and medical informatics represent challenging and rapidly
growing areas. Applications of information technology in these areas are of
paramount importance. Building on the success of ITBAM 2010, ITBAM 2011,
ITBAM 2012, and ITBAM 2013, the aim of the 5th ITBAM conference was to
continue bringing together scientists, researchers, and practitioners from
different disciplines, namely mathematics, computer science, bioinformatics,
biomedical engineering, medicine, biology, and different fields of life sciences, so
that they can present and discuss their research results in bioinformatics and medical
informatics. We hope that ITBAM will serve as a platform for fruitful
discussions between all attendees, where participants can exchange their recent results,
identify future directions and challenges, initiate possible collaborative research,
and develop common languages for solving problems in the realm of
biomedical engineering, bioinformatics, and medical informatics. The importance of
computer-aided diagnosis and therapy continues to draw attention worldwide
and has laid the foundations for modern medicine with excellent potential for
promising applications in a variety of fields, such as telemedicine, Web-based
healthcare, analysis of genetic information, and personalized medicine.
Following a thorough peer-review process, we selected 9 long papers for
oral presentation and 3 short papers for the poster session of the 5th annual ITBAM
conference (7 submissions were rejected). The Organizing Committee would like to thank the
reviewers for their excellent job. The articles can be found in the proceedings
and are divided into the following sections: Clustering and Bioinformatics; Medical
Image and Data Processing; Knowledge Discovery and Machine Learning in
Medicine. The papers show how broad the spectrum of topics in applications of
information technology to biomedical engineering and medical informatics is.
The editors would like to thank all the participants for their high-quality
contributions and Springer for publishing the proceedings of this conference.
Once again, our special thanks go to Gabriela Wagner for her hard work on
various aspects of this event.

June 2014
Miroslav Bursa
M. Elena Renda
Sami Khuri
Organization

General Chair
Christian Böhm          University of Munich, Germany

Program Committee Co-chairs
Miroslav Bursa          Czech Technical University in Prague, Czech Republic
Sami Khuri              San José State University, USA
M. Elena Renda          IIT - CNR, Pisa, Italy

Program Committee
Werner Aigner           FAW, Austria
Fuat Akal               Functional Genomics Center Zurich, Switzerland
Tatsuya Akutsu          Kyoto University, Japan
Andreas Albrecht        Queen’s University Belfast, Ireland
Peter Baumann           Jacobs University Bremen, Germany
Balaram Bhattacharyya   Visva-Bharati University, India
Veselka Boeva           Technical University of Plovdiv, Bulgaria
Roberta Bosotti         Nerviano Medical Science s.r.l., Italy
Rita Casadio            University of Bologna, Italy
Sònia Casillas          Universitat Autònoma de Barcelona, Spain
Kun-Mao Chao            National Taiwan University, Taiwan
Vaclav Chudacek         Czech Technical University in Prague, Czech Republic
Hans-Dieter Ehrich      Technical University of Braunschweig, Germany
Christoph M. Friedrich  University of Applied Sciences Dortmund, Germany
Alejandro Giorgetti     University of Verona, Italy
Jan Havlik              Dep. of Circuit Theory, FEE, Czech Technical University in Prague, Czech Republic
Volker Heun             Ludwig-Maximilians-Universität München, Germany
Larisa Ismailova        NRNU MEPhI, Russia
Alastair Kerr           University of Edinburgh, UK
Michal Krátký           Technical University of Ostrava, Czech Republic
Vaclav Kremen           Czech Technical University in Prague, Czech Republic
Jakub Kuzilek           Czech Technical University, Czech Republic
Gorka Lasso             CIC bioGUNE, Spain
Lenka Lhotska           Czech Technical University, Czech Republic
Roger Marshall          Plymouth State University, USA
Elio Masciari           ICAR-CNR, Università della Calabria, Italy
Erika Melissari         University of Pisa, Italy
Henning Mersch          RWTH Aachen University, Germany
Jean-Christophe Nebel   Kingston University, UK
Vit Novacek             National University of Ireland, Ireland
Nadia Pisanti           University of Pisa, Italy
Cinzia Pizzi            Università degli Studi di Padova, Italy
Clara Pizzuti           (ICAR)-National Research Council (CNR), Italy
Nicole Radde            Universität Stuttgart, Germany
Stefano Rovetta         University of Genova, Italy
Huseyin Seker           De Montfort University, UK
Jiri Spilka             Czech Technical University in Prague, Czech Republic
Kathleen Steinhofel     King’s College London, UK
Karla Stepanova         Czech Technical University, Czech Republic
Roland R. Wagner        University of Linz, Austria
Viacheslav Wolfengagen  Institute JurInfoR-MSU, Russia
Borys Wrobel            Polish Academy of Sciences, Poland
Filip Zavoral           Charles University in Prague, Czech Republic
Songmao Zhang           Chinese Academy of Sciences, China
Qiang Zhu               The University of Michigan, USA
Table of Contents

Clustering and Bioinformatics

BINOS4DNA: Bitmap Indexes and NoSQL for Identifying Species with
DNA Signatures through Metagenomics Samples
    Ramin Karimi, Ladjel Bellatreche, Patrick Girard,
    Ahcene Boukorca, and Andras Hajdu

Centroid Clustering of Cellular Lineage Trees
    Valeriy Khakhutskyy, Michael Schwarzfischer, Nina Hubig,
    Claudia Plant, Carsten Marr, Michael A. Rieger,
    Timm Schroeder, and Fabian J. Theis

A Discussion on the Biological Relevance of Clustering Results
    Pietro Hiram Guzzi, Elio Masciari,
    Giuseppe Massimiliano Mazzeo, and Carlo Zaniolo

Medical Image and Data Processing

Segmentation and Kinetic Analysis of Breast Lesions in DCE-MR
Imaging Using ICA
    Sebastian Goebl, Anke Meyer-Baese, Marc Lobbes, and Claudia Plant

Quantitative Fetal Growth Curves Comparison: A Collaborative
Approach
    Mario A. Bochicchio, Lucia Vaira, Antonella Longo,
    Antonio Malvasi, and Andrea Tinelli

Poster Session

Knowledge Reasoning Model to Support Clinical Decision Making
    Qingshan Li, Jing Feng, Lu Wang, Hua Chu, and WeiJuan Fu

Method for Knowledge Acquisition and Decision-Making Process
Analysis in Clinical Decision Support System
    Qingshan Li, Jing Feng, Lu Wang, Hua Chu, and He Yu

Towards the Integration of the Knowledge from Biomedical
Databases
    Eshref Januzaj

Knowledge Discovery and Machine Learning in Medicine

Pervasive and Intelligent Decision Support in Intensive Medicine –
The Complete Picture
    Filipe Portela, Manuel Filipe Santos, José Machado,
    António Abelha, Álvaro Silva, and Fernando Rua

Mining Medical Data to Obtain Fuzzy Predicates
    Taymi Ceruto, Orenia Lapeira, Annika Tonch, Claudia Plant,
    Rafael Espin, and Alejandro Rosete

On Patient’s Characteristics Extraction for Metabolic Syndrome
Diagnosis: Predictive Modelling Based on Machine Learning
    František Babič, Ljiljana Majnarić, Alexandra Lukáčová,
    Ján Paralič, and Andreas Holzinger

An Evolutionary Method for Exceptional Association Rule Set
Discovery from Incomplete Database
    Kaoru Shimada and Takashi Hanioka

Author Index


BINOS4DNA: Bitmap Indexes and NoSQL
for Identifying Species with DNA Signatures
through Metagenomics Samples

Ramin Karimi¹,², Ladjel Bellatreche¹, Patrick Girard¹, Ahcene Boukorca¹,
and Andras Hajdu²

¹ LIAS/ISAE-ENSMA, Poitiers University, Futuroscope, France
{bellatreche,girard,ahcene.boukorca}@ensma.fr
² Faculty of Informatics, Debrecen University, Hungary
{ramin.karimi,hajdu.andras}@inf.unideb.hu

Abstract. The advancement of next generation sequencing (NGS) and
shotgun sequencing technologies has produced massive amounts of genomics
data. Metagenomics, a powerful technique for studying the genetic material of
uncultivable microorganisms obtained directly from their natural
environment, deals with high-throughput sequencing read data sets.
Assembling, binning and aligning short reads in order to identify the
microorganisms of a Metagenomics sample are expensive and
time-consuming, regardless of other restrictions. A DNA signature is a short
nucleotide sequence fragment that distinguishes one species from
all other species. It can be a basis for identifying microorganisms in both
environmental and clinical samples directly from the short reads, without
assembly and alignment. In this paper, we propose a scalable
method that uses optimization techniques borrowed from database
technology, namely bitmap indexes. They are used to speed up the searching
and matching of billions of DNA signatures in the short reads of
thousands of different microorganisms, using commodity high-performance
computing tools such as Hadoop MapReduce, Hive and Hbase.

Keywords: Metagenomics, Short Reads, DNA signature, Hadoop and
MapReduce, Hive, Bitmap Index, Hbase.

1 Introduction
In the age of Whole Genome Shotgun (WGS) sequencing and information
technology, the development of new techniques and applications in biology to study
microorganisms is in high demand in both the clinical and environmental
communities. The number of existing microbial species is estimated at 10^5 to 10^6 [1, 2].

This work was performed when Ramin Karimi was visiting the LIAS/ISAE-ENSMA
Lab. This visit is funded by ERASMUS mobility program. The work was also sup-
ported in part by the projects TMOP-4.2.2.C-11/1/KONV-2012-0001, and TMOP
4.2.4. A/2-11-1-2012-0001 supported by the European Union, co-financed by the
European Social Fund, and by the OTKA grant NK101680.

M. Bursa et al. (Eds.): ITBAM 2014, LNCS 8649, pp. 1–14, 2014.

© Springer International Publishing Switzerland 2014

The majority (> 99%) of microorganisms from the environment resist cultivation
in the laboratory [3], and it was impossible to investigate them until a few years
ago. With the advances in next generation sequencing (NGS) and Metagenomics
techniques in the last few years, it is now possible to obtain directly the genetic
content of all organisms, with their complex communities, gathered from the natural
environment in which they normally live.
The output of sequencing technology is short fragments of DNA sequence,
25 to 900 base pairs (bp) in length, called short reads. Their length varies from one
sequencing technology to another. For instance, sequencing machines made by
Illumina, Applied Biosystems (ABI), and Helicos of Cambridge produce short
sequences of 25 to 100 bp.
Long DNA molecules extracted from the sample are broken into smaller pieces
by special fragmentation and cloning techniques. These small pieces are then
fed into the sequencer to determine the order of nucleotides in the short fragments
of DNA [4]. The sequencing output for a Metagenome sample is an enormous data set
containing the short reads of hundreds to thousands of known and unknown
organisms. Efficient implementations that facilitate the analysis process are
urgently required in both the biological and computational parts of any
Metagenomics project. Figure 1 details the steps involved in a typical sequence-based
Metagenome project [5].

Fig. 1. A typical Metagenome project flow diagram. Dashed arrows indicate steps that
can be omitted.
Sequence-based identification of species can be classified into two groups:
assembly and alignment-based approaches on the one hand, and alignment-free
identification approaches on the other hand [6].
Assembly is used to construct the complete genome of a species by
searching for and matching the overlapping parts of the short reads and merging them
together. Alignment, in contrast, is used to reconstruct the whole genome of a
previously known species, using a reference genome as a map to find the similarities
of the reads in different genome regions, taking into account the structure, function
and evolutionary relationship between the reads and the reference sequences.
Besides being time-consuming and costly, alignment and assembly programs
face considerable technical challenges. Sequenced reads are short in length
and large in volume, very noisy and partial, with too many missing parts [7].
Reads contain sequencing errors caused by the sequencers. Moreover, repetitive
elements in the DNA sequence of species are another challenge for alignment and
assembly. As an example, about half of the human genome is covered by repeats
[8]. These challenges cause computational complexity and introduce ambiguity and
errors when interpreting the results of alignment- and assembly-based identification.
Phylogenetic analysis mostly uses multiple alignments of sequences [9, 10],
which are suitable for comparing large sets of sequences all together. However,
methods of multiple sequence alignment, in addition to all the above restrictions,
are still computationally very expensive and require considerable computational
tools and resources, such as servers [11].
Thus, there is an essential need to develop efficient alignment-free methods
for phylogenetic analysis and identification of species in Metagenomics in order
to reduce the computational complexity, time and cost.
Due to the above challenges, alignment of whole genome shotgun sequencing
reads is difficult, and no method has been developed to compare genomes
directly from read data, without assembly [13].
Most alignment-free methods use word frequencies, where words are
small fragments of sequence called k-mers or n-grams in the literature, and
k and n are the fixed lengths of the oligonucleotides used to represent a sequence [12–14].
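The word-frequency idea can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `kmer_counts` and `kmer_frequencies` are hypothetical, and the sketch assumes overlapping words counted on a single strand.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (word of length k) in a DNA sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # Slide a window of width k over the sequence, one base at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_frequencies(sequence: str, k: int) -> dict:
    """Turn raw k-mer counts into relative frequencies that sum to 1."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


# Toy example: a 10-base sequence contains 10 - 3 + 1 = 8 overlapping 3-mers.
counts = kmer_counts("ACGTACGTAC", 3)
frequencies = kmer_frequencies("ACGTACGTAC", 3)
```

Two sequences can then be compared through their frequency vectors rather than through an alignment, which is the general principle behind the alignment-free methods cited above.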
A DNA signature is a unique small fragment of nucleotide sequence used
to detect a target organism among all others. It can be a good solution for
real-time identification of species. There exist methods for detecting hundreds
to hundreds of thousands of signatures with different nucleotide lengths for
every species, using k-word frequencies and pattern-comparison-based methods
[10], [15–17].
Using DNA signatures in isolated-sample studies and Polymerase Chain
Reaction (PCR) based detection is easy, because of the low number of
targets. In Metagenomics studies, however, it is much more complicated. Taking into
account the number of signatures, short reads and organisms in Metagenome
samples, it is obvious that we are facing massive data sets. Using ordinary
hardware and software tools is not feasible, since processing takes a long time even if no
failure occurs during the process.

In this paper, we propose a method that shows how parallel and distributed
computing and the bitmap indexing technique can solve this problem. This paper
is organized as follows. Section 2 presents the background on high-performance
computing and bitmap indexes needed to detail our proposal. Section 3 describes
our methodology. In Section 4, experiments are conducted to show the efficiency
and effectiveness of our approach. Section 5 concludes the paper by summarizing
the main results of our work and discussing some perspectives.

2 Background
In this section, we review the technologies and the concepts that we use in our
methodology.
Advances in parallel and distributed computing have opened new doors for
many researchers who do not have access to high performance computers (HPC). The
Apache Hadoop software library [18–20] is an open source framework written in
Java. Hadoop, an application of parallel and distributed computing, allows
running simple programming models on large data sets across the nodes of a cluster.
The idea behind Hadoop is to store and process big data on clusters of commodity
hardware instead of expensive high performance computers, which
are not available to everybody.
Hadoop handles any type of data, structured or unstructured: text files, log
files, images, audio files, communications records, etc. A Hadoop cluster has a
single Master and several Slave nodes. It can run as a single-node cluster or as a
multi-node cluster with thousands of nodes. The Hadoop core has two components:
the Hadoop Distributed File System (HDFS) and MapReduce.

2.1 Hadoop Distributed File System (HDFS)


HDFS is the storage part of Hadoop. It is designed to store and support
high-throughput access to very large data sets across a multi-node cluster [18–21]. HDFS
has three main components:

– NameNode: It is the Master of the filesystem. It is responsible for managing
the blocks in the DataNodes and maintains the metadata and indexes of the
blocks, but not the data itself.

– DataNodes: They are the workhorses of the filesystem. Data are broken
down into block-sized chunks (64 MB by default), which are stored as
independent units on the DataNodes.

– Secondary NameNode: It keeps a copy of the merged namespace image,
which can be used in case of a NameNode failure.
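The block arithmetic can be made concrete with a small Python sketch. It only models the sizes involved and does not call any Hadoop API; the function names are hypothetical, and it assumes that the last block of a file holds only the remaining bytes rather than a full 64 MB.

```python
import math

BLOCK_SIZE_MB = 64  # default block size stated in the text


def hdfs_block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed to store a file of the given size."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / block_size_mb)


def split_into_blocks(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> list:
    """Sizes of the individual blocks; only the last one may be smaller."""
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        sizes.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return sizes


# A 200 MB file of short reads needs three full blocks and one 8 MB block.
blocks = split_into_blocks(200)
```

The NameNode would keep one metadata entry per block in such a list, while the block contents themselves would live on the DataNodes.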

2.2 MapReduce
MapReduce is a programming model for data processing. It works by breaking
the processing into two phases: the map phase and the reduce phase [18, 19]. The
two main components of MapReduce are:

– JobTracker: As the Master of the system, it is responsible for managing the
map and reduce tasks.

– TaskTracker: As the slave, it receives the map and reduce tasks from the
JobTracker and returns the results to the JobTracker after execution.

Hadoop is highly fault tolerant. In order to protect against failures during processing,
HDFS stores multiple copies of each data block, 3 copies by default.
The NameNode can detect any failure in DataNodes or blocks, and the JobTracker
can detect any failure of TaskTrackers and will replace them.
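The two phases of the programming model can be sketched in plain Python. This is a single-process illustration only: it does not use Hadoop, and the JobTracker, TaskTrackers and fault tolerance are not modelled. The function names and the nucleotide-counting task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(read: str):
    """Map: emit a (key, 1) pair for every nucleotide in one short read."""
    for base in read:
        yield base, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def run_job(reads):
    """Run map, shuffle and reduce sequentially over a list of reads."""
    pairs = [pair for read in reads for pair in map_phase(read)]
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


result = run_job(["ACGT", "AACC", "GGTA"])
```

In a real cluster each read (or block of reads) would be mapped on a different node, and only the grouped intermediate pairs would travel to the reducers.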

2.3 NoSQL
"NoSQL" stands for Not Only SQL. The term "NoSQL" was used by Carlo
Strozzi for the first time in 1998 [22]. It refers to non-relational databases [27]. One
of the strengths of NoSQL is its ability to handle analytics on big data
sets on parallel and distributed platforms like Hadoop on commodity hardware.
Hive and Hbase are NoSQL applications built on top of the Apache Hadoop file
system. NoSQL databases can handle unstructured data such as text files, log
files, email, social media and multimedia. Horizontal scaling is one of the most
important features of NoSQL databases; it allows us to add more nodes to our
distributed system, whereas vertical scaling only allows us to increase the power of an
existing machine [23, 24].

2.4 Hive
Hive [19], [25] is a data warehousing infrastructure on top of Hadoop and HDFS.
HiveQL, which is a SQL-like language, simplifies querying of large unstructured
datasets in distributed storage. Hive is designed for data that are written once and read
many times. Real-time queries and row-level updates are not possible. Hive is easy
to use for anybody who is familiar with SQL queries. The Facebook Data
Infrastructure Team started to create Hive in January 2007 to bring the familiar
concepts of tables, columns, partitions and a subset of SQL to the unstructured
world of Hadoop, and it was open sourced in August 2008 [26]. Hive supports
bitmap indexes from version 0.8.

2.5 Bitmap Index


A bitmap index [28, 29] is an efficient way to speed up queries and improve
performance in data warehouse environments that contain tables with
low-cardinality columns. In the example given in Table 1, we index the values of the
column Grade, which has low cardinality. In this case our index has the same
number of rows as the table, and its number of columns is equal to the number of distinct values
in column Grade. In Table 1, the cardinality of the column Grade is 4 because it
has 4 different values.

Table 1. An example of a bitmap index defined on the Grade column

RID  Name    Nationality  Grade      RID  A  B  C  D
1    John    FRANCE       B          1    0  1  0  0
2    Sara    USA          D          2    0  0  0  1
3    Piter   RUSSIA       C          3    0  0  1  0
4    David   ENGLAND      A          4    1  0  0  0
5    Tania   GERMANY      B          5    0  1  0  0
6    Daniel  POLAND       A          6    1  0  0  0
7    Tom     CANADA       C          7    0  0  1  0
8    Robert  ITALY        C          8    0  0  1  0
9    Jain    FRANCE       D          9    0  0  0  1
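The index of Table 1 can be reproduced with a short Python sketch. It is an in-memory illustration with hypothetical function names, not the Hive implementation: each distinct Grade value gets one bit vector, and a query over several values becomes a bitwise OR of the corresponding vectors.

```python
def build_bitmap_index(values):
    """Map each distinct value to a list of 0/1 bits, one bit per row."""
    index = {value: [0] * len(values) for value in sorted(set(values))}
    for row, value in enumerate(values):
        index[value][row] = 1
    return index


def rows_matching(index, *wanted):
    """Return the 1-based RIDs whose value is any of `wanted`, using bitwise OR."""
    n_rows = len(next(iter(index.values())))
    combined = [0] * n_rows
    for value in wanted:
        combined = [a | b for a, b in zip(combined, index[value])]
    return [rid for rid, bit in enumerate(combined, start=1) if bit]


# Grade column of Table 1, rows 1 to 9.
grades = ["B", "D", "C", "A", "B", "A", "C", "C", "D"]
index = build_bitmap_index(grades)

# Which rows have grade A or B?
a_or_b = rows_matching(index, "A", "B")
```

Because the Grade column has only four distinct values, the whole index is four short bit vectors, which is why low-cardinality columns suit this technique.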

2.6 Hbase
Hbase [27] is a NoSQL database. It is an open-source, distributed,
column-oriented and scalable database built on top of the Hadoop file
system. It is designed for random, real-time read/write access to very large tables
with billions of rows and millions of columns on commodity hardware.

3 Our Methodology
We downloaded all complete bacterial genomes from the National Center
for Biotechnology Information (NCBI) database [30]. The total number of
genomes was 2773 bacterial species and subspecies at the time (16.01.2014).

3.1 Insignia
Insignia is a pipeline for generating unique DNA signatures; it is also a database
and web application for obtaining DNA signatures. It contains signatures for 11274
viruses/phages and 2653 non-viruses, with lengths between 18 and 500 bp.
Insignia detects signatures for designing primers in Polymerase Chain Reaction
(PCR) and probes in micro-array technologies. The signatures can also be used
for real-time identification of species in microbial and viral assays [15, 16], [31].
We downloaded DNA signatures for two groups of 50 bacteria from the
Insignia database. As we are in the testing phase, we downloaded only the
signatures with a length of 18 bp. As an example, Table 2 shows the head of the
Acholeplasma laidlawii DNA signatures table.

Table 2. A part of Acholeplasma laidlawii’s DNA signatures table of unique 18-mers,
downloaded from the Insignia database

Insignia V0.7
Signatures calculated: Thu Mar 6 2014 10:58:06
Reference Organism:
Acholeplasma laidlawii PG-8A
Target Organism(s):
Signatures:
Index  Start    stop     sequence
63965  451703   451720   ACATAAGCAGGTGCGGAA
63966  670606   670623   GATACCAATACCGCAGAT
63967  692909   692926   CCCATTCAACTTCGATCA
63968  530281   530298   ATCAACGCTAGATGAGCA
63969  268209   268226   ATGGAGGAGTCTGGATAC
63970  69763    69780    ACAGCAACAGCGTATATC
63971  357337   357354   GTGTTAGCGTTAAGTCTG
63972  1001550  1001567  TAGCCTCTTTAAGCAGGT
63973  1366201  1366218  ATGATGCAAGTGGCATGG
63974  1141698  1141715  TGCAACGGATGCATCAAG
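A signature file of this shape can be reduced to its sequences with a few lines of Python. The sketch below is an assumption about the layout, based only on the excerpt in Table 2: whitespace-separated rows of index, start, stop and sequence, with start and stop read as inclusive coordinates. The function name `parse_signature_rows` is hypothetical.

```python
def parse_signature_rows(lines):
    """Extract the signature sequences from 'index start stop sequence' rows.

    Header and metadata lines (anything without three leading integers)
    are skipped.
    """
    signatures = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or not all(f.isdigit() for f in fields[:3]):
            continue
        start, stop, sequence = int(fields[1]), int(fields[2]), fields[3]
        # Inclusive coordinates: an 18-mer spans stop - start + 1 = 18 bases.
        if stop - start + 1 != len(sequence):
            raise ValueError(f"coordinates do not match sequence length: {line!r}")
        signatures.append(sequence)
    return signatures


sample = """Insignia V0.7
Reference Organism:
Acholeplasma laidlawii PG-8A
Index Start stop sequence
63965 451703 451720 ACATAAGCAGGTGCGGAA
63966 670606 670623 GATACCAATACCGCAGAT
""".splitlines()

signatures = parse_signature_rows(sample)
```

Keeping only the sequence column discards the metadata that the index construction does not need.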

3.2 Metasim
Metasim is a sequencing simulator for genomics and Metagenomics
studies. It can be a great help in developing and improving Metagenomics tools and
in planning Metagenomics projects [32, 33]. Metasim can simulate the short
reads of Roche's 454 pyrosequencing, Sanger sequencing and Empirical
sequencing technology. In this paper, we use the Roche 454 pyrosequencing simulation.
The output of Metasim is a compressed file containing the short reads of a
bacterial chromosome or one of its plasmids, together with their information.

>r16.1|SOURCES={GI=11497281,bw,1947919816}|ERRORS={8 1:C,46:
,135 1:T,160 1:A,190 1:G}|SOURCE 1="Borrelia burgdorferi B31 plasmid cp32-8"
(44840ff90be8dcf7b704d6908ca095d559d2949e)
TTTAGGATTCGTACCCGTTTTCTTCTAATTTTTTCCTAGTGTTGTATGAATTT
CTTTTAATTTTTTTTGTTTTTCTTTCATGCAAGATTTTTTTATATTGAATTTT
TTTATTAGGGCAATTTCATTTTGTTTTAAGTATATTTATTGCCTCAATCTTAG
TATACTTTATCAATATTTAAATACAAAATAGAAAGGAGCTTCTTCCGTTTTAA
AGTTACAATTATTGAAATAATTTCTTAGTTGATATTTTTCTATTTCTTTAATC
TTTCTTTCTTCTTTTATATTATTTTTATTA

Fig. 2. An example of Metasim reads

We chose 100 bacterial genomes from NCBI data set for simulating the short
reads. The first group of 50 bacteria from Insignia database are common in 100
chosen bacterial genomes and the other group is from some other bacteria apart
from these 100.
8 R. Karimi et al.

Before any implementation, some pre-processing is needed. We concatenate
the short reads from all bacterial chromosomes and plasmids into one file, remove
the line breaks inside each short read, and keep every read on a single line.
From the signature files we keep only the signatures of each bacterium, stored as a single file.
We remove all extra information in order to reduce the data size and
shorten the execution time. The pre-processing is done with bash scripts
in Linux.
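The unwrapping step described above can be sketched as follows. This is a minimal Python illustration and not the authors' bash script; the function name `unwrap_reads` is hypothetical, and the sketch assumes FASTA-like input in which each read starts with a single header line beginning with `>`.

```python
def unwrap_reads(lines):
    """Join the wrapped sequence lines of each read into one line.

    Assumption: every read starts with one header line beginning with '>'
    and is followed by one or more sequence lines. Header lines are
    dropped because only the sequence is needed for signature matching.
    """
    reads, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines between reads
        if line.startswith(">"):
            if current:  # close the previous read
                reads.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:  # close the last read
        reads.append("".join(current))
    return reads
```

Each element of the returned list is one read on a single line, which is the layout the later index-building step expects.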

3.3 The Use of the Bitmap Index


Bitmap index techniques are used to create the index table by searching for
the signatures in the short reads. A '1' means that the signature occurs
in the short read and a '0' means that it does not. This step is implemented in Java.
Faster programming languages exist for this purpose, but in
future work we aim to implement this part with MapReduce and Hadoop,
which are more compatible with Java.
As shown in Table 3, the index table can be created in two ways. The
first is to keep every single signature as a column and store '0' or '1' depending
on whether this signature occurs in each short read. Given the
number of signatures and reads, this requires a huge amount of storage.

Table 3. An example for our index tables; each column of these tables is kept as a
single file

One column per bacterium:

RID  Reads  b1  b2  b3  b4  b5
1    R1     0   0   0   1   0
2    R2     1   0   0   0   0
3    R3     0   0   0   1   0
4    R4     0   0   0   0   0
5    R5     0   0   1   0   0
6    R6     0   0   0   0   1
7    R7     1   0   0   0   0
8    R8     0   0   0   0   0
9    R9     0   0   0   0   0
10   R10    1   0   0   0   0

One column per signature (signatures s1 to s7 of bacterium b1):

RID  Reads  s1  s2  s3  s4  s5  s6  s7
1    R1     0   0   0   0   0   0   0
2    R2     0   0   0   0   0   0   1
3    R3     0   0   0   0   0   0   0
4    R4     0   0   0   0   0   0   0
5    R5     0   0   0   0   0   0   0
6    R6     0   0   0   0   0   0   0
7    R7     0   0   0   1   0   0   0
8    R8     0   0   0   0   0   0   0
9    R9     0   0   0   0   0   0   0
10   R10    0   0   1   0   0   0   0

Another way is to keep every bacterium as a column. We store '1' if any signature
of the bacterium occurs in a short read and '0' if not. In this case the table is much
smaller. The number of columns equals the number of bacteria plus two more
columns, one for row identification and one for the short reads. The number
of rows equals the number of short reads.
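The per-bacterium layout can be illustrated with a short Python sketch. It is not the authors' Java program; the function name `bacterium_bitmap` and the labels `b1` and `b2` are hypothetical, and the sketch assumes that a signature "occurs" in a read when it appears as an exact substring on the given strand.

```python
def bacterium_bitmap(reads, signatures_by_bacterium):
    """Return one 0/1 column per bacterium.

    A cell is 1 if any signature of that bacterium occurs in the read
    (assumption: exact substring match, forward strand only).
    """
    columns = {}
    for name, signatures in signatures_by_bacterium.items():
        columns[name] = [
            1 if any(sig in read for sig in signatures) else 0
            for read in reads
        ]
    return columns


# Toy data: the first read embeds an 18-mer taken from the Insignia listing.
reads = ["TTACATAAGCAGGTGCGGAATT", "GGGGGGGG"]
signatures = {"b1": ["ACATAAGCAGGTGCGGAA"], "b2": ["GATACCAATACCGCAGAT"]}
print(bacterium_bitmap(reads, signatures))  # {'b1': [1, 0], 'b2': [0, 0]}
```

Each list in the result corresponds to one index file, with one value per short read.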
We can use the Linux paste command to merge all the files into a
single file. As an example, Table 4 is built from six files: one file contains the reads
and their identification numbers, and the other five contain the '0' and '1' values for five
bacteria.
BINOS4DNA: Bitmap Indexes and NoSQL for Identifying Species 9

Table 4. The newFile.txt format to create a single file

1 R1 0 0 0 1 0
2 R2 1 0 0 0 0
3 R3 0 0 0 1 0
4 R4 0 0 0 0 0
5 R5 0 0 1 0 0
6 R6 0 0 0 0 1
7 R7 1 0 0 0 0
8 R8 0 0 0 0 0
9 R9 0 0 0 0 0
10 R10 1 0 0 0 0
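For readers without a shell at hand, the column-wise merge that produces newFile.txt can be sketched in Python. The helper `paste_columns` is a hypothetical name; unlike the real `paste` command, this sketch rejects columns of unequal length instead of padding them.

```python
def paste_columns(*columns, sep="\t"):
    """Merge equally long columns line by line, similar to `paste`."""
    if len({len(col) for col in columns}) > 1:
        raise ValueError("all columns must have the same number of rows")
    return [sep.join(str(value) for value in row) for row in zip(*columns)]


rids = [1, 2, 3]
reads = ["R1", "R2", "R3"]
b1 = [0, 1, 0]
for line in paste_columns(rids, reads, b1):
    print(line)
```

The tab separator matches the field delimiter used when the table is declared in Hive.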

Then, we create our table in Hive according to the structure of newFile.txt.

hive> CREATE TABLE testTable1 ( rid INT, reads STRING, b1 INT, b2 INT
, b3 INT, b4 INT, b5 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
'\t' STORED AS TEXTFILE;

The next step is to load the data into the Hive table:

hive> LOAD DATA LOCAL INPATH "newFile.txt" INTO TABLE testTable1;

Finally, with Hive queries we can find the matched bacteria and short reads:

hive> INSERT OVERWRITE LOCAL DIRECTORY '/path to local dir for output'
select * from testTable1 where b1=1 group by rid;

Alternatively, we can use a faster query:

hive> INSERT OVERWRITE LOCAL DIRECTORY '/path to local dir for output'
select testTable1.rid from testTable1 where b1=1;

The output file contains the Rowid numbers of the matching short reads. As an example,
for bacterium b1 in Table 4 we obtain 2, 7 and 10, which means that signatures of
bacterium b1 occur in these three short reads. We therefore conclude that bacterium b1
is present in the metagenome sample.
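The effect of the query on the data of Table 4 can be reproduced with a few lines of Python. This is only an in-memory illustration of the selection logic, not a replacement for Hive; the names `TABLE_4` and `matching_row_ids` are hypothetical.

```python
# Rows of Table 4: (rid, read, b1, b2, b3, b4, b5)
TABLE_4 = [
    (1, "R1", 0, 0, 0, 1, 0),
    (2, "R2", 1, 0, 0, 0, 0),
    (3, "R3", 0, 0, 0, 1, 0),
    (4, "R4", 0, 0, 0, 0, 0),
    (5, "R5", 0, 0, 1, 0, 0),
    (6, "R6", 0, 0, 0, 0, 1),
    (7, "R7", 1, 0, 0, 0, 0),
    (8, "R8", 0, 0, 0, 0, 0),
    (9, "R9", 0, 0, 0, 0, 0),
    (10, "R10", 1, 0, 0, 0, 0),
]


def matching_row_ids(rows, bacterium_index):
    """Row IDs whose bitmap value for bacterium b<bacterium_index> is 1."""
    column = 1 + bacterium_index  # positions 0 and 1 hold rid and read
    return [row[0] for row in rows if row[column] == 1]


print(matching_row_ids(TABLE_4, 1))  # [2, 7, 10]
```

An empty result, as for b2 in this table, would indicate that no signature of that bacterium was found in the reads.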
In this approach, we have to repeat the query for every bacterium one by one, or
write a long query and a long command to create the table. Hence, the larger
the number of bacteria, the longer the implementation takes.
A better solution avoids repeating the queries or writing long
commands. We can append all bacterial files of '0' and '1' values one after
the other with the cat command, creating a single column in one file.
For a large number of bacteria, we can use a bash script that appends the
files in increasing order, so that as many bacteria as needed are added quickly.
In this method, we also need to repeat the short reads in a single column as many times
as there are bacteria. For instance, if we have 500,000 short reads and 1000
bacteria, then we repeat the short reads in one column 1000 times with the
cat command, and the total number of rows will be 500,000,000.
Next, we create a table with 3 columns (rid INT, reads STRING, b
INT) and run the query just once. The results are in one column, and we can
extract the information from the Rowid numbers. This leads to a larger file size
but a faster implementation. After getting the results, we can delete these large
tables.
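The single-column layout can be sketched as follows. The function `stack_columns` is a hypothetical in-memory stand-in for the cat and paste steps; it assumes that every bacterium column has exactly one value per read.

```python
def stack_columns(reads, bitmap_columns):
    """Build (rid, read, bit) rows by appending bacterium columns in order.

    The reads are repeated once per bacterium, and the row ID keeps
    counting across bacteria, as in the three-column table.
    """
    rows = []
    rid = 1
    for column in bitmap_columns:
        if len(column) != len(reads):
            raise ValueError("each column needs one value per read")
        for read, bit in zip(reads, column):
            rows.append((rid, read, bit))
            rid += 1
    return rows


rows = stack_columns(["R1", "R2"], [[1, 0], [0, 1]])
print([rid for rid, _, bit in rows if bit == 1])  # [1, 4]
```

With 500,000 reads and 1000 bacteria this layout has 500,000 × 1000 = 500,000,000 rows, which is why the file grows while the number of queries drops to one.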
We created testtable1 with 52 columns (rid INT, reads STRING, b1 INT,...,
b50 INT) and testtable2 with 3 columns (rid INT, reads STRING, b INT)
in Hive. We have short reads of 100 bacteria and two groups of 50 bacterial
signatures.
As we are still in the testing phase and use Java without Hadoop
and MapReduce to search for signatures in the short reads and create our index
files (tables), we randomly chose only 10% of the short reads.
In future work we will implement this step with MapReduce in our Java program
and run it on a multi-node Hadoop cluster in order to speed it up.
We used the awk command to add a row identification (Rowid) to the file
containing the short reads (132,705 reads).
awk 'BEGIN{i=1} {if($0 !~ /^$/) {printf ("%d\t%s \n",i,$0); i++}
else { print $0} }' reads.txt >> readsid.txt
We merged this file and all 50 index files with the paste command into
a single file and loaded this file into testtable1 in Hive. Then, we used queries
to search our table. We repeated this process for both groups of 50 bacteria.
For the second table (testtable2) we appended all 50 bacteria in order as
one column in a single file and also repeated the short reads 50 times in a single
column, both with the cat command. Then, we added the Rowid to the short
reads (6,635,250 rows), pasted these three columns into one file and loaded it into
testtable2. In this case, we need only one query to get the results. This is a
good test of the speed and efficiency of Hive when searching millions of rows with
Bitmap Index techniques.
Hive can also be integrated with HBase. This feature allows Hive
QL statements to access HBase tables for both read (SELECT) and write (INSERT).
It is even possible to combine access to HBase tables with native Hive
tables via joins and unions [34]. HBase supports real-time reads and writes.
These features would help us update the tables and obtain a faster implementation.

4 Experimental Study

All experiments were run on an Intel dual-core CPU with 4 GB of RAM,
Ubuntu 13.10, single-node-cluster Hadoop-1.2.1 and Hive-0.11.0. Table 5 gives the
elapsed time for our first implementation on testtable1, with 52 columns,
132,705 rows and a loaded file size of 44.6 MB. We repeated
the query for all 50 columns. We did not count the time needed for changing and
re-issuing the queries.

Table 5. Time taken for running the Hive query on testtable1 columns. Total time is
1065.927 Sec.

b1: 25.543 Sec. b11: 21.356 Sec. b21: 22.065 Sec. b31: 21.089 Sec. b41: 21.081 Sec.
b2: 22.236 Sec. b12: 21.120 Sec. b22: 21.116 Sec. b32: 22.013 Sec. b42: 21.016 Sec.
b3: 22.224 Sec. b13: 21.123 Sec. b23: 21.017 Sec. b33: 21.074 Sec. b43: 22.062 Sec.
b4: 21.187 Sec. b14: 22.065 Sec. b24: 20.977 Sec. b34: 20.991 Sec. b44: 21.000 Sec.
b5: 22.322 Sec. b15: 21.090 Sec. b25: 21.062 Sec. b35: 21.277 Sec. b45: 21.036 Sec.
b6: 21.167 Sec. b16: 21.083 Sec. b26: 21.057 Sec. b36: 20.010 Sec. b46: 21.009 Sec.
b7: 20.049 Sec. b17: 21.048 Sec. b27: 21.188 Sec. b37: 20.986 Sec. b47: 21.002 Sec.
b8: 20.048 Sec. b18: 21.123 Sec. b28: 21.108 Sec. b38: 22.063 Sec. b48: 20.997 Sec.
b9: 21.091 Sec. b19: 21.003 Sec. b29: 20.991 Sec. b39: 21.110 Sec. b49: 21.029 Sec.
b10: 22.373 Sec. b20: 21.072 Sec. b30: 21.081 Sec. b40: 20.952 Sec. b50: 22.136 Sec.

This implementation was for the first group of 50 bacteria, which are among
the 100 bacterial samples. As expected, we found short reads containing
signatures for every bacterium. The number of matching short reads ranges
from 1 for b16 to 812 for b4.
For the second group of 50 bacteria, which are not among the 100
samples, we found no short reads containing the signatures, as expected. The average
query time was almost the same as for the first group.
Computational times for the second implementation on testtable2, with 3
columns, 6,635,250 rows and a loaded file size of 1.6 GB, are:

File Size: 1.6 GB
Loading data to testtable2
Time taken: 45.588 seconds

Time taken with the SELECT * and GROUP BY query:

Total MapReduce CPU Time Spent: 54 seconds 640 msec
Time taken: 59.901 seconds

Time taken with the SELECT file.rid query, which is faster:

Total MapReduce CPU Time Spent: 43 seconds 630 msec
Time taken: 44.452 seconds

The result of this implementation is a column containing numbers between 1 and
6,635,250, which are the Rowids of the matching short reads. We repeated the 132,705 reads
50 times, so numbers from 1 to 132,705 belong to b1, numbers from 132,706 to 2 × 132,705
belong to b2, and so on.
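The mapping from a Rowid back to a bacterium and a read follows directly from this numbering and can be written as a small helper. The function name `decode_rowid` is hypothetical; the block size of 132,705 reads is the value reported above.

```python
READS_PER_BACTERIUM = 132_705


def decode_rowid(rowid, reads_per_bacterium=READS_PER_BACTERIUM):
    """Map a 1-based Rowid to (bacterium number, read number), both 1-based."""
    if rowid < 1:
        raise ValueError("Rowid numbers start at 1")
    bacterium, read = divmod(rowid - 1, reads_per_bacterium)
    return bacterium + 1, read + 1


print(decode_rowid(132_705))    # (1, 132705): last read of b1
print(decode_rowid(132_706))    # (2, 1): first read of b2
print(decode_rowid(6_635_250))  # (50, 132705): last read of b50
```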
If we compare the time for executing the query on a column of testtable1
with 132,705 rows and on the column of testtable2 with 6,635,250 rows, the difference
is small even though the second table has 50 times more rows. Namely, the average
computation time is 21.319 Sec in the first case and 59.901 Sec in the second
with the same query.
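The comparison can be made explicit with a short calculation based only on the timings reported in Table 5 and above. The function name `timing_summary` and the dictionary keys are hypothetical.

```python
def timing_summary(total_per_column_s, columns, single_query_s):
    """Relate 50 repeated column queries to the single stacked-table query."""
    average = total_per_column_s / columns
    return {
        "average_per_column_s": average,
        "speedup_vs_repeated_queries": total_per_column_s / single_query_s,
        "single_query_vs_one_column": single_query_s / average,
    }


# 1065.927 s in total for 50 columns of testtable1, 59.901 s for testtable2
summary = timing_summary(1065.927, 50, 59.901)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Under these figures, the single query on testtable2 takes about 2.8 times as long as one column query on testtable1 while scanning 50 times more rows, and it is roughly 18 times faster than running all 50 column queries in sequence.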
We should consider that we are running Hadoop on a single node with a dual-core
CPU and 4 GB of RAM. This implementation shows that Bitmap Index
techniques are very efficient at speeding up Hive queries, and that Hive itself is
powerful enough to search very big tables with millions or billions of rows
or columns on commodity hardware. Moreover, with this method we showed
that it is possible to identify species with DNA signatures from metagenomics
samples without assembly or alignment and with any size of data.
Alongside other techniques, this method is also useful for aligning metagenomics
short reads, by finding the position of signatures and their matched short reads
in the existing genome. It can also be used to check the
accuracy of signatures.

5 Conclusion
In this paper, we show the contributions of High Performance Computing and
optimization techniques from databases to speeding up the search and matching of
a large number of DNA signatures in the short reads of hundreds (or thousands) of
different microorganisms deployed in Hive. We adapt the concept of bitmap
indexes, routinely used for indexing attributes of large database tables with low
cardinality (such as gender). This preliminary work gives encouraging results and
opens new research perspectives for exploiting optimization techniques from
databases and High Performance Computing in Bioinformatics. We are currently
testing our proposal on a multi-node Hadoop cluster to speed up the process.

References
1. Tiedje, J.M.: Microbial diversity: of value to whom. ASM News 60(10), 524–525
(1994)
2. Allsopp, D., Colwell, R.R., Hawksworth, D.L., et al.: Microbial Diversity and
Ecosystem Function: Proceedings of the IUBS/IUMS Workshop held at Egham,
UK, August 10-13. CAB INTERNATIONAL (1995)
3. Kaeberlein, T., Lewis, K., Epstein, S.S.: Isolating “uncultivable” microorganisms
in pure culture in a simulated natural environment. Science 296(5570), 1127–1129
(2002)
4. Trapnell, C., Salzberg, S.L.: How to map billions of short reads onto genomes.
Nature Biotechnology 27(5), 455 (2009)
5. Thomas, T., Gilbert, J., Meyer, F.: Metagenomics-a guide from sampling to data
analysis. Microb. Inform. Exp. 2(3) (2012)
6. Haubold, B., Reed, F.A., Pfaffelhuber, P.: Alignment-free estimation of nucleotide
diversity. Bioinformatics 27(4), 449–455 (2011)
7. Wooley, J.C., Godzik, A., Friedberg, I.: A primer on metagenomics. PLoS
Computational Biology 6(2), e1000667 (2010)
8. Treangen, T.J., Salzberg, S.L.: Repetitive DNA and next-generation sequencing:
computational challenges and solutions. Nature Reviews Genetics 13(1), 36–46
(2012)
9. Otu, H.H., Sayood, K.: A new sequence distance measure for phylogenetic tree
construction. Bioinformatics 19(16), 2122–2130 (2003)

10. Li, C., Yang, Y., Jia, M., Zhang, Y., Yu, X., Wang, C.: Phylogenetic analysis
of DNA sequences based on k-word and rough set theory. Physica A: Statistical
Mechanics and its Applications 398, 162–171 (2014)
11. Nagar, A., Hahsler, M.: Genomic sequence fragment identification using
quasi-alignment. In: Proceedings of the International Conference on Bioinformatics,
Computational Biology and Biomedical Informatics, p. 359. ACM (2013)
12. Vinga, S., Almeida, J.: Alignment-free sequence comparison–a review.
Bioinformatics 19(4), 513–523 (2003)
13. Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M., Sun, F.: Alignment-free sequence
comparison based on next generation sequencing reads: Extended abstract. In:
Chor, B. (ed.) RECOMB 2012. LNCS, vol. 7262, pp. 272–285. Springer, Heidelberg
(2012)
14. Srinivasan, S.M., Guda, C.: MetaID: A novel method for identification and
quantification of metagenomic samples. BMC Genomics 14(8), 1–12 (2013)
15. Phillippy, A.M., Mason, J.A., Ayanbule, K., Sommer, D.D., Taviani, E., Huq, A.,
... Salzberg, S.L.: Comprehensive DNA signature discovery and validation. PLoS
Computational Biology 3(5), e98 (2007)
16. Phillippy, A.M., Ayanbule, K., Edwards, N.J., Salzberg, S.L.: Insignia: a DNA
signature search web server for diagnostic assay development. Nucleic Acids
Research 37(suppl. 2), W229–W234 (2009)
17. Satya, R.V., Kumar, K., Zavaljevski, N., Reifman, J.: A high-throughput pipeline
for the design of real-time pcr signatures. BMC Bioinformatics 11(1), 340 (2010)
18. Apache Hadoop available at http://hadoop.apache.org/
19. White, T.: Hadoop: The definitive guide. O’Reilly Media, Inc. (2012)
20. Cloudera Frequently Asked Questions (FAQs),
http://www.cloudera.com/content/cloudera/en/why-cloudera/
hadoop-and-big-data.html
21. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file
system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies
(MSST), pp. 1–10. IEEE (2010)
22. NoSQL Relational Database Management System homepage,
http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/Home%20Page
23. Michael, M., Moreira, J.E., Shiloach, D., Wisniewski, R.W.: Scale-up x scale-out:
A case study using nutch/lucene. In: IEEE International Parallel and Distributed
Processing Symposium, IPDPS 2007, pp. 1–8. IEEE (2007)
24. Bondi, A.B.: Characteristics of scalability and their impact on performance. In:
Proceedings of the 2nd International Workshop on Software and Performance,
pp. 195–203. ACM (2000)
25. Apache Hive available at http://hive.apache.org
26. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S.,
Liu, H., Murthy, R.: Hive-a petabyte scale data warehouse using hadoop. In: 2010
IEEE 26th International Conference on Data Engineering (ICDE), pp. 996–1005.
IEEE (2010)
27. Apache HBase available at http://hbase.apache.org
28. Karande, N.D.: Efficient indexing technique using bitmap indices for data
warehouses. International Journal 1(4) (2013)
29. Bellatreche, L., Missaoui, R., Necir, H., Drias, H.: A data mining approach for
selecting bitmap join indices. JCSE 1(2), 177–194 (2007)
30. National Center for Biotechnology Information (NCBI),
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/
31. Insignia Homepage, http://insignia.cbcb.umd.edu/index.php
32. Metasim Homepage, http://ab.inf.uni-tuebingen.de/software/metasim/
33. Richter, D.C., Ott, F., Auch, A.F., Schmid, R., Huson, D.H.: MetaSim: a sequencing
simulator for genomics and metagenomics. PLoS One 3(10), e3373 (2008)
34. Hbase and Hive integration,
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
Centroid Clustering of Cellular Lineage Trees

Valeriy Khakhutskyy1,*, Michael Schwarzfischer2,*, Nina Hubig2,3,*,
Claudia Plant2,3, Carsten Marr2, Michael A. Rieger4,
Timm Schroeder5, and Fabian J. Theis2,6

1 Institute for Advanced Study, Technische Universität München,
Lichtenbergstrasse 2a, 85748 Garching, Germany
[email protected]
2 Institute of Computational Biology, Helmholtz Center Munich,
German Research Center for Environmental Health (GmbH),
Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
{schwarzfischer,nina.hubig,claudia.plant,carsten.marr,
fabian.theis}@helmholtz-muenchen.de
3 Department of Informatics, Technische Universität München,
Boltzmannstr. 3, 85748 Garching, Germany
4 LOEWE Center for Cell and Gene Therapy and
Department of Hematology/Oncology, University Hospital Frankfurt,
Theodor-Stern-Kai 7, 60590 Frankfurt (Main)
[email protected]
5 Department of Biosystems Science and Engineering, ETH Zurich,
Mattenstr. 26, 4058 Basel, Switzerland
[email protected]
6 Department of Mathematics, Technische Universität München,
Boltzmannstr. 3, 85748 Garching, Germany

Abstract. Trees representing hierarchical knowledge are prevalent in
biology and medicine. Some examples are phylogenetic trees, the
hierarchical structure of biological tissues and cell lines. The increasing
throughput of techniques generating such trees poses new challenges to
the analysis of tree ensembles. Some typical tasks include the determi-
nation of common patterns of lineage decisions in cellular differentiation
trees. Partitioning the dataset is crucial for further analysis of the cel-
lular genealogies. In this work, we develop a method to cluster labeled
binary tree structures. Furthermore, for every cluster our method selects
a centroid tree that captures the characteristic mitosis patterns of the
group. We evaluate this technique on synthetic data and apply it to ex-
perimental trees that embody the lineages of differentiating cells under
specific conditions over time. The results of the cell lineage trees are
thoroughly interpreted with expert domain knowledge.

Keywords: tree clustering, cell lineage tree, centroid tree.

1 Introduction
Cell lineage trees encode the cell division events over time and can be represented
as binary trees. These trees challenge current machine learning techniques to give
a broader view and a more accurate interpretation of the underlying cell development
processes. Our subject of interest is labeled lineage trees from cells of
the blood system as depicted in Figure 1. In this work, we use single-cell data
from time-lapse microscopy experiments encoded as trees with root nodes belonging
to blood progenitor cells that differentiate into more specialized cell types
(leaves). In particular, granulocyte-macrophage progenitor cells (GMPs) evolve
into mature macrophages (M) or granulocytes (G). Additionally, we measure a
fluorescence marker (LysM::GFP) that indicates whether a differentiation into
M or G has taken place [21]. However, this marker only indicates whether a cell has
lost its progenitor state and gives no information about its particular lineage.
Therefore, we aim to find differences in the lineage tree structures between the
two differentiation programs.

[Figure 1. Panel labels: GMP; Time-lapse microscopy & single-cell tracking;
Lineage trees; G; M. Legend: Progenitor cell, Differentiated cell.]

Fig. 1. Time-lapse microscopy and single-cell tracking of granulocyte-macrophage
progenitor cells (GMPs) differentiating into granulocytes (G) or
macrophages (M) results in a set of lineage trees [21]. The loss of the progenitor state
is monitored by the cellular expression of the LysM::GFP marker in the time-lapse movies,
which allows each cell (i.e. node) of a lineage tree to be labeled as a progenitor or
differentiated cell.

* These authors contributed equally.

M. Bursa et al. (Eds.): ITBAM 2014, LNCS 8649, pp. 15–29, 2014.
© Springer International Publishing Switzerland 2014
The differentiation process can be instructed by additional cytokines, leading
to almost exclusively differentiated cells of one lineage [21]. To determine a typical
lineage-specific tree, we analyze lineage trees instructed to one or the other
lineage and calculate tree distances based on different metrics. Next, we developed
a method to assign a representative tree to every condition. This enables
us to distinguish different cell types just by looking at their characteristic
representatives. Furthermore, we developed a method to cluster a set of lineage trees
based on k-medoid methods, which, unlike k-means, are more robust to the noise and
outliers that are common in real biological datasets like ours. With this technique
we partition the data into naturally evolving parts, allowing us to gain insights
into typical lineage tree structures of differentiating blood progenitor cells.
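To make the k-medoid idea concrete, the following Python sketch implements a generic alternating k-medoids scheme (assign each item to its nearest medoid, then re-select each cluster's medoid). It is an illustration under our own assumptions and not necessarily the algorithm developed in this paper; the function names are hypothetical, and plain numbers with an absolute-difference distance stand in for trees and a tree metric.

```python
def assign(items, medoids, distance):
    """Put every item into the cluster of its nearest medoid."""
    clusters = [[] for _ in medoids]
    for item in items:
        nearest = min(range(len(medoids)),
                      key=lambda i: distance(item, medoids[i]))
        clusters[nearest].append(item)
    return clusters


def k_medoids(items, distance, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(items, medoids, distance)
        updated = [
            # the new medoid minimizes the summed distance within its cluster
            min(cluster, key=lambda c: sum(distance(c, o) for o in cluster))
            if cluster else medoid
            for cluster, medoid in zip(clusters, medoids)
        ]
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(items, medoids, distance)


data = [1, 2, 3, 50, 51, 52, 400]  # 400 plays the role of an outlier
medoids, clusters = k_medoids(data, lambda a, b: abs(a - b), [1, 2])
print(medoids, clusters)  # [2, 51] [[1, 2, 3], [50, 51, 52, 400]]
```

The outlier 400 ends up in the second cluster without pulling its representative away from 51, whereas the arithmetic mean of that cluster would be 138.25. Because only pairwise distances are needed, the same scheme applies to trees once a tree metric is chosen.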
In short, the contributions of this paper are as follows:
– Tree Clustering: We find similarities between a set of trees covering the whole
pedigree of a progenitor cell.
– Representative Centroid Trees: We are able to generate a set of fitting cen-
troid trees that represent the characteristics of the underlying clusters.
– Application and Interpretation of Cell Lineage Trees: We apply our clustering
algorithm to the cell division data and comprehensively analyze the
results with expert domain knowledge.
The remainder of this paper is organized as follows: We discuss the related
work in this research field in Section 2. Then we introduce the notation and def-
initions used throughout this paper in Section 3. Section 4 formally defines the
underlying mathematical problems and describes the algorithms. Section 5 fol-
lows with the core part of this work: the evaluation of the descriptive properties of
the algorithms on synthetic data and thorough examination and interpretation of
the results when applied to our real dataset. We conclude this work in Section 6.

2 Related Work
Trees play an important role in the scientific areas that use tree structures
to describe observations, e.g. computational biology, structured text databases,
natural language processing, web mining, image analysis and computer vision,
pattern recognition as well as compiler optimization [11,4,8]. In particular, the mining
of web data such as XML files [15,9] and decision tree clustering [19] are widely
discussed in the literature.
All clustering methods discussed in this paper require a notion of distance
between trees. Unfortunately, the scientific community does not agree on one established
method of defining a metric between trees. One commonly used method
is the Tree Edit Distance (TED) [29]. Similar to the Levenshtein edit distance, TED
is defined as the minimal number of operations needed to transform one tree
into another. But Arora et al. showed that for unordered labeled trees, as considered
in this paper, the calculation of TED is NP-hard, and even MAX SNP-hard
[2]. Moreover, applying the metrics for ordered labeled trees to unordered trees would
lead to a considerable loss of efficiency. Zhang [28] suggested using the constrained
TED (cTED) to calculate the metric for unordered trees. cTED is computed by a dynamic
programming method that solves a large optimization problem by breaking it
down into smaller sub-problems. Another suitable method to establish a metric
for the space of unordered labeled trees was suggested by Torsello et al. [25].
This method is based on the computation of a maximal similarity (MaxSimilarity)
common subtree between two trees. We will compare cTED and four
MaxSimilarity tree metrics in our evaluations.
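Since TED is introduced by analogy with the Levenshtein edit distance, a sketch of the string case may help to see how dynamic programming breaks the problem into sub-problems. The code below computes the classical Levenshtein distance on strings; it is not an implementation of TED or cTED, which operate on trees.

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Each table entry depends only on three neighbouring entries. Tree edit distances apply the same principle to subtrees instead of string prefixes.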
Tree clustering for shape recognition was intensively studied in the group
around Torsello and Hancock. In 2001 Luo et al. used an EM-like algorithm for
clustering 2D binary shapes based on the edit distances of their shock-trees from
the Hamilton-Jacobi skeleton [14]. Since then the group published a number of
methods for tree clustering focusing on pattern recognition of 2D binary shapes.
It was also suggested to cluster trees after embedding them into a so-called union
tree space [24] or into the Euclidean space [26].
Graph clustering has gained interest in the last decade in the machine learning
community. It is related to the problem discussed in this paper since trees can
be considered a special case of undirected acyclic labeled graphs. A centroid-based
k-means algorithm was suggested by Jain and Wysotzki [10]. Ferrer et al.
discussed central clustering using k-medoids and k-medians approaches [7]. Some
methods aim to embed graphs into a metric vector space, e.g. the spectral
embedding method suggested by Luo, Wilson, and Hancock [13]. These algorithms,
however, are not directly applicable to tree clustering problems as the resulting
mean or median graphs are not necessarily proper trees. Moreover, as graphs are
a more general data structure, algorithms for distance calculation on graphs often
require significantly higher computational costs than their counterparts on trees.
We developed a method for finding centroid trees in a set of unordered la-
beled trees that has an intuitive interpretation, does not rely on a vector space
embedding, and can be used with different similarity metrics as we will show in
the experimental section.
Finally, we would like to mention that the search for frequent common subtrees in
a tree database, as a method to obtain a condensed representation of patterns in
trees, has gained popularity in recent years [3,18,27]. These search algorithms are
tangential to our current research as they do not lead to a clustering. However,
given a group of trees, they could help to find a meaningful interpretation of
the results.

3 Notation and Definitions


In this section we introduce the notation used throughout this paper and
formally define the problem of finding a medoid in a set of unordered labeled
trees.
Let $\mathcal{T}$ be a metric space with a metric $d : \mathcal{T} \times \mathcal{T} \to \mathbb{R}$ and let
$\mathbf{T} = \{T_1, T_2, \ldots, T_n\}$ be a finite set of elements $T_i \in \mathcal{T}$, $i = 1, \ldots, n$. We call an
element $\hat{T} \in \mathcal{T}$ an $L_p$-centroid if it is a general Fréchet mean on the metric space
[1]:

$$\hat{T} = \arg\min_{T \in \mathcal{T}} \sum_{T_i \in \mathbf{T}} d(T, T_i)^p. \qquad (1)$$

We call an $L_p$-centroid a mean if $p = 2$ and we call it a median if $p = 1$. Note
that in this case the $L_p$-centroid need not be an element of the set $\mathbf{T}$.
An $L_p$-medoid is defined as the solution of problem (1) with the restriction
that the minimizer must be taken from the set $\mathbf{T}$ itself:

$$\hat{T} = \arg\min_{T \in \mathbf{T}} \sum_{T_i \in \mathbf{T}} d(T, T_i)^p. \qquad (2)$$

As before, we call the minimizer an $L_2$-medoid if $p = 2$
and an $L_1$-medoid if $p = 1$.
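Because the medoid is restricted to the finite set, problem (2) can be solved by direct enumeration. The following Python sketch shows this under the assumption that a distance function is available; the name `lp_medoid` is hypothetical, and numbers with an absolute-difference distance stand in for trees.

```python
def lp_medoid(items, distance, p=1):
    """Return the element of items minimizing the sum of d(T, T_i)**p."""
    if not items:
        raise ValueError("the set must not be empty")

    def cost(candidate):
        return sum(distance(candidate, other) ** p for other in items)

    return min(items, key=cost)  # ties are resolved in favour of the first item


values = [1, 2, 3, 100]


def absdiff(a, b):
    return abs(a - b)


print(lp_medoid(values, absdiff, p=1))  # 2
print(lp_medoid(values, absdiff, p=2))  # 3
```

The enumeration needs on the order of n² distance evaluations for n items. In this toy example the outlier 100 shifts the $L_2$-medoid further than the $L_1$-medoid.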
Now, we introduce the definition of a general tree and extend it to the kind
of trees we are interested in.
Definition 1 (tree). A general tree is a tuple $T = (V, E)$, where $V$ is a set of
nodes and $E$ is a set of directed edges between the nodes. A node $v$ has a child
$w$ if there is an edge $(v, w) \in E$. For any two nodes $v, w \in V$, $w$ is called a
descendant of $v$ if there is a path $(e_1, e_2, \ldots, e_n) \in E^n$ that starts at $v$ and ends
at $w$. The node $v$ is then called an ancestor of $w$. If $w$ is a descendant of $v$, there
is always a unique path connecting them. A node $r \in V$ is called the root of a
tree if it has no ancestors and all nodes in $V \setminus \{r\}$ are descendants of $r$.
In our discussion we focus on binary trees, which means that every node can
have at most two children. Unordered labeled trees – the set of trees our real
data corresponds to – are a special case of general trees:
Definition 2 (ordered and unordered trees). A tree $T = (V, E, \nu)$ is called
ordered if a function $\nu : V \to V \times V$ is defined that maps a node $u$ to a
tuple $(v, w)$ of its children. $T$ is called unordered if the mapping is defined as
$\nu : V \to \mathcal{P}_2(V)$, i.e. the order of the children $\{v, w\}$ is not fixed.
Definition 3 (labeled trees). A tree $T = (V, E, \nu, \sigma, \Sigma)$ is called labeled if
a function $\sigma : V \to \Sigma$ is defined that maps every node $v$ to an element of the
alphabet $\Sigma$.
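One possible in-memory representation of an unordered labeled binary tree is sketched below. The class `Node`, the helper names and the labels "P" (progenitor) and "D" (differentiated) are illustrative assumptions and are not taken from the paper. Sorting the children in a canonical form makes the comparison independent of child order, which is what distinguishes unordered from ordered trees.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                     # element of the alphabet
    children: list = field(default_factory=list)   # at most two for binary trees


def canonical(node):
    """Nested tuple in which the children are sorted, so their order is ignored."""
    return (node.label, tuple(sorted(canonical(child) for child in node.children)))


def same_unordered_tree(a, b):
    """True if a and b are equal as unordered labeled trees."""
    return canonical(a) == canonical(b)


left_first = Node("P", [Node("P"), Node("D")])
right_first = Node("P", [Node("D"), Node("P")])
print(same_unordered_tree(left_first, right_first))  # True
```

As ordered trees the two examples would differ, because the tuple of children is (P, D) in one and (D, P) in the other.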
Now let $\mathcal{T}$ be a space of unordered labeled trees. By equipping it with a metric,
we obtain a metric space. Thus, the definitions of mean, median and $L_p$-medoids
are directly applicable to our case.
Median and mean centroids are popular for problems in geodesic metric spaces
with well-studied geometries, e.g. some CAT(k) spaces. Otherwise, solving
problem (1) amounts to applying random search methods [22,20]
that have high computational costs and low rates of convergence. In clustering,
this can be avoided by using a medoid.
This motivates us to focus on $L_p$-medoids in this work. The results discussed
in Section 5 describe the $L_1$-medoid trees due to their robustness to outliers.
Therefore, from here on we will use the terms medoid tree and centroid tree
interchangeably if the context is clear.1

4 A Tree Clustering Algorithm for Cell Lineages


As explained in Section 3, centroid clustering algorithms require the definition of
a metric space, which is not trivial for a space of trees. Therefore, we start this
section with a brief review of the metrics considered in this paper. Afterwards,
we describe the underlying optimization problems and give details
of the clustering algorithm in use.

4.1 Tree Dissimilarity Metrics


Constrained tree edit distance mapping is defined by a triple $(M, T_1, T_2)$,
where $T_1$ and $T_2$ are two trees and $M$ is a set of ordered tuples $(v, w) \in V_1 \times V_2$
that satisfies the following conditions:
1. $M$ is an edit distance mapping.
2. For all $(v_1, w_1), (v_2, w_2), (v_3, w_3) \in M$, let $T_1[v] := \mathrm{lca}(T_1[v_1], T_1[v_2])$ and
$T_2[w] := \mathrm{lca}(T_2[w_1], T_2[w_2])$, where lca denotes the least common ancestor and $T[v]$
denotes the subtree of $T$ induced by a node $v$. Then $T_1[v]$ is a proper ancestor of
$T_1[v_3]$ iff $T_2[w]$ is a proper ancestor of $T_2[w_3]$.
The first condition means that $M$ should injectively map nodes of $T_1$ to nodes of
$T_2$ while maintaining the ancestor-descendant relationship between the mapped nodes.
The second condition ensures that two different subtrees of $T_1$ are
mapped to two different subtrees of $T_2$. This condition is sufficient and even
desirable for many different problems, in particular for the problem discussed
later in this paper, where nodes represent the phases of cell separation.
Computing cTED is done with a dynamic programming method that solves a large
optimization problem by breaking it down into smaller sub-problems [28].

1 Some literature calls this type of centroid “median trees” as opposed to “generalized
median trees”, which are median trees in our notation.
MaxSimilarity metrics are based on the computation of a maximal similarity
common subtree between two trees [25]. Two trees $T_1$ and $T_2$ are called isomorphic
if there is an isomorphism $\phi$ that maps the nodes of the tree $T_1$ one-to-one onto
the nodes of the tree $T_2$. For two trees $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ the bijection
$\phi : H_1 \to H_2$, with $H_1 \subseteq V_1$, $H_2 \subseteq V_2$, is called a subtree isomorphism iff:
1. $\forall u, v \in H_1$: $u$ adjacent to $v$ $\Leftrightarrow$ $\phi(u)$ adjacent to $\phi(v)$, and
2. both induced subtrees $T_1[H_1]$ and $T_2[H_2]$ are connected.
The problem is to find a maximum similarity subtree isomorphism $\phi^*$, such that
$W_\sigma(\phi) = \sum_{u \in H_1} \sigma(u, \phi(u))$ is the largest among all subtree isomorphisms
between $T_1$ and $T_2$. Let $\sigma(u, w)$ be the similarity function. Then the maximal
common similarity between subtrees $T_1[H_1]$ and $T_2[H_2]$ is defined as

$$W_\sigma(\phi^*) = \max_{\phi} \sum_{u \in H_1} \sigma(u, \phi(u)). \qquad (3)$$

Using $W_\sigma(\phi^*)$, Torsello et al. define and prove the properties of the MaxSimilarity
metrics listed in Table 1.

Table 1. Different metrics used in this work to calculate distances between trees

Metric            Tree distance
cTED              –
MaxSimilarity 1   d1(T1, T2) = max(|T1|, |T2|) − Wσ(φ∗)
MaxSimilarity 2   d2(T1, T2) = |T1| + |T2| − 2Wσ(φ∗)
MaxSimilarity 3   d3(T1, T2) = 1 − Wσ(φ∗) / max(|T1|, |T2|)
MaxSimilarity 4   d4(T1, T2) = 1 − Wσ(φ∗) / (|T1| + |T2| − Wσ(φ∗))
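
The four MaxSimilarity distances in Table 1 depend only on the tree sizes |T1| and |T2| and on the maximal common similarity Wσ(φ∗). The following is a minimal sketch of that arithmetic, assuming Wσ(φ∗) has already been computed by some subtree-isomorphism routine that is not shown here. The function name and its parameters are hypothetical and do not come from the paper.

```python
def max_similarity_distances(size_t1, size_t2, w_star):
    """Return (d1, d2, d3, d4) as listed in Table 1.

    size_t1, size_t2 -- node counts |T1| and |T2| (assumed positive)
    w_star           -- maximal common similarity W_sigma(phi*), assumed
                        to be precomputed elsewhere (hypothetical input)
    """
    if size_t1 <= 0 or size_t2 <= 0:
        raise ValueError("tree sizes must be positive")

    larger = max(size_t1, size_t2)
    union_like = size_t1 + size_t2 - w_star  # denominator of d4
    if union_like <= 0:
        raise ValueError("w_star is too large for the given tree sizes")

    d1 = larger - w_star                   # MaxSimilarity 1
    d2 = size_t1 + size_t2 - 2 * w_star    # MaxSimilarity 2
    d3 = 1 - w_star / larger               # MaxSimilarity 3 (normalized)
    d4 = 1 - w_star / union_like           # MaxSimilarity 4 (normalized)
    return d1, d2, d3, d4


if __name__ == "__main__":
    # Illustrative values only: |T1| = 5, |T2| = 4, W = 3.
    print(max_similarity_distances(5, 4, 3))
```

If σ takes values in [0, 1], two identical trees give Wσ(φ∗) = |T1| = |T2|, and all four distances evaluate to zero, which is a quick sanity check on any implementation.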

4.2 Clustering as an Optimization Problem


Clustering by using a k-means or k-medians algorithm divides the dataset A =
{a1, . . . , aN} into disjoint non-empty subsets Bi, ∪i Bi = A, together with a set
The Project Gutenberg eBook of Ozymandias
This ebook is for the use of anyone anywhere in the United States
and most other parts of the world at no cost and with almost no
restrictions whatsoever. You may copy it, give it away or re-use it
under the terms of the Project Gutenberg License included with this
ebook or online at www.gutenberg.org. If you are not located in the
United States, you will have to check the laws of the country where
you are located before using this eBook.

Title: Ozymandias

Author: Ivar Jorgensen

Illustrator: Dan Adkins

Release date: November 9, 2023 [eBook #72080]

Language: English

Original publication: New York, NY: Royal Publications, Inc, 1958

Credits: Greg Weeks, Mary Meehan and the Online Distributed
Proofreading Team at http://www.pgdp.net

*** START OF THE PROJECT GUTENBERG EBOOK OZYMANDIAS ***
OZYMANDIAS

By IVAR JORGENSON

Illustrated by DAN ADKINS

There was open strife between the military
and scientific staffs. But which was mightier?

[Transcriber's Note: This etext was produced from
Infinity November 1958.
Extensive research did not uncover any evidence that
the U.S. copyright on this publication was renewed.]

The planet had been dead about a million years. That was our first
impression, as our ship orbited down to its sere brown surface, and
as it happened our first impression turned out to be right. There had
been a civilization here once—but Earth had swung around Sol ten-
to-the-sixth times since the last living being of this world had drawn
breath.
"A dead planet," Colonel Mattern exclaimed bitterly. "Nothing here
that's of any use. We might as well pack up and move on."
It was hardly surprising that Mattern would feel that way. In urging a
quick departure and an immediate removal to some world of greater
utilitarian value, Mattern was, after all, only serving the best interests
of his employers. His employers were the General Staff of the Armed
Forces of the United States of America. They expected Mattern and
his half of the crew to produce results, and by way of results they
meant new weapons and sources of strategic materials. They hadn't
tossed in 70% of the budget for this trip just to sponsor a lot of
archaeological putterings.
But luckily for our half of the outfit—the archaeological putterers' half
—Mattern did not have an absolute voice in the affairs of the outfit.
Perhaps the General Staff had kicked in for 70% of our budget, but
the cautious men of the military's Public Liaison branch had seen to
it that we had at least some rights.
Dr. Leopold, head of the non-military segment of the expedition, said
brusquely, "Sorry, Mattern, but I'll have to apply the limiting clause
here."
Mattern started to sputter. "But—"
"But nothing, Mattern. We're here. We've spent a good chunk of
American cash in getting here. I insist that we spend the minimum
time allotted for scientific research, as long as we are here."
Mattern scowled, looking down at the table, supporting his chin in his
thumbs and digging the rest of his fingers in hard back of his
jawbone. He was annoyed, but he was smart enough to know he
didn't have much of a case to make against Leopold.
The rest of us—four archaeologists and seven military men; they
outnumbered us a trifle—watched eagerly as our superiors battled.
My eyes strayed through the porthole and I looked at the dry
windblown plain, marked here and there with the stumps of what
might have been massive monuments millennia ago.
Mattern said bleakly, "The world is of utterly no strategic
consequence. Why, it's so old that even the vestiges of civilization
have turned to dust!"
"Nevertheless, I reserve the right granted to me to explore any world
we land on, for a period of at least one hundred sixty-eight hours,"
Leopold returned implacably.
Exasperated, Mattern burst out, "Dammit, why? Just to spite me?
Just to prove the innate intellectual superiority of the scientist to the
man of war?"
"Mattern, I'm not injecting personalities into this."
"I'd like to know what you are doing, then? Here we are on a world
that's obviously useless to me and probably just as useless to you.
Yet you stick me on a technicality and force me to waste a week
here. Why, if not out of spite?"
"We've made only the most superficial reconnaissance so far,"
Leopold said. "For all we know this place may be the answer to
many questions of galactic history. It may even be a treasure-trove of
superbombs, for all—"
"Pretty damned likely!" Mattern exploded. He glared around the
conference room, fixing each of the scientific members of the
committee with a baleful stare. He was making it quite clear that he
was trapped into a wasteful expense of time by our foggy-eyed
desire for Knowledge.
Useless knowledge. Not good hard practical knowledge of the kind
he valued.
"All right," he said finally. "I've protested and I've lost, Leopold. You're
within your rights in insisting on remaining here one week. But you'd
damned well better be ready to blast off when your time's up!"

It had been foregone all along, of course. The charter of our
expedition was explicit on the matter. We had been sent out to comb
a stretch of worlds near the Galactic Rim that had already been
brushed over hastily by a survey mission.
The surveyors had been looking simply for signs of life, and, finding
none (of course), they had moved on. We were entrusted with the
task of investigating in detail. Some of the planets in the group had
been inhabited once, the surveyors had reported. None bore present
life. None of the planets we had ever visited had been found to hold
intelligent life, though many had in the past.
Our job was to comb through the assigned worlds with diligence.
Leopold, leading our group, had the task of doing pure
archaeological research on the dead civilizations; Mattern and his
men had the more immediately practical job of looking for fissionable
material, leftover alien weapons, possible sources of lithium or tritium
for fusion, and other such militarily useful things. You might argue
that in a strictly pragmatic sense our segment of the group was just
dead weight, carted along for the ride at great expense, and you
would be right.
But the public temper over the last few hundred years in America
has frowned on purely military expeditions. And so, as a sop to the
nation's conscience, five archaeologists of little empirical
consequence so far as national security mattered were tacked onto
the expedition.
Us.
Mattern made it quite clear at the outset that his boys were the
Really Important members of the expedition, and that we were
simply ballast. In a way, we had to agree. Tension was mounting
once again on our sadly disunited planet; there was no telling when
the Other Hemisphere would rouse from its quiescence of a hundred
years and decide to plunge once more into space. If anything of
military value lay out here, we knew we had to find it before They
did.
The good old armaments race. Hi-ho! The old space stories used to
talk about expeditions from Earth. Well, we were from Earth,
abstractly speaking—but in actuality we were from America, period.
Global unity was as much of a pipedream as it had been three
hundred years earlier, in the remote and primitive chemical-rocket
era of space travel. Amen. End of sermon. We got to work.

The planet had no name, and we didn't give it one; a special
commission of what was laughably termed the United Nations
Organization was working on the problem of assigning names to the
hundreds of worlds of the galaxy, using the old idea of borrowing
from ancient Terran mythologies in analogy to the Mercury-Venus-
Mars nomenclature of our own system.
Probably they would end up saddling this world with something like
Thoth or Bel-Marduk or perhaps Avalokitesvara. We knew it simply
as Planet Four of the system belonging to a yellow-white F5 IV
Procyonoid sun, Revised HD Catalog #170861.
It was roughly Earthtype, with a diameter of 6100 miles, a gravity
index of .93, a mean temperature of 45 degrees F. with a daily
fluctuation range of about ten degrees, and a thin, nasty atmosphere
composed mostly of carbon dioxide with wisps of helium and
nitrogen and the barest smidgeon of oxygen. Quite possibly the air
had been breathable by humanoid life a million years ago—but that
was a million years ago. We took good care to practice our
breathing-mask drills before we ventured out of the ship.
The sun, as noted, was an F5 IV and fairly hot, but Planet Four was
a hundred eighty-five million miles away from it at perihelion and a
good deal farther when it was at the other swing of its rather
eccentric orbit; the good old Keplerian ellipse took quite a bit of
punishment in this system. Planet Four reminded me in many ways
of Mars—except that Mars, of course, had never known intelligent
life of any kind, at least none that had troubled to leave a hint of its
existence, while this planet had obviously had a flourishing
civilization at a time when Pithecanthropus was Earth's noblest
being.

In any event, once we had thrashed out the matter of whether or not
we were going to stay here or pull up and head for the next planet on
our schedule, the five of us set to work. We knew we had only a
week—Mattern would never grant us an extension unless we came
up with something good enough to change his mind, which was
improbable—and we wanted to get as much done in that week as
possible. With the sky as full of worlds as it is, this planet might
never be visited by Earth scientists again.
Mattern and his men served notice right away that they were going
to help us, but reluctantly and minimally. We unlimbered the three
small halftracks carried aboard ship and got them into functioning
order. We stowed our gear—cameras, pick-&-shovels, camel's-hair
brushes—and donned our breathing-masks, and Mattern's men
helped us get the halftracks out of the ship and pointed in the right
direction.
Then they stood back and waited for us to shove off.
"Don't any of you plan to accompany us?" Leopold asked. The
halftracks each held up to four men.
Mattern shook his head. "You fellows go out by yourselves today and
let us know what you find. We can make better use of the time filing
and catching up on back log entries."
I saw Leopold start to scowl. Mattern was being openly
contemptuous; the least he could do was have his men make a
token search for fissionable or fusionable matter! But Leopold
swallowed down his anger.
"Okay," he said. "You do that. If we come across any raw veins of
plutonium I'll radio back."
"Sure," Mattern said. "Thanks for the favor. Let me know if you find a
brass mine, too." He laughed harshly. "Raw plutonium! I half believe
you're serious!"

We had worked out a rough sketch of the area, and we split up into
three units. Leopold, alone, headed straight due west, toward the dry
riverbed we had spotted from the air. He intended to check alluvial
deposits, I guess.
Marshall and Webster, sharing one halftrack, struck out to the hilly
country southeast of our landing point. A substantial city appeared to
be buried under the sand there. Gerhardt and I, in the other vehicle,
made off to the north, where we hoped to find remnants of yet
another city. It was a bleak, windy day; the endless sand that
covered this world mounted into little dunes before us, and the wind
picked up handfuls and tossed it against the plastite dome that
covered our truck. Underneath the steel cleats of our tractor-belt,
there was a steady crunch-crunch of metal coming down on sand
that hadn't been disturbed in millennia.
Neither of us spoke for a while. Then Gerhardt said, "I hope the
ship's still there when we get back to the base."
Frowning, I turned to look at him as I drove. Gerhardt had always
been an enigma: a small scrunchy guy with untidy brown hair
flapping in his eyes, eyes that were set a little too close together. He
had a degree from the University of Kansas and had put in some
time on their field staff with distinction, or so his references said.
I said, "What the hell do you mean?"
"I don't trust Mattern. He hates us."
"He doesn't. Mattern's no villain—just a fellow who wants to do his
job and go home. But what do you mean, the ship not being there?"
"He'll blast off without us. You see the way he sent us all out into the
desert, and kept his own men back. I tell you, he'll strand us here!"
I snorted. "Don't be a paranoid. Mattern won't do anything of the
sort."
"He thinks we're dead weight on the expedition," Gerhardt insisted.
"What better way to get rid of us?"
The halftrack breasted a hump in the desert. I kept wishing a vulture
would squeal somewhere, but there was not even that. Life had left
this world ages ago. I said, "Mattern doesn't have much use for us,
sure. But would he blast off and leave three perfectly good halftracks
behind? Would he?"
It was a good point. Gerhardt grunted agreement after a while.
Mattern would never toss equipment away, though he might not have
such scruples about five surplus archaeologists.
We rode along silently for a while longer. By now we had covered
twenty miles through this utterly barren land. As far as I could see,
we might just as well have stayed at the ship. At least there we had a
surface lie of building foundations.
But another ten miles and we came across our city. It seemed to be
of linear form, no more than half a mile wide and stretching out as far
as we could see—maybe six or seven hundred miles; if we had time,
we would check the dimensions from the air.
Of course it wasn't much of a city. The sand had pretty well covered
everything, but we could see foundations jutting up here and there,
weathered lumps of structural concrete and reinforced metal. We got
out and unpacked the power-shovel.
An hour later, we were sticky with sweat under our thin spacesuits
and we had succeeded in transferring a few thousand cubic yards of
soil from the ground to an area a dozen yards away. We had dug
one devil of a big hole in the ground.
And we had nothing.
Nothing. Not an artifact, not a skull, not a yellowed tooth. No spoons,
no knives, no baby-rattles.
Nothing.
The foundations of some of the buildings had endured, though
whittled down to stumps by a million years of sand and wind and
rain. But nothing else of this civilization had survived. Mattern, in his
scorn, had been right, I admitted ruefully: this planet was as useless
to us as it was to them. Weathered foundations could tell us little
except that there had once been a civilization here. An imaginative
paleontologist can reconstruct a dinosaur from a fragment of a thigh-
bone, can sketch out a presentable saurian with only a fossilized
ischium to guide him. But could we extrapolate a culture, a code of
laws, a technology, a philosophy, from bare weathered building
foundations?
Not very likely.
We moved on and dug somewhere else half a mile away, hoping at
least to unearth one tangible remnant of the civilization that had
been. But time had done its work; we were lucky to have the building
foundations. All else was gone.
"Boundless and bare, the lone and level sands stretch far away," I
muttered.
Gerhardt looked up from his digging. "Eh? What's that?" he
demanded.
"Shelley," I told him.
"Oh. Him."
He went back to digging.

Late in the afternoon we finally decided to call it quits and head back
to the base. We had been in the field for seven hours, and had
nothing to show for it except a few hundred feet of tridim films of
building foundations.
The sun was beginning to set; Planet Four had a thirty-five hour day,
and it was coming to its end. The sky, always somber, was darkening
now. There was no moon to be still as bright. Planet Four had no
satellites. It seemed a bit unfair; Three and Five of the system each
had four moons, while around the massive gas giant that was Eight a
cluster of thirteen moonlets whirled.
We wheeled round and headed back, taking an alternate route three
miles east of the one we had used on the way out, in case we might
spot something. It was a forlorn hope, though.
Six miles along our journey, the truck radio came to life. The dry,
testy voice of Dr. Leopold reached us:
"Calling Trucks Two and Three. Two and Three, do you read me?
Come in, Two and Three."
Gerhardt was driving. I reached across his knee to key in the
response channel and said, "Anderson and Gerhardt in Number
Three, sir. We read you."
A moment later, somewhat more faintly, came the sound of Number
Two keying into the threeway channel, and I heard Marshall saying,
"Marshall and Webster in Two, Dr. Leopold. Is something wrong?"
"I've found something," Leopold said.
From the way Marshall exclaimed "Really!" I knew that Truck
Number Two had had no better luck than we. I said, "That makes
one of us, then."
"You've had no luck, Anderson?"
"Not a scrap. Not a potsherd."
"How about you, Marshall?"
"Check. Scattered signs of a city, but nothing of archaeological
value, sir."
I heard Leopold chuckle before he said, "Well, I've found something.
It's a little too heavy for me to manage by myself. I want both outfits
to come out here and take a look at it."
"What is it, sir?" Marshall and I asked simultaneously, in just about
the same words.
But Leopold was fond of playing the Man of Mystery. He said, "You'll
see when you get here. Take down my coordinates and get a move
on. I want to be back at the base by nightfall."

Shrugging, we changed course to head for Leopold's location. He
was about seventeen miles southwest of us, it seemed. Marshall and
Webster had an equally long trip to make; they were sharply
southeast of Leopold's position.
The sky was fairly dark when we arrived at what Leopold had
computed as his coordinates. The headlamps of the halftrack lit up
the desert for nearly a mile, and at first there was no sign of anyone
or anything. Then I spotted Leopold's halftrack parked off to the east,
and from the south Gerhardt saw the lights of the third truck rolling
toward us.
We reached Leopold at about the same time. He was not alone.
There was an—object—with him.
"Greetings, gentlemen." He had a smug grin on his whiskery face. "I
seem to have made a find."
He stepped back and, as if drawing an imaginary curtain, let us take
a peek at his find. I frowned in surprise and puzzlement. Standing in
the sand behind Leopold's halftrack was something that looked very
much like a robot.
It was tall, seven feet or more, and vaguely humanoid: that is, it had
arms extending from its shoulders, a head on those shoulders, and
legs. The head was furnished with receptor plates where eyes, ears,
and mouth would be on humans. There were no other openings. The
robot's body was massive and squarish, with sloping shoulders, and
its dark metal skin was pitted and corroded as by the workings of the
elements over uncountable centuries.
It was buried up to its knees in sand. Leopold, still grinning smugly
(and understandably proud of his find) said, "Say something to us,
robot."
From the mouth-receptors came a clanking sound, the gnashing of—
what? gears?—and a voice came forth, oddly high-pitched but
audible. The words were alien and were spoken in a slippery
singsong kind of inflection. I felt a chill go quivering down my back.
The Age of Space Exploration was three centuries old—and for the
first time human ears were hearing the sounds of a language that
had not been spawned on Earth.
"It understands what you say?" Gerhardt questioned.
"I don't think so," Leopold said. "Not yet, anyway. But when I address
it directly, it starts spouting. I think it's a kind of—well, guide to the
ruins, so to speak. Built by the ancients to provide information to
passersby; only it seems to have survived the ancients and their
monuments as well."
I studied the thing. It did look incredibly old—and sturdy; it was so
massively solid that it might indeed have outlasted every other
vestige of civilization on this planet. It had stopped talking, now, and
was simply staring ahead. Suddenly it wheeled ponderously on its
base, swung an arm up to take in the landscape nearby, and started
speaking again.
I could almost put the words in its mouth: "—and over here we have
the ruins of the Parthenon, chief temple of Athena on the Acropolis.
Completed in the year 438 B.C., it was partially destroyed by an
explosion in 1687 while in use as a powder magazine by the Turks
—"
"It does seem to be a sort of a guide," Webster remarked. "I get the
definite feeling that we're being given an historical narration now, all
about the wondrous monuments that must have been on this site
once."
"If only we could understand what it's saying!" Marshall exclaimed.
"We can try to decipher the language somehow," Leopold said.
"Anyway, it's a magnificent find, isn't it? And—"
I began to laugh suddenly. Leopold, offended, glared at me and said,
"May I ask what's so funny, Dr. Anderson?"
"Ozymandias!" I said, when I had subsided a bit. "It's a natural!
Ozymandias!"
"I'm afraid I don't—"
"Listen to him," I said. "It's as if he was built and put here for those
who follow after, to explain to us the glories of the race that built the
cities. Only the cities are gone, and the robot is still here! Doesn't he
seem to be saying, 'Look on my works, ye Mighty, and despair!'"
"Nothing besides remains," Webster quoted. "It's apt. Builders and
cities all gone, but the poor robot doesn't know it, and delivers his
spiel nonetheless. Yes. We ought to call him Ozymandias!"
Gerhardt said, "What shall we do with it?"
"You say you couldn't budge it?" Webster asked Leopold.
"It weighs five or six hundred pounds. It can move of its own volition,
but I couldn't move it myself."
"Maybe the five of us—" Webster suggested.
"No," Leopold said. An odd smile crossed his face. "We will leave it
here."
"What?"
"Only temporarily," he added. "We'll save it—as a sort of surprise for
Mattern. We'll spring it on him the final day, letting him think all along
that this planet was worthless. He can rib us all he wants—but when
it's time to go, we'll produce our prize!"
"You think it's safe to leave it out here?" Gerhardt asked.
"Nobody's going to steal it," Marshall said.
"And it won't melt in the rain," Webster added.
"But—suppose it walks away?" Gerhardt demanded. "It can do that,
can it not?"
Leopold said, "Of course. But where would it go? It will remain where
it is, I think. If it moves, we can always trace it with the radar. Back to
the base, now; it grows late."
We climbed back into our halftracks. The robot, silent once again,
planted knee-deep in the sand, outlined against the darkening sky,
swivelled to face us and lifted one thick arm in a kind of salute.
"Remember," Leopold warned us as we left. "Not one word about
this to Mattern!"

At the base that night, Colonel Mattern and his seven aides were
remarkably curious about our day's activities. They tried to make it
seem as if they were taking a sincere interest in our work, but it was
perfectly obvious to us that they were simply goading us into telling
them what they had anticipated—that we had found absolutely
nothing. This was the response they got, since Leopold forbade
mentioning Ozymandias. Aside from the robot, the truth was that we
had found nothing, and when they learned of this they smiled
knowingly, as if saying that had we listened to them in the first place
we would all be back on Earth seven days earlier, with no loss.
The following morning after breakfast Mattern announced that he
was sending out a squad to look for fusionable materials, unless we
objected.
"We'll only need one of the halftracks," he said. "That leaves two for
you. You don't mind, do you?"
"We can get along with two," Leopold replied a little sourly. "Just so
you keep out of our territory."
"Which is?"
Instead of telling him, Leopold merely said, "We've adequately
examined the area to the southeast of here, and found nothing of
note. It won't matter to us if your geological equipment chews the
place up."
Mattern nodded, eyeing Leopold curiously as if the obvious
concealment of our place of operations had aroused suspicions. I
wondered whether it was wise to conceal information from Mattern.
Well, Leopold wanted to play his little game, I thought; and one way
to keep Mattern from seeing Ozymandias was not to tell him where
we would be working.
"I thought you said this planet was useless from your viewpoint,
Colonel," I remarked.
Mattern stared at me. "I'm sure of it. But it would be idiotic of me not
to have a look, wouldn't it—as long as we're spending the time here
anyway?"
I had to admit that he was right. "Do you expect to find anything,
though?"
He shrugged. "No fissionables, certainly. It's a safe bet that
everything radioactive on this planet has long since decomposed.
But there's always the possibility of lithium, you know."
"Or pure tritium," Leopold said acidly. Mattern merely laughed, and
made no reply.
Half an hour later we were bound westward again to the point where
we had left Ozymandias. Gerhardt, Webster and I rode together in
one halftrack, and Leopold and Marshall occupied the other. The
third, with two of Mattern's men and the prospecting equipment,
ventured off to the southeast toward the area Marshall and Webster
had fruitlessly combed the day before.
Ozymandias was where we had left him, with the sun coming up
behind him and glowing round his sides. I wondered how many
sunrises he had seen. Billions, perhaps.
We parked the halftracks not far from the robot and approached,
Webster filming him in the bright light of morning. A wind was
whistling down from the north, kicking up eddies in the sand.
"Ozymandias have remain here," the robot said as we drew near.
In English.
For a moment we didn't realize what had happened, but what
followed afterward was a five-man quadruple-take. While we gabbled
