Miroslav Bursa, Sami Khuri, M. Elena Renda (Eds.)
Information Technology in Bio- and Medical Informatics
5th International Conference, ITBAM 2014
Munich, Germany, September 2, 2014
Proceedings
LNCS 8649
Lecture Notes in Computer Science 8649
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Volume Editors
Miroslav Bursa
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Technicka 2
166 27 Prague 6, Czech Republic
E-mail: [email protected]
Sami Khuri
San Jose State University
Department of Computer Science
One Washington Square
San Jose, CA 95192-0249, USA
E-mail: [email protected]
M. Elena Renda
Istituto di Informatica e Telematica del CNR
Via G. Moruzzi 1
56124 Pisa, Italy
E-mail: [email protected]
General Chair
Christian Böhm: University of Munich, Germany
Program Committee
Werner Aigner: FAW, Austria
Fuat Akal: Functional Genomics Center Zurich, Switzerland
Tatsuya Akutsu: Kyoto University, Japan
Andreas Albrecht: Queen’s University Belfast, Ireland
Peter Baumann: Jacobs University Bremen, Germany
Balaram Bhattacharyya: Visva-Bharati University, India
Veselka Boeva: Technical University of Plovdiv, Bulgaria
Roberta Bosotti: Nerviano Medical Science s.r.l., Italy
Rita Casadio: University of Bologna, Italy
Sònia Casillas: Universitat Autònoma de Barcelona, Spain
Kun-Mao Chao: National Taiwan University, Taiwan
Vaclav Chudacek: Czech Technical University in Prague, Czech Republic
Hans-Dieter Ehrich: Technical University of Braunschweig, Germany
Christoph M. Friedrich: University of Applied Sciences Dortmund, Germany
Alejandro Giorgetti: University of Verona, Italy
Jan Havlik: Dep. of Circuit Theory, FEE, Czech Technical University in Prague, Czech Republic
Volker Heun: Ludwig-Maximilians-Universität München, Germany
Larisa Ismailova: NRNU MEPhI, Russia
Alastair Kerr: University of Edinburgh, UK
Poster Session
Knowledge Reasoning Model to Support Clinical Decision Making
Qingshan Li, Jing Feng, Lu Wang, Hua Chu, and WeiJuan Fu
1 Introduction
In the age of Whole Genome Shotgun (WGS) sequencing and information
technology, new techniques and applications for studying microorganisms
are in high demand in both the clinical and environmental communities.
The number of existing microbial species is estimated at 10^5 to 10^6 [1, 2].
This work was performed when Ramin Karimi was visiting the LIAS/ISAE-ENSMA
Lab. This visit is funded by ERASMUS mobility program. The work was also sup-
ported in part by the projects TMOP-4.2.2.C-11/1/KONV-2012-0001, and TMOP
4.2.4. A/2-11-1-2012-0001 supported by the European Union, co-financed by the
European Social Fund, and by the OTKA grant NK101680.
M. Bursa et al. (Eds.): ITBAM 2014, LNCS 8649, pp. 1–14, 2014.
© Springer International Publishing Switzerland 2014
2 R. Karimi et al.
The majority (> 99%) of microorganisms from the environment resist cultivation
in the laboratory [3], and it was impossible to investigate them until a few years
ago. With recent advances in next generation sequencing (NGS) and Metagenomics
techniques, it is now possible to obtain directly the genetic content of all
organisms in a complex community sampled from the natural environment in which
they normally live.
The output of sequencing technology is short fragments of DNA sequence, called
short reads, with lengths of 25 to 900 base pairs (bp). Read length varies from
one sequencing technology to another. For instance, sequencing machines made by
Illumina, Applied Biosystems (ABI), and Helicos of Cambridge produce short
sequences of 25 to 100 bp.
Long DNA molecules extracted from the sample are broken into smaller pieces
by special fragmentation and cloning techniques. These small pieces are then
fed into the sequencer, which determines the order of nucleotides in the short
fragments of DNA [4]. The sequencing output for a Metagenome sample is an
enormous data set containing the short reads of hundreds to thousands of known
and unknown organisms. Efficient implementations that facilitate the analysis
process are urgently required in both the biological and computational parts of
any Metagenomics project. Figure 1 details the steps involved in a typical
sequence-based Metagenome project [5].
Fig. 1. A typical Metagenome project flow diagram. Dashed arrows indicate steps that
can be omitted.
BINOS4DNA: Bitmap Indexes and NoSQL for Identifying Species 3
2 Background
In this section, we review the technologies and the concepts that we use in our
methodology.
Advances in parallel and distributed computing have opened new doors for
many researchers who could not access high performance computers (HPC). The
Apache Hadoop software library [18–20] is an open source framework written in
Java. Hadoop applies parallel and distributed computing to run simple
programming models on large data sets across the nodes of a cluster.
The idea behind Hadoop is to store and process big data on clusters of
commodity hardware instead of expensive high performance computers, which
are not available to everybody.
Hadoop handles any type of data, structured or unstructured: text files, log
files, images, audio files, communications records, etc. A Hadoop cluster has a
single Master and several Slave nodes. It can run as a single-node cluster or as
a multi-node cluster with thousands of nodes. The Hadoop core has two components:
Hadoop Distributed File System (HDFS) and MapReduce.
2.2 MapReduce
MapReduce is a programming model for data processing. It works by breaking
the process into two phases: the map phase and the reduce phase [18, 19]. The
two main components of MapReduce are:
– TaskTracker: As the slave, it receives the mapper and reducer task from
JobTracker and returns the results to the JobTracker after execution.
Hadoop is highly fault tolerant. To guard against failures, HDFS replicates
each data block, keeping 3 copies by default. The NameNode can detect failures
of DataNodes or blocks, and the JobTracker can detect failed TaskTrackers and
will replace them.
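The two phases described in this section can be imitated on a single machine. The following is only a minimal sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the toy reads and signatures are hypothetical, and the example assumes that a job counts how many short reads contain each signature.

```python
from collections import defaultdict

def map_phase(read, signatures):
    """Map: emit a (signature, 1) pair for every signature found in one read."""
    return [(sig, 1) for sig in signatures if sig in read]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values that were emitted for one key."""
    return key, sum(values)

def run_job(reads, signatures):
    emitted = []
    for read in reads:  # on a cluster, each mapper would see only part of the reads
        emitted.extend(map_phase(read, signatures))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

# Toy data, invented for illustration only.
reads = ["ACGTACGT", "TTTTACGT", "GGGGCCCC"]
counts = run_job(reads, ["ACGT", "CCCC", "AAAA"])
print(counts)
```

Signatures that occur in no read never reach the reduce phase, so they are simply absent from the result.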
2.3 NoSQL
"NoSQL" stands for Not Only SQL. The term "NoSQL" was first used by Carlo
Strozzi in 1998 [22]. It denotes non-relational databases [27]. One of the
strengths of NoSQL is its ability to handle database analytics of big data
sets on parallel and distributed platforms like Hadoop on commodity hardware.
Hive and HBase are NoSQL applications built on top of the Apache Hadoop file
system. NoSQL databases can handle unstructured data such as text files, log
files, email, social media and multimedia. Horizontal scaling, one of the most
important features of NoSQL databases, allows us to add more nodes to our
distributed system, whereas vertical scaling only allows us to increase the
power of an existing machine [23, 24].
2.4 Hive
Hive [19], [25] is a data warehousing infrastructure on top of Hadoop and HDFS.
HiveQL, a SQL-like language, simplifies querying of large unstructured
datasets in distributed storage. Hive is designed for data that is written once
and read many times; real-time queries and row-level updates are not possible.
Hive is easy to use for anybody who is familiar with SQL queries. The Facebook
Data Infrastructure Team started to create Hive in January 2007 to bring the
familiar concepts of tables, columns, partitions and a subset of SQL to the
unstructured world of Hadoop, and it was open sourced in August 2008 [26].
Hive supports Bitmap Indexes from version 0.8.
column Grade having low cardinality. In this case our index has the same number
of rows as the table, and its number of columns is equal to the number of
distinct values in column Grade. In Table 1, the cardinality of the column
Grade is 4 because it contains 4 different values.
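A bitmap index of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not the layout Hive uses internally: the `grades` values are invented (Table 1 is not reproduced here) and `build_bitmap_index` is a hypothetical helper. It builds one 0/1 column per distinct value and answers a simple predicate by scanning or combining those columns.

```python
def build_bitmap_index(values):
    """Return {distinct value: list of 0/1 bits, one bit per table row}."""
    distinct = sorted(set(values))
    return {v: [1 if x == v else 0 for x in values] for v in distinct}

# Hypothetical low-cardinality column with 4 distinct values.
grades = ["A", "B", "C", "A", "D", "B"]
index = build_bitmap_index(grades)

# Rows (1-based) where Grade = 'B': read them straight off one bitmap.
rows_with_b = [i for i, bit in enumerate(index["B"], start=1) if bit]

# Grade = 'A' OR Grade = 'B': a bitwise OR of two bitmaps.
a_or_b = [x | y for x, y in zip(index["A"], index["B"])]

print(len(index), rows_with_b, a_or_b)
```

The index has as many rows as the column and as many bitmaps as the column has distinct values, which is why the technique suits low-cardinality attributes.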
2.6 HBase
HBase [27] is a type of NoSQL database. It is an open-source, distributed,
column-oriented and scalable database built on top of the Hadoop file system.
It is designed for random, real-time read/write access to very large tables
with billions of rows and millions of columns on commodity hardware.
3 Our Methodology
We have downloaded all complete Bacterial genomes from the National Cen-
ter for Biotechnology Information (NCBI) database [30]. The total number of
genomes was 2773 bacterial species and subspecies at the time (16.01.2014).
3.1 Insignia
Insignia is a pipeline for generating unique DNA signatures; it is also a
database and web application for obtaining them. It contains signatures of
11274 viruses/phages and 2653 non-viral organisms, with lengths between 18 and
500 bp. Insignia detects signatures for designing primers in Polymerase Chain
Reaction (PCR) and probes in micro-array technologies. The signatures can also
be used for real-time identification of species in microbial and viral assays
[15, 16], [31].
We downloaded DNA signatures for two groups of 50 bacteria from the Insignia
database. As we are still in the testing process, we downloaded only the
signatures with a length of 18 bp. As an example, Table 2 shows the first
DNA signatures of Acholeplasma laidlawii.
Insignia V0.7
Signatures calculated: Thu Mar 6 2014 10:58:06
Reference Organism:
Acholeplasma laidlawii PG-8A
Target Organism(s):
Signatures:
Index Start stop sequence
63965 451703 451720 ACATAAGCAGGTGCGGAA
63966 670606 670623 GATACCAATACCGCAGAT
63967 692909 692926 CCCATTCAACTTCGATCA
63968 530281 530298 ATCAACGCTAGATGAGCA
63969 268209 268226 ATGGAGGAGTCTGGATAC
63970 69763 69780 ACAGCAACAGCGTATATC
63971 357337 357354 GTGTTAGCGTTAAGTCTG
63972 1001550 1001567 TAGCCTCTTTAAGCAGGT
63973 1366201 1366218 ATGATGCAAGTGGCATGG
63974 1141698 1141715 TGCAACGGATGCATCAAG
3.2 Metasim
Metasim is a sequencing simulator application for genomics and Metagenomics
studies. It can be a great help in developing and improving Metagenomics tools,
and in planning Metagenomics projects [32, 33]. Metasim can simulate the short
reads of Roche's 454 pyrosequencing, Sanger sequencing and Empirical sequencing
technology. In this paper, we use the Roche's 454 pyrosequencing simulation.
The output of Metasim is a compressed file containing the short reads of a
bacterial chromosome or one of its plasmids, together with their information.
>r16.1|SOURCES={GI=11497281,bw,1947919816}|ERRORS={8 1:C,46:
,135 1:T,160 1:A,190 1:G}|SOURCE 1=”Borrelia burgdorferi B31 plasmid cp32-8”
(44840ff90be8dcf7b704d6908ca095d559d2949e)
TTTAGGATTCGTACCCGTTTTCTTCTAATTTTTTCCTAGTGTTGTATGAATTT
CTTTTAATTTTTTTTGTTTTTCTTTCATGCAAGATTTTTTTATATTGAATTTT
TTTATTAGGGCAATTTCATTTTGTTTTAAGTATATTTATTGCCTCAATCTTAG
TATACTTTATCAATATTTAAATACAAAATAGAAAGGAGCTTCTTCCGTTTTAA
AGTTACAATTATTGAAATAATTTCTTAGTTGATATTTTTCTATTTCTTTAATC
TTTCTTTCTTCTTTTATATTATTTTTATTA
We chose 100 bacterial genomes from the NCBI data set for simulating the short
reads. The first group of 50 bacteria from the Insignia database is drawn from
these 100 chosen genomes, whereas the second group consists of 50 bacteria that
are not among these 100.
Table 3. An example for our index tables; each column of these tables is kept as a
single file
Another way is to keep each bacterium as a column. We store '1' if any signature
of the bacterium exists in a short read, and '0' if not. In this case the table
is much smaller. The number of columns is equal to the number of bacteria plus
two more columns, one for row identification and the other for short reads. The
number of rows is equal to the number of short reads.
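The layout just described, one 0/1 column per bacterium, can be sketched as follows. This is a minimal in-memory illustration, not the authors' Java program: `index_column` and `build_index_table` are hypothetical names, the reads and signatures are invented and far shorter than real 18 bp signatures, and matching is assumed to be exact substring search.

```python
def index_column(reads, signatures):
    """One bit per read: 1 if any signature of the bacterium occurs in the read."""
    return [1 if any(sig in read for sig in signatures) else 0 for read in reads]

def build_index_table(reads, signatures_by_bacterium):
    """Return (column names, rows); each row is (rid, read, bit per bacterium)."""
    names = list(signatures_by_bacterium)
    columns = [index_column(reads, signatures_by_bacterium[n]) for n in names]
    rows = [
        (rid, read, *[col[rid - 1] for col in columns])
        for rid, read in enumerate(reads, start=1)
    ]
    return names, rows

# Toy data, invented for illustration only.
reads = ["AAACGT", "GGGTTT", "CCCCCC"]
signatures = {"b1": ["ACG"], "b2": ["TTT", "GGA"]}
names, rows = build_index_table(reads, signatures)
for row in rows:
    print("\t".join(str(field) for field in row))
```

Each row has the number of bacteria plus two fields, and there is one row per short read, matching the table shape described above.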
We can use the Linux paste command to merge all the files into a single file.
As an example, in Table 4 we have 6 files. One file contains the reads
and their identification numbers, and the other five contain '0' and '1' for
five bacteria.
1 R1 0 0 0 1 0
2 R2 1 0 0 0 0
3 R3 0 0 0 1 0
4 R4 0 0 0 0 0
5 R5 0 0 1 0 0
6 R6 0 0 0 0 1
7 R7 1 0 0 0 0
8 R8 0 0 0 0 0
9 R9 0 0 0 0 0
10 R10 1 0 0 0 0
Then, we should create our table in Hive according to the structure of
newFile.txt.
hive> CREATE TABLE testTable1 (rid INT, reads STRING, b1 INT, b2 INT,
b3 INT, b4 INT, b5 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
'\t' STORED AS TEXTFILE;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/path to local dir for output'
SELECT testTable1.rid FROM testTable1 WHERE b1=1;
The output file contains the Rowid numbers of the matching short reads. As an
example, for the bacterium b1 in Table 4 we obtain 2, 7 and 10, which means
that signatures of b1 occur in these 3 short reads. Consequently, bacterium b1
is present in the Metagenome sample.
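The effect of the query can be checked by hand on the ten rows of Table 4. The sketch below is plain Python rather than HiveQL and only mirrors the `WHERE b1=1` filter; `read_ids_for` is a hypothetical helper, and the rows are copied from Table 4.

```python
# Rows of Table 4: (rid, read, b1, b2, b3, b4, b5).
rows = [
    (1, "R1", 0, 0, 0, 1, 0),
    (2, "R2", 1, 0, 0, 0, 0),
    (3, "R3", 0, 0, 0, 1, 0),
    (4, "R4", 0, 0, 0, 0, 0),
    (5, "R5", 0, 0, 1, 0, 0),
    (6, "R6", 0, 0, 0, 0, 1),
    (7, "R7", 1, 0, 0, 0, 0),
    (8, "R8", 0, 0, 0, 0, 0),
    (9, "R9", 0, 0, 0, 0, 0),
    (10, "R10", 1, 0, 0, 0, 0),
]

def read_ids_for(rows, k):
    """Row ids whose bit for bacterium b<k> is 1 (k is 1-based)."""
    return [row[0] for row in rows if row[1 + k] == 1]

hits_b1 = read_ids_for(rows, 1)
present = [f"b{k}" for k in range(1, 6) if read_ids_for(rows, k)]
print(hits_b1, present)
```

A bacterium counts as present as soon as its column contains a single 1, so one filter per column is enough to list the detected species.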
In this approach, we have to repeat the query for all bacteria one by one, or
write a long query and a long command to create the table. Hence, the larger
the number of bacteria, the longer the implementation.
There is a better solution that avoids repeating the queries or writing long
commands and queries. We can append all bacterial files with '0' and '1' one
after the other with the cat command, creating a single column in one file.
For a large number of bacteria, we can use a bash script to append as many
bacterial files as we need, one after the other, in a fixed order.
In this method, we also need to repeat the short reads in a single column as
many times as the number of bacteria. For instance, if we have 500,000 short
reads and 1000 bacteria, then we should repeat the short reads in one column
1000 times with the cat command, and the total number of rows will be
500,000,000.
Next, we need to create a table with 3 columns (rid INT, reads STRING, b
INT) and run the query just once. The results will be in one column, and we
can easily extract the information from the Rowid numbers. This leads to a
larger file size, but a faster implementation. After getting the results, we
can delete these large tables.
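The single-column layout, and the way a Rowid maps back to a bacterium and a read, can be sketched as follows. This is an illustration only: `stack_columns` and `locate` are hypothetical names, the toy bit columns are invented, and the sketch assumes that bacterial files are appended in a fixed order and that row ids are assigned after stacking.

```python
def stack_columns(reads, columns):
    """Concatenate per-bacterium bit columns into (rid, read, bit) rows."""
    stacked = []
    rid = 1
    for column in columns:          # one pass per bacterium, in fixed order
        for read, bit in zip(reads, column):
            stacked.append((rid, read, bit))
            rid += 1
    return stacked

def locate(rid, n_reads):
    """Map a stacked row id back to (bacterium number, read number), 1-based."""
    bacterium, read = divmod(rid - 1, n_reads)
    return bacterium + 1, read + 1

reads = ["R1", "R2", "R3"]
columns = [[0, 1, 0], [1, 0, 1]]    # toy bits for bacteria b1 and b2
table = stack_columns(reads, columns)

# The single query: all row ids with b = 1.
hits = [rid for rid, _, bit in table if bit == 1]
located = [locate(rid, len(reads)) for rid in hits]
print(hits, located)

# Row counts quoted in the text: reads times bacteria.
print(500_000 * 1000, 132_705 * 50)
```

Integer division of the row id by the number of reads gives the bacterium, and the remainder gives the read, which is what makes one query sufficient.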
We created testtable1 with 52 columns (rid INT, reads STRING, b1 INT,...,
b50 INT) and testtable2 with 3 columns (rid INT, reads STRING, b INT)
in Hive. We have short reads of 100 bacteria and two groups of 50 bacterial
signatures.
Because we are still in the testing phase, and the Java program that searches
for signatures in the short reads to create our index files (tables) does not
yet use Hadoop and MapReduce, we randomly chose only 10% of the short reads.
Our future work is to implement this step with MapReduce in our Java program
and to run it on a multi-node Hadoop cluster in order to speed it up.
We used the awk command to add a row identifier (Rowid) to the file
containing the short reads (132,705).
awk 'BEGIN{i=1} {if($0 !~ /^$/) {printf ("%d\t%s \n",i,$0); i++}
else { print $0} }' reads.txt >> readsid.txt
We merged this file and all 50 index files with the paste command into
a single file and loaded this file into testtable1 in Hive. Then, we used
queries to search our table. We did this for both groups of 50 bacteria.
For the second table (testtable2), we concatenated all 50 bacterial index
files in order as one column in a single file and also repeated the short reads
50 times in a single column, both with the cat command. Then, we added the
Rowid to the short reads (6,635,250 rows), pasted these three columns into one
file, and loaded it into testtable2. In this case, we need only one query to
get the results. This is a good test of the speed and efficiency of Hive when
searching millions of rows with Bitmap Index techniques.
It is also possible to integrate Hive and HBase. This feature allows HiveQL
statements to access HBase tables for both read (SELECT) and write (INSERT).
It is even possible to combine access to HBase tables with native Hive
tables via joins and unions [34]. Real-time reading and writing is possible in
HBase. These features help us to update the data and achieve a faster
implementation.
4 Experimental Study
All these implementations were run on a machine with an Intel dual-core CPU
and 4 GB of RAM, running Ubuntu 13.10, single-node-cluster Hadoop-1.2.1 and
Hive-0.11.0. Table 5 gives the elapsed time for our first implementation on
testtable1, with 52 columns, 132,705 rows and a loaded file size of 44.6 MB.
We repeated the query for all 50 columns. The time needed to change and
resubmit the queries is not included.
Table 5. Time taken for running the Hive query on testtable1 columns. Total time is
1065.927 Sec.
b1: 25.543 Sec. b11: 21.356 Sec. b21: 22.065 Sec. b31: 21.089 Sec. b41: 21.081 Sec.
b2: 22.236 Sec. b12: 21.120 Sec. b22: 21.116 Sec. b32: 22.013 Sec. b42: 21.016 Sec.
b3: 22.224 Sec. b13: 21.123 Sec. b23: 21.017 Sec. b33: 21.074 Sec. b43: 22.062 Sec.
b4: 21.187 Sec. b14: 22.065 Sec. b24: 20.977 Sec. b34: 20.991 Sec. b44: 21.000 Sec.
b5: 22.322 Sec. b15: 21.090 Sec. b25: 21.062 Sec. b35: 21.277 Sec. b45: 21.036 Sec.
b6: 21.167 Sec. b16: 21.083 Sec. b26: 21.057 Sec. b36: 20.010 Sec. b46: 21.009 Sec.
b7: 20.049 Sec. b17: 21.048 Sec. b27: 21.188 Sec. b37: 20.986 Sec. b47: 21.002 Sec.
b8: 20.048 Sec. b18: 21.123 Sec. b28: 21.108 Sec. b38: 22.063 Sec. b48: 20.997 Sec.
b9: 21.091 Sec. b19: 21.003 Sec. b29: 20.991 Sec. b39: 21.110 Sec. b49: 21.029 Sec.
b10: 22.373 Sec. b20: 21.072 Sec. b30: 21.081 Sec. b40: 20.952 Sec. b50: 22.136 Sec.
This implementation was for the first group of 50 bacteria, which are among
the 100 bacterial samples. As we expected, we found short reads containing
signatures for every bacterium. The number of matching short reads ranges
from 1 for b16 to 812 for b4.
As we expected, for the second group of 50 bacteria, which are not among the
100 samples, we could not find any short reads containing the signatures. The
average time taken was almost the same as for the first group.
Computational times for the second implementation on testtable2 with 3
columns and 6,635,250 rows and the loaded file size of 1.6 GB are:
File Size: 1.6 GB
Loading data to testtable2
Time taken: 45.588 seconds
5 Conclusion
In this paper, we show how High Performance Computing and database
optimization techniques can speed up searching for and matching a large number
of DNA signatures in the short reads of hundreds (or thousands) of different
microorganisms stored in Hive. We adapt the concept of bitmap indexes,
routinely used for indexing large database tables on attributes with low
cardinality (such as gender). This preliminary work gives encouraging results
and opens new research perspectives for exploiting database optimization
techniques and High Performance Computing in Bioinformatics. We are currently
testing our proposal on a multi-node Hadoop cluster to speed up the process.
References
1. Tiedje, J.M.: Microbial diversity: of value to whom. ASM News 60(10), 524–525
(1994)
2. Allsopp, D., Colwell, R.R., Hawksworth, D.L., et al.: Microbial Diversity and
Ecosystem Function: Proceedings of the IUBS/IUMS Workshop held at Egham,
UK, August 10-13. CAB INTERNATIONAL (1995)
3. Kaeberlein, T., Lewis, K., Epstein, S.S.: Isolating “uncultivable” microorganisms
in pure culture in a simulated natural environment. Science 296(5570), 1127–1129
(2002)
4. Trapnell, C., Salzberg, S.L.: How to map billions of short reads onto genomes.
Nature Biotechnology 27(5), 455 (2009)
5. Thomas, T., Gilbert, J., Meyer, F.: Metagenomics-a guide from sampling to data
analysis. Microb. Inform. Exp. 2(3) (2012)
6. Haubold, B., Reed, F.A., Pfaffelhuber, P.: Alignment-free estimation of nucleotide
diversity. Bioinformatics 27(4), 449–455 (2011)
7. Wooley, J.C., Godzik, A., Friedberg, I.: A primer on metagenomics. PLoS Compu-
tational Biology 6(2), e1000667 (2010)
8. Treangen, T.J., Salzberg, S.L.: Repetitive DNA and next-generation sequencing:
computational challenges and solutions. Nature Reviews Genetics 13(1), 36–46
(2012)
9. Otu, H.H., Sayood, K.: A new sequence distance measure for phylogenetic tree
construction. Bioinformatics 19(16), 2122–2130 (2003)
10. Li, C., Yang, Y., Jia, M., Zhang, Y., Yu, X., Wang, C.: Phylogenetic analysis
of DNA sequences based on k-word and rough set theory. Physica A: Statistical
Mechanics and its Applications 398, 162–171 (2014)
11. Nagar, A., Hahsler, M.: Genomic sequence fragment identification using quasi-
alignment. In: Proceedings of the International Conference on Bioinformatics, Com-
putational Biology and Biomedical Informatics, p. 359. ACM (2013)
12. Vinga, S., Almeida, J.: Alignment-free sequence comparison–a review. Bioinfor-
matics 19(4), 513–523 (2003)
13. Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M., Sun, F.: Alignment-free sequence
comparison based on next generation sequencing reads: Extended abstract. In:
Chor, B. (ed.) RECOMB 2012. LNCS, vol. 7262, pp. 272–285. Springer, Heidelberg
(2012)
14. Srinivasan, S.M., Guda, C.: MetaID: A novel method for identification and quan-
tification of metagenomic samples. BMC Genomics 14(8), 1–12 (2013)
15. Phillippy, A.M., Mason, J.A., Ayanbule, K., Sommer, D.D., Taviani, E., Huq, A.,
... Salzberg, S.L.: Comprehensive DNA signature discovery and validation. PLoS
Computational Biology 3(5), e98 (2007)
16. Phillippy, A.M., Ayanbule, K., Edwards, N.J., Salzberg, S.L.: Insignia: a DNA
signature search web server for diagnostic assay development. Nucleic Acids Re-
search 37(suppl. 2), W229–W234 (2009)
17. Satya, R.V., Kumar, K., Zavaljevski, N., Reifman, J.: A high-throughput pipeline
for the design of real-time pcr signatures. BMC Bioinformatics 11(1), 340 (2010)
18. Apache Hadoop available at http://hadoop.apache.org/
19. White, T.: Hadoop: The definitive guide. O’Reilly Media, Inc. (2012)
20. Cloudera Frequently Asked Questions (FAQs),
http://www.cloudera.com/content/cloudera/en/why-cloudera/
hadoop-and-big-data.html
21. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file
system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies
(MSST), pp. 1–10. IEEE (2010)
22. NoSQL Relational Database Management System homepage,
http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/Home%20Page
23. Michael, M., Moreira, J.E., Shiloach, D., Wisniewski, R.W.: Scale-up x scale-out:
A case study using nutch/lucene. In: IEEE International Parallel and Distributed
Processing Symposium, IPDPS 2007, pp. 1–8. IEEE (2007)
24. Bondi, A.B.: Characteristics of scalability and their impact on performance. In:
Proceedings of the 2nd International Workshop on Software and Performance,
pp. 195–203. ACM (2000)
25. Apache Hive available at http://hive.apache.org
26. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S.,
Liu, H., Murthy, R.: Hive-a petabyte scale data warehouse using hadoop. In: 2010
IEEE 26th International Conference on Data Engineering (ICDE), pp. 996–1005.
IEEE (2010)
27. Apache HBase available at http://hbase.apache.org
28. Karande, N.D.: Efficient indexing technique using bitmap indices for data ware-
houses. International Journal 1(4) (2013)
29. Bellatreche, L., Missaoui, R., Necir, H., Drias, H.: A data mining approach for
selecting bitmap join indices. JCSE 1(2), 177–194 (2007)
1 Introduction
Cell lineage trees encode the cell division events over time and can be represented
as binary trees. These trees challenge current machine learning techniques to give
These authors contributed equally.
M. Bursa et al. (Eds.): ITBAM 2014, LNCS 8649, pp. 15–29, 2014.
© Springer International Publishing Switzerland 2014
16 V. Khakhutskyy et al.
a broader view and a more accurate interpretation of the underlying cell devel-
opment processes. Our subject of interest is labeled lineage trees from cells of
the blood system as depicted in Figure 1. In this work, we use single-cell data
of time-lapse microscopy experiments encoded as trees with root nodes belong-
ing to blood progenitor cells that differentiate into more specialized cell types
(leaves). In particular, granulocyte-macrophage progenitor cells (GMPs) evolve
into mature macrophages (M) or granulocytes (G). Additionally, we measure a
fluorescence marker (LysM::GFP) that indicates whether a differentiation into
M or G has taken place [21]. However, this marker only implies if a cell has
lost its progenitor state but gives no information about its particular lineage.
Therefore, we aim to find differences in the lineage tree structures between the
two differentiation programs.
The differentiation process can be instructed by additional cytokines leading
to almost exclusively differentiated cells of one lineage [21]. To determine a
typical lineage-specific tree, we analyze lineage trees instructed to one or the
other lineage and calculate tree distances based on different metrics. Next, we
developed a method to assign a representative tree to every condition. This
enables us to distinguish different cell types just by looking at their
characteristic representatives. Furthermore, we developed a method to cluster a
set of lineage trees based on k-medoid methods, which, unlike k-means, are more
robust to the noise and outliers that are common in real biological datasets
like ours. With this technique we partition the data into naturally evolving
parts, allowing us to gain insights into typical lineage tree structures of
differentiating blood progenitor cells.
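To make the k-medoid idea concrete, here is a minimal sketch of a generic k-medoids loop over a precomputed distance matrix. It is not the authors' algorithm: the function names are hypothetical, the six items and their distances are invented numbers standing in for pairwise tree distances, and the sketch assumes that distinct items have strictly positive distance so that no cluster becomes empty.

```python
def assign(dist, medoids):
    """Assign every item to its nearest medoid; returns {medoid: members}."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters

def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = sorted(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid of a cluster: the member with the smallest total
        # distance to all other members of that cluster.
        updated = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)

# Hypothetical distances: six items placed at 0, 1, 2, 10, 11, 12 on a line.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
medoids, clusters = k_medoids(dist, initial_medoids=[0, 1])
print(medoids, clusters)
```

Because a medoid is always one of the input items, the representative of a cluster of trees is itself a tree, and only pairwise distances are needed, not a vector space.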
In short, the contributions of this paper are as follows:
– Tree Clustering: We find similarities between a set of trees covering the whole
pedigree of a progenitor cell.
– Representative Centroid Trees: We are able to generate a set of fitting cen-
troid trees that represent the characteristics of the underlying clusters.
Centroid Clustering of Cellular Lineage Trees 17
2 Related Work
Trees play an important role for the scientific areas which use tree structures
to describe observations, e.g. computational biology, structured text databases,
natural language processing, web mining, image analysis and computer vision,
pattern recognition as well as compiler optimization [11,4,8]. Especially the min-
ing of web data like xml-files [15,9] and decision tree clustering [19] is widely
discussed in literature.
All clustering methods discussed in this paper require a notion of distance
between trees. Unfortunately, the scientific community does not agree on one
established method of defining a metric between trees. One commonly used method
is the Tree Edit Distance (TED) [29]. Similar to the Levenshtein edit distance,
TED is defined as the minimal number of operations needed to transform one tree
into another. However, Arora et al. showed that for unordered labeled trees, as
considered in this paper, the calculation of TED is NP-hard, and even MAX
SNP-hard [2]. Moreover, applying the metrics for ordered labeled trees to
unordered trees would lead to a considerable loss of efficiency. Zhang [28]
suggested using constrained TED (cTED) to calculate the metrics for unordered
trees. cTED is a dynamic programming method that solves a large optimization
problem by breaking it down into smaller sub-problems. Another suitable method
to establish a metric for the space of unordered labeled trees was suggested by
Torsello et al. [25]. This method is based on the computation of a maximal
similarity (MaxSimilarity) common subtree between two trees. We will compare
cTED and four MaxSimilarity tree metrics in our evaluations.
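The analogy with the Levenshtein distance, and the dynamic programming idea of solving small sub-problems first, can be seen in the string case. The sketch below computes the classical Levenshtein distance between two strings; it is not TED or cTED, which operate on trees, and the example strings are arbitrary.

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb, or keep a match
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("ACGT", "AGT"))
```

Each table entry is derived from three smaller sub-problems, which is the same decomposition principle that the dynamic programming formulation of cTED relies on.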
Tree clustering for shape recognition was intensively studied in the group
around Torsello and Hancock. In 2001, Luo et al. used an EM-like algorithm for
clustering 2D binary shapes based on the edit distances of their shock-trees from
the Hamilton-Jacobi skeleton [14]. Since then, the group has published a number
of methods for tree clustering focusing on pattern recognition of 2D binary
shapes. It was also suggested to cluster trees after embedding them into a
so-called union tree space [24] or into the Euclidean space [26].
Graph clustering has gained interest in the machine learning community over
the last decade. It is related to the problem discussed in this paper since
trees can be considered as a special case of undirected acyclic labeled graphs.
A centroid-based k-means algorithm was suggested by Jain and Wysotzki [10].
Ferrer et al. discussed central clustering using k-medoids and k-medians
approaches [7]. Some methods aim to embed graphs into a metric vector space,
e.g. the spectral embedding method suggested by Luo, Wilson, and Hancock [13].
These algorithms, however, are not directly applicable to tree clustering
problems as the resulting mean or median graphs are not necessarily proper
trees. Moreover, as graphs are a more general data structure, algorithms for
distance calculation on graphs often require significantly higher computational
costs than their counterparts on trees.
We developed a method for finding centroid trees in a set of unordered la-
beled trees that has an intuitive interpretation, does not rely on a vector space
embedding, and can be used with different similarity metrics as we will show in
the experimental section.
Finally, we would like to mention that search for frequent common subtrees in
a tree database as a method to obtain a condensed representation of pattern in
trees has gained popularity in recent years [3,18,27]. The search algorithms are
tangential to our current research as they do not lead to clustering. However,
given a group of trees they could help one to find a meaningful interpretation of
the results.
Using W_σ(φ*), Torsello et al. define and prove the properties of the
MaxSimilarity metrics listed in Table 1.
Table 1. Different metrics used in this work to calculate distances between trees
Title: Ozymandias
Language: English
By IVAR JORGENSON
In any event, once we had thrashed out the matter of whether or not
we were going to stay here or pull up and head for the next planet on
our schedule, the five of us set to work. We knew we had only a
week—Mattern would never grant us an extension unless we came
up with something good enough to change his mind, which was
improbable—and we wanted to get as much done in that week as
possible. With the sky as full of worlds as it is, this planet might
never be visited by Earth scientists again.
Mattern and his men served notice right away that they were going
to help us, but reluctantly and minimally. We unlimbered the three
small halftracks carried aboard ship and got them into functioning
order. We stowed our gear—cameras, pick-&-shovels, camel's-hair
brushes—and donned our breathing-masks, and Mattern's men
helped us get the halftracks out of the ship and pointed in the right
direction.
Then they stood back and waited for us to shove off.
"Don't any of you plan to accompany us?" Leopold asked. The
halftracks each held up to four men.
Mattern shook his head. "You fellows go out by yourselves today and
let us know what you find. We can make better use of the time filing
and catching up on back log entries."
I saw Leopold start to scowl. Mattern was being openly
contemptuous; the least he could do was have his men make a
token search for fissionable or fusionable matter! But Leopold
swallowed down his anger.
"Okay," he said. "You do that. If we come across any raw veins of
plutonium I'll radio back."
"Sure," Mattern said. "Thanks for the favor. Let me know if you find a
brass mine, too." He laughed harshly. "Raw plutonium! I half believe
you're serious!"
We had worked out a rough sketch of the area, and we split up into
three units. Leopold, alone, headed straight due west, toward the dry
riverbed we had spotted from the air. He intended to check alluvial
deposits, I guess.
Marshall and Webster, sharing one halftrack, struck out to the hilly
country southeast of our landing point. A substantial city appeared to
be buried under the sand there. Gerhardt and I, in the other vehicle,
made off to the north, where we hoped to find remnants of yet
another city. It was a bleak, windy day; the endless sand that
covered this world mounted into little dunes before us, and the wind
picked up handfuls and tossed it against the plastite dome that
covered our truck. Underneath the steel cleats of our tractor-belt,
there was a steady crunch-crunch of metal coming down on sand
that hadn't been disturbed in millennia.
Neither of us spoke for a while. Then Gerhardt said, "I hope the
ship's still there when we get back to the base."
Frowning, I turned to look at him as I drove. Gerhardt had always
been an enigma: a small scrunchy guy with untidy brown hair
flapping in his eyes, eyes that were set a little too close together. He
had a degree from the University of Kansas and had put in some
time on their field staff with distinction, or so his references said.
I said, "What the hell do you mean?"
"I don't trust Mattern. He hates us."
"He doesn't. Mattern's no villain—just a fellow who wants to do his
job and go home. But what do you mean, the ship not being there?"
"He'll blast off without us. You see the way he sent us all out into the
desert, and kept his own men back. I tell you, he'll strand us here!"
I snorted. "Don't be a paranoid. Mattern won't do anything of the
sort."
"He thinks we're dead weight on the expedition," Gerhardt insisted.
"What better way to get rid of us?"
The halftrack breasted a hump in the desert. I kept wishing a vulture
would squeal somewhere, but there was not even that. Life had left
this world ages ago. I said, "Mattern doesn't have much use for us,
sure. But would he blast off and leave three perfectly good halftracks
behind? Would he?"
It was a good point. Gerhardt grunted agreement after a while.
Mattern would never toss equipment away, though he might not have
such scruples about five surplus archaeologists.
We rode along silently for a while longer. By now we had covered
twenty miles through this utterly barren land. As far as I could see,
we might just as well have stayed at the ship. At least there we had a
surface lie of building foundations.
But another ten miles and we came across our city. It seemed to be
of linear form, no more than half a mile wide and stretching out as far
as we could see—maybe six or seven hundred miles; if we had time,
we would check the dimensions from the air.
Of course it wasn't much of a city. The sand had pretty well covered
everything, but we could see foundations jutting up here and there,
weathered lumps of structural concrete and reinforced metal. We got
out and unpacked the power-shovel.
An hour later, we were sticky with sweat under our thin spacesuits
and we had succeeded in transferring a few thousand cubic yards of
soil from the ground to an area a dozen yards away. We had dug
one devil of a big hole in the ground.
And we had nothing.
Nothing. Not an artifact, not a skull, not a yellowed tooth. No spoons,
no knives, no baby-rattles.
Nothing.
The foundations of some of the buildings had endured, though
whittled down to stumps by a million years of sand and wind and
rain. But nothing else of this civilization had survived. Mattern, in his
scorn, had been right, I admitted ruefully: this planet was as useless
to us as it was to them. Weathered foundations could tell us little
except that there had once been a civilization here. An imaginative
paleontologist can reconstruct a dinosaur from a fragment of a thigh-
bone, can sketch out a presentable saurian with only a fossilized
ischium to guide him. But could we extrapolate a culture, a code of
laws, a technology, a philosophy, from bare weathered building
foundations?
Not very likely.
We moved on and dug somewhere else half a mile away, hoping at
least to unearth one tangible remnant of the civilization that had
been. But time had done its work; we were lucky to have the building
foundations. All else was gone.
"Boundless and bare, the lone and level sands stretch far away," I
muttered.
Gerhardt looked up from his digging. "Eh? What's that?" he
demanded.
"Shelley," I told him.
"Oh. Him."
He went back to digging.
Late in the afternoon we finally decided to call it quits and head back
to the base. We had been in the field for seven hours, and had
nothing to show for it except a few hundred feet of tridim films of
building foundations.
The sun was beginning to set; Planet Four had a thirty-five hour day,
and it was coming to its end. The sky, always somber, was darkening
now. There was no moon to be still as bright. Planet Four had no
satellites. It seemed a bit unfair; Three and Five of the system each
had four moons, while around the massive gas giant that was Eight a
cluster of thirteen moonlets whirled.
We wheeled round and headed back, taking an alternate route three
miles east of the one we had used on the way out, in case we might
spot something. It was a forlorn hope, though.
Six miles along our journey, the truck radio came to life. The dry,
testy voice of Dr. Leopold reached us:
"Calling Trucks Two and Three. Two and Three, do you read me?
Come in, Two and Three."
Gerhardt was driving. I reached across his knee to key in the
response channel and said, "Anderson and Gerhardt in Number
Three, sir. We read you."
A moment later, somewhat more faintly, came the sound of Number
Two keying into the threeway channel, and I heard Marshall saying,
"Marshall and Webster in Two, Dr. Leopold. Is something wrong?"
"I've found something," Leopold said.
From the way Marshall exclaimed "Really!" I knew that Truck
Number Two had had no better luck than we. I said, "That makes
one of us, then."
"You've had no luck, Anderson?"
"Not a scrap. Not a potsherd."
"How about you, Marshall?"
"Check. Scattered signs of a city, but nothing of archaeological
value, sir."
I heard Leopold chuckle before he said, "Well, I've found something.
It's a little too heavy for me to manage by myself. I want both outfits
to come out here and take a look at it."
"What is it, sir?" Marshall and I asked simultaneously, in just about
the same words.
But Leopold was fond of playing the Man of Mystery. He said, "You'll
see when you get here. Take down my coordinates and get a move
on. I want to be back at the base by nightfall."
At the base that night, Colonel Mattern and his seven aides were
remarkably curious about our day's activities. They tried to make it
seem as if they were taking a sincere interest in our work, but it was
perfectly obvious to us that they were simply goading us into telling
them what they had anticipated—that we had found absolutely
nothing. This was the response they got, since Leopold forbade
mentioning Ozymandias. Aside from the robot, the truth was that we
had found nothing, and when they learned of this they smiled
knowingly, as if saying that had we listened to them in the first place
we would all be back on Earth seven days earlier, with no loss.
The following morning after breakfast Mattern announced that he
was sending out a squad to look for fusionable materials, unless we
objected.
"We'll only need one of the halftracks," he said. "That leaves two for
you. You don't mind, do you?"
"We can get along with two," Leopold replied a little sourly. "Just so
you keep out of our territory."
"Which is?"
Instead of telling him, Leopold merely said, "We've adequately
examined the area to the southeast of here, and found nothing of
note. It won't matter to us if your geological equipment chews the
place up."
Mattern nodded, eyeing Leopold curiously as if the obvious
concealment of our place of operations had aroused suspicions. I
wondered whether it was wise to conceal information from Mattern.
Well, Leopold wanted to play his little game, I thought; and one way
to keep Mattern from seeing Ozymandias was not to tell him where
we would be working.
"I thought you said this planet was useless from your viewpoint,
Colonel," I remarked.
Mattern stared at me. "I'm sure of it. But it would be idiotic of me not
to have a look, wouldn't it—as long as we're spending the time here
anyway?"
I had to admit that he was right. "Do you expect to find anything,
though?"
He shrugged. "No fissionables, certainly. It's a safe bet that
everything radioactive on this planet has long since decomposed.
But there's always the possibility of lithium, you know."
"Or pure tritium," Leopold said acidly. Mattern merely laughed, and
made no reply.
Half an hour later we were bound westward again to the point where
we had left Ozymandias. Gerhardt, Webster and I rode together in
one halftrack, and Leopold and Marshall occupied the other. The
third, with two of Mattern's men and the prospecting equipment,
ventured off to the southeast toward the area Marshall and Webster
had fruitlessly combed the day before.
Ozymandias was where we had left him, with the sun coming up
behind him and glowing round his sides. I wondered how many
sunrises he had seen. Billions, perhaps.
We parked the halftracks not far from the robot and approached,
Webster filming him in the bright light of morning. A wind was
whistling down from the north, kicking up eddies in the sand.
"Ozymandias have remain here," the robot said as we drew near.
In English.
For a moment we didn't realize what had happened, but what
followed afterward was a five-man quadruple-take. While we gabbled