Miroslav Bursa
Sami Khuri
M. Elena Renda (Eds.)

Information Technology
in Bio- and
Medical Informatics

5th International Conference, ITBAM 2014
Munich, Germany, September 2, 2014
Proceedings

LNCS 8649
Lecture Notes in Computer Science 8649
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Volume Editors
Miroslav Bursa
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Technicka 2
166 27 Prague 6, Czech Republic
E-mail: [email protected]
Sami Khuri
San Jose State University
Department of Computer Science
One Washington Square
San Jose, CA 95192-0249, USA
E-mail: [email protected]
M. Elena Renda
Istituto di Informatica e Telematica del CNR
Via G. Moruzzi 1
56124 Pisa, Italy
E-mail: [email protected]
General Chair
Christian Böhm University of Munich, Germany
Program Committee
Werner Aigner FAW, Austria
Fuat Akal Functional Genomics Center Zurich,
Switzerland
Tatsuya Akutsu Kyoto University, Japan
Andreas Albrecht Queen’s University Belfast, UK
Peter Baumann Jacobs University Bremen, Germany
Balaram Bhattacharyya Visva-Bharati University, India
Veselka Boeva Technical University of Plovdiv, Bulgaria
Roberta Bosotti Nerviano Medical Science s.r.l., Italy
Rita Casadio University of Bologna, Italy
Sònia Casillas Universitat Autònoma de Barcelona, Spain
Kun-Mao Chao National Taiwan University, Taiwan
Vaclav Chudacek Czech Technical University in Prague,
Czech Republic
Hans-Dieter Ehrich Technical University of Braunschweig,
Germany
Christoph M. Friedrich University of Applied Sciences Dortmund,
Germany
Alejandro Giorgetti University of Verona, Italy
Jan Havlik Dep. of Circuit Theory, FEE, Czech Technical
University in Prague, Czech Republic
Volker Heun Ludwig-Maximilians-Universität München,
Germany
Larisa Ismailova NRNU MEPhI, Russia
Alastair Kerr University of Edinburgh, UK
Poster Session
Knowledge Reasoning Model to Support Clinical Decision Making  75
Qingshan Li, Jing Feng, Lu Wang, Hua Chu, and WeiJuan Fu
1 Introduction
In the age of Whole Genome Shotgun (WGS) sequencing and information technology, the development of new techniques and applications in biology for studying microorganisms is in high demand in both the clinical and environmental communities. The number of existing microbial species is estimated at 10^5 to 10^6 [1, 2].
This work was performed while Ramin Karimi was visiting the LIAS/ISAE-ENSMA Lab. The visit was funded by the ERASMUS mobility program. The work was also supported in part by the projects TMOP-4.2.2.C-11/1/KONV-2012-0001 and TMOP-4.2.4.A/2-11-1-2012-0001, supported by the European Union and co-financed by the European Social Fund, and by the OTKA grant NK101680.
M. Bursa et al. (Eds.): ITBAM 2014, LNCS 8649, pp. 1–14, 2014.
© Springer International Publishing Switzerland 2014
2 R. Karimi et al.
The majority (> 99%) of microorganisms from the environment resist cultivation in the laboratory [3], so until a few years ago it was impossible to investigate them. With the advances in next-generation sequencing (NGS) and Metagenomics techniques of the last few years, it is now possible to obtain the genetic content of all organisms, together with their complex communities, directly from the natural environment in which they normally live.
The output of sequencing technology consists of short fragments of DNA sequence, called short reads, ranging from 25 to 900 base pairs (bp) in length. Read lengths vary from one sequencing technology to another. For instance, sequencing machines made by Illumina, Applied Biosystems (ABI), and Helicos of Cambridge produce short reads of 25 to 100 bp.
Long DNA molecules extracted from the sample are broken into smaller pieces by special fragmentation and cloning techniques. These small pieces are then fed into the sequencer, which determines the order of nucleotides in the short fragments of DNA [4]. The sequencing output for a Metagenome sample is an enormous data set containing the short reads of hundreds to thousands of known and unknown organisms. Efficient implementations that facilitate the analysis process are urgently required in both the biological and computational parts of any Metagenomics project. Figure 1 details the steps involved in a typical sequence-based Metagenome project [5].
Fig. 1. A typical Metagenome project flow diagram. Dashed arrows indicate steps that
can be omitted.
BINOS4DNA: Bitmap Indexes and NoSQL for Identifying Species 3
2 Background
In this section, we review the technologies and the concepts that we use in our
methodology.
2.1 Apache Hadoop
Advances in parallel and distributed computing have opened new doors for many researchers who cannot access high-performance computers (HPC). The Apache Hadoop software library [18–20] is an open-source framework, written in Java, that applies parallel and distributed computing to run simple programming models on large data sets across the nodes of a cluster. The idea behind Hadoop is to store and process big data on clusters of commodity hardware instead of expensive high-performance computers, which are not available to everybody.
Hadoop handles any type of data, structured or unstructured: text files, log files, images, audio files, communication records, and so on. A Hadoop cluster has a single Master node and several Slave nodes; it can run as a single-node cluster or as a multi-node cluster with thousands of nodes. The Hadoop core has two components: the Hadoop Distributed File System (HDFS) and MapReduce.
2.2 MapReduce
MapReduce is a programming model for data processing. It works by breaking the process into two phases: the map phase and the reduce phase [18, 19]. The two main components of MapReduce are:
– JobTracker: As the master, it schedules the map and reduce tasks on the TaskTrackers and monitors their progress.
– TaskTracker: As the slave, it receives mapper and reducer tasks from the JobTracker and returns the results to the JobTracker after execution.
Hadoop is highly fault tolerant. To guard against failures, HDFS stores multiple copies of each data block, three copies by default. The NameNode detects failures of DataNodes or blocks, and the JobTracker detects failures of TaskTrackers and reassigns their tasks.
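The map and reduce phases can be mimicked on a single machine with ordinary Unix pipes, which is a common way to prototype a MapReduce job before submitting it to a cluster; the toy read counts below are only an illustration, not part of our pipeline:

```shell
# Map phase: emit a (read, 1) pair for every short read.
# Shuffle: 'sort' groups identical keys together, as Hadoop does between phases.
# Reduce phase: sum the counts for each distinct read.
printf 'ACGT\nTTGA\nACGT\nACGT\nTTGA\n' \
  | awk '{print $0 "\t1"}' \
  | sort \
  | awk -F'\t' '{count[$1] += $2} END {for (k in count) print k "\t" count[k]}' \
  | sort
```

This prints each distinct read with its frequency (here ACGT 3, TTGA 2); on a cluster, the two awk stages would run as the mapper and reducer tasks.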
2.3 NoSQL
"NoSQL" stands for Not Only SQL. The term "NoSQL" was first used by Carlo Strozzi in 1998 [22] and refers to non-relational databases [27]. One important aspect of NoSQL is its ability to run database analytics over big data sets on parallel and distributed platforms, such as Hadoop, on commodity hardware. Hive and HBase are NoSQL applications built on top of the Apache Hadoop file system. NoSQL databases can handle unstructured data such as text files, log files, email, social media, and multimedia. Horizontal scaling, one of the most important features of NoSQL databases, allows us to add more nodes to a distributed system, whereas vertical scaling only increases the power of an existing machine [23, 24].
2.4 Hive
Hive [19], [25] is a data warehousing infrastructure on top of Hadoop and HDFS. HiveQL, an SQL-like language, simplifies querying large unstructured datasets in distributed storage. Hive is designed for write-once, read-many workloads; real-time queries and row-level updates are not possible. Hive is easy to adopt for anyone familiar with SQL queries. The Facebook Data Infrastructure Team started building Hive in January 2007 to bring the familiar concepts of tables, columns, partitions, and a subset of SQL to the unstructured world of Hadoop, and it was open-sourced in August 2008 [26]. Hive has supported the bitmap index since version 0.8.0.
2.5 Bitmap Index
A bitmap index is best suited to a column with low cardinality, such as a column Grade. In this case the index has the same number of rows as the table, and the number of columns is equal to the number of distinct values in the column Grade. In Table 1, the cardinality of the column Grade is 4, because the column contains 4 different values.
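The same construction can be sketched with a few lines of awk; the grade values and file names below are invented for illustration:

```shell
# A toy table column with low cardinality (4 distinct values: A, B, C, D).
printf 'A\nB\nA\nD\nC\nB\n' > grades.txt

# One bitmap (0/1 column) per distinct value: row i of bitmap_X.txt is 1
# iff row i of grades.txt equals X. Together these columns form the index.
for g in A B C D; do
  awk -v g="$g" '{ print (($1 == g) ? 1 : 0) }' grades.txt > "bitmap_$g.txt"
done

# Rows with Grade = B are now found by scanning a single 0/1 column.
awk '$1 == 1 {print NR}' bitmap_B.txt
```

The last command prints the row numbers 2 and 6; this is exactly the lookup pattern our index tables in Section 3 rely on.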
2.6 HBase
HBase [27] is a type of NoSQL database. It is an open-source, distributed, column-oriented, and scalable database built on top of the Hadoop file system. It is designed for random, real-time read/write access to very large tables, with billions of rows and millions of columns, on commodity hardware.
3 Our Methodology
We downloaded all complete bacterial genomes from the National Center for Biotechnology Information (NCBI) database [30]. At the time of download (16.01.2014), the data set comprised 2773 bacterial species and subspecies.
3.1 Insignia
Insignia is a pipeline for generating unique DNA signatures, as well as a database and web application for obtaining them. It contains signatures for 11,274 viruses/phages and 2,653 non-viral organisms, with lengths between 18 and 500 bp. Insignia detects signatures for designing primers for the Polymerase Chain Reaction (PCR) and probes for microarray technologies. The signatures can also be used for real-time identification of species in microbial and viral assays [15, 16], [31].
We downloaded DNA signatures for two groups of 50 bacteria from the Insignia database. As we are still in the testing phase, we downloaded only the signatures of length 18 bp. As an example, Table 2 shows the head of the Acholeplasma laidlawii DNA signature file.
Insignia V0.7
Signatures calculated: Thu Mar 6 2014 10:58:06
Reference Organism:
Acholeplasma laidlawii PG-8A
Target Organism(s):
Signatures:
Index  Start    Stop     Sequence
63965 451703 451720 ACATAAGCAGGTGCGGAA
63966 670606 670623 GATACCAATACCGCAGAT
63967 692909 692926 CCCATTCAACTTCGATCA
63968 530281 530298 ATCAACGCTAGATGAGCA
63969 268209 268226 ATGGAGGAGTCTGGATAC
63970 69763 69780 ACAGCAACAGCGTATATC
63971 357337 357354 GTGTTAGCGTTAAGTCTG
63972 1001550 1001567 TAGCCTCTTTAAGCAGGT
63973 1366201 1366218 ATGATGCAAGTGGCATGG
63974 1141698 1141715 TGCAACGGATGCATCAAG
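Only the sequence column of such a file is needed for matching. Assuming the layout shown above (header lines followed by four-field data rows), a small awk filter extracts it; the file names are illustrative:

```shell
# Build a miniature signature file in the Insignia layout shown above.
cat > signatures.txt <<'EOF'
Insignia V0.7
Reference Organism:
Acholeplasma laidlawii PG-8A
Index Start stop sequence
63965 451703 451720 ACATAAGCAGGTGCGGAA
63966 670606 670623 GATACCAATACCGCAGAT
EOF

# Keep only data rows (numeric first field, exactly four fields) and print
# the 18-bp sequence column; all header lines are skipped.
awk '$1 ~ /^[0-9]+$/ && NF == 4 {print $4}' signatures.txt > signatures.seq
cat signatures.seq
```

The resulting signatures.seq contains one 18-bp signature per line, ready to be searched against the short reads.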
3.2 Metasim
Metasim is a sequencing simulator application for genomics and Metagenomics
studies. It can be a great help to develop and improve Metagenomics tools, and
for planning Metagenomics projects [32, 33]. Metasim can simulate the short
reads of Roches 454 pyrosequencing, Sanger sequencing and Empirical sequenc-
ing technology. In this paper, we use Roches 454 pyrosequencing simulation.
The output of Metasim is a compressed file containing the short reads of a
bacterial chromosome or one of its Plasmids and their information.
>r16.1|SOURCES={GI=11497281,bw,1947919816}|ERRORS={8 1:C,46:
,135 1:T,160 1:A,190 1:G}|SOURCE 1=”Borrelia burgdorferi B31 plasmid cp32-8”
(44840ff90be8dcf7b704d6908ca095d559d2949e)
TTTAGGATTCGTACCCGTTTTCTTCTAATTTTTTCCTAGTGTTGTATGAATTT
CTTTTAATTTTTTTTGTTTTTCTTTCATGCAAGATTTTTTTATATTGAATTTT
TTTATTAGGGCAATTTCATTTTGTTTTAAGTATATTTATTGCCTCAATCTTAG
TATACTTTATCAATATTTAAATACAAAATAGAAAGGAGCTTCTTCCGTTTTAA
AGTTACAATTATTGAAATAATTTCTTAGTTGATATTTTTCTATTTCTTTAATC
TTTCTTTCTTCTTTTATATTATTTTTATTA
We chose 100 bacterial genomes from the NCBI data set for simulating the short reads. The first group of 50 bacteria from the Insignia database is contained in the 100 chosen bacterial genomes, while the second group consists of bacteria outside these 100.
Table 3. An example for our index tables; each column of these tables is kept as a
single file
Another way is to keep each bacterium as a column: we store '1' if any signature of the bacterium occurs in a short read, and '0' otherwise. In this case the table is much smaller. The number of columns is equal to the number of bacteria plus two more, one for the row identification and one for the short reads. The number of rows is equal to the number of short reads.
We can easily use the Linux paste command to put all the files together as a single file. As an example, in Table 4 we have 6 files: one contains the reads and their identification numbers, and the other five contain the '0'/'1' flags for five bacteria.
Rowid Read b1 b2 b3 b4 b5
1 R1 0 0 0 1 0
2 R2 1 0 0 0 0
3 R3 0 0 0 1 0
4 R4 0 0 0 0 0
5 R5 0 0 1 0 0
6 R6 0 0 0 0 1
7 R7 1 0 0 0 0
8 R8 0 0 0 0 0
9 R9 0 0 0 0 0
10 R10 1 0 0 0 0
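The merge itself is a one-liner; the three-read, two-bacteria files below are toy stand-ins for the real inputs:

```shell
# Toy per-column files: a tab-separated (Rowid, read) file and one
# 0/1 flag file per bacterium, one line per short read.
printf '1\tR1\n2\tR2\n3\tR3\n' > readsid.txt
printf '0\n1\n0\n' > b1.txt
printf '1\n0\n0\n' > b2.txt

# paste joins the files column by column into the file loaded into Hive.
paste readsid.txt b1.txt b2.txt > newFile.txt
cat newFile.txt
```

Each output line has the Rowid, the read, and one flag per bacterium, matching the layout of Table 4.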
Then we create our table in Hive according to the structure of newFile.txt:
hive> CREATE TABLE testTable1 (rid INT, reads STRING, b1 INT, b2 INT,
      b3 INT, b4 INT, b5 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
      '\t' STORED AS TEXTFILE;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/path to local dir for output'
      SELECT testTable1.rid FROM testTable1 WHERE b1 = 1;
The output file contains the Rowid numbers of the matching short reads. As an example, for bacterium b1 in Table 4 we obtain 2, 7, 10, which means that the signatures of bacterium b1 occur in these three short reads, and hence that bacterium b1 is present in the Metagenome sample.
In this approach, we have to repeat the query for each bacterium one by one, or write a long query and a long table-creation command. Hence, the larger the number of bacteria, the longer the implementation.
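One way around the hand-editing is to generate the 50 per-column queries with a short bash loop and submit them as a single script; the output directory is hypothetical, and the hive invocation is left commented out:

```shell
# Emit one INSERT OVERWRITE ... SELECT per bacterium column b1..b50
# into a single HiveQL script.
for i in $(seq 1 50); do
  printf "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out_b%d'\n" "$i"
  printf "  SELECT testTable1.rid FROM testTable1 WHERE b%d = 1;\n" "$i"
done > queries.hql

# hive -f queries.hql   # submit all 50 queries in one invocation
```

This removes the manual editing but still runs one MapReduce job per bacterium, which motivates the single-column layout described next.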
There is a better solution that avoids repeating queries or writing long commands: we can append the '0'/'1' files of all bacteria one after the other into a single column with the cat command. For a large number of bacteria, a short bash loop can append as many bacteria as we need, one after the other, quickly.
In this method, we also need to repeat the short reads in a single column as many times as there are bacteria. For instance, if we have 500,000 short reads and 1000 bacteria, then we repeat the short-read column 1000 times with the cat command, for a total of 500,000,000 rows.
Next, we create a table with 3 columns (rid INT, reads STRING, b INT) and run the query just once. The results are returned in one column, and the Rowid numbers let us easily recover which bacterium each hit belongs to. This leads to a larger file but a faster implementation; after obtaining the results, we can delete these large tables.
We created testtable1 with 52 columns (rid INT, reads STRING, b1 INT,...,
b50 INT) and testtable2 with 3 columns (rid INT, reads STRING, b INT)
in Hive. We have short reads of 100 bacteria and two groups of 50 bacterial
signatures.
As we are still in the testing phase, and we use a plain Java program without Hadoop and MapReduce to search for the signatures in the short reads and create our index files (tables), we chose only 10% of the short reads at random.
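For small tests, the per-bacterium 0/1 column can even be produced without Java, using grep -F for fixed-string matching; the reads and signatures below are toy examples, not taken from our data:

```shell
# One short read per line, and two 18-bp signatures of one bacterium.
printf 'TTACATAAGCAGGTGCGGAATT\nAAAACCCCGGGGTTTT\nGGGATACCAATACCGCAGATCC\n' > reads.txt
printf 'ACATAAGCAGGTGCGGAA\nGATACCAATACCGCAGAT\n' > sigs_b1.txt

# For each read, print 1 if it contains any signature as a substring,
# and 0 otherwise; the output is the bacterium's 0/1 index column.
while IFS= read -r r; do
  if printf '%s\n' "$r" | grep -qFf sigs_b1.txt; then
    echo 1
  else
    echo 0
  fi
done < reads.txt > b1.txt
cat b1.txt
```

Here the first and third reads contain a signature, so b1.txt holds 1, 0, 1; the real Java/MapReduce implementation performs the same membership test at scale.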
Our future work is to implement this step as a MapReduce job in our Java program and to run it on a multi-node Hadoop cluster in order to speed it up.
We used the awk command to add a row identification (Rowid) to the file containing the short reads (132,705 reads):
awk ’BEGIN{i=1} {if($0 !~ /^$/) {printf ("%d\t%s \n",i,$0); i++}
else { print $0} }’ reads.txt >> readsid.txt
We merged this file and all 50 index files into a single file with the paste command and loaded it into testtable1 in Hive. Then we used queries to search the table; we performed this process for both groups of 50 bacteria.
For the second table (testtable2), we concatenated all 50 bacteria in order as one column in a single file, and likewise repeated the short reads 50 times in a single column, both with the cat command. Then we added Rowids to the short reads (6,635,250 rows), pasted these three columns into one file, and loaded it into testtable2. In this case, only one query is needed to obtain the results. This is a good test of the speed and efficiency of Hive when searching millions of rows with bitmap index techniques.
There is also the possibility of integrating Hive and HBase. This feature allows HiveQL statements to access HBase tables for both reads (SELECT) and writes (INSERT). It is even possible to combine access to HBase tables with native Hive tables via joins and unions [34]. Real-time reading and writing are possible in HBase; these features would allow us to update the tables and obtain a faster implementation.
4 Experimental Study
All these implementations were run on an Intel dual-core CPU with 4 GB of RAM, Ubuntu 13.10, a single-node Hadoop-1.2.1 cluster, and Hive-0.11.0. Table 5 shows the elapsed time for our first implementation on testtable1, with 52 columns, 132,705 rows, and a loaded file size of 44.6 MB. We repeated the query for all 50 columns; the time spent changing and re-issuing the queries is not included.
Table 5. Time taken for running the Hive query on testtable1 columns. Total time is
1065.927 Sec.
b1: 25.543 Sec. b11: 21.356 Sec. b21: 22.065 Sec. b31: 21.089 Sec. b41: 21.081 Sec.
b2: 22.236 Sec. b12: 21.120 Sec. b22: 21.116 Sec. b32: 22.013 Sec. b42: 21.016 Sec.
b3: 22.224 Sec. b13: 21.123 Sec. b23: 21.017 Sec. b33: 21.074 Sec. b43: 22.062 Sec.
b4: 21.187 Sec. b14: 22.065 Sec. b24: 20.977 Sec. b34: 20.991 Sec. b44: 21.000 Sec.
b5: 22.322 Sec. b15: 21.090 Sec. b25: 21.062 Sec. b35: 21.277 Sec. b45: 21.036 Sec.
b6: 21.167 Sec. b16: 21.083 Sec. b26: 21.057 Sec. b36: 20.010 Sec. b46: 21.009 Sec.
b7: 20.049 Sec. b17: 21.048 Sec. b27: 21.188 Sec. b37: 20.986 Sec. b47: 21.002 Sec.
b8: 20.048 Sec. b18: 21.123 Sec. b28: 21.108 Sec. b38: 22.063 Sec. b48: 20.997 Sec.
b9: 21.091 Sec. b19: 21.003 Sec. b29: 20.991 Sec. b39: 21.110 Sec. b49: 21.029 Sec.
b10: 22.373 Sec. b20: 21.072 Sec. b30: 21.081 Sec. b40: 20.952 Sec. b50: 22.136 Sec.
This implementation was for the first group of 50 bacteria, which are contained in the 100 bacterial samples. As expected, we found short reads containing the signatures of every bacterium; the number of matching short reads ranges from 1 for b16 to 812 for b4.
Also as expected, for the second group of 50 bacteria, which is disjoint from the 100 samples, we found no short reads containing the signatures. The average query time was almost the same as for the first group.
The computational times for the second implementation on testtable2, with 3 columns, 6,635,250 rows, and a loaded file size of 1.6 GB, are:
File size: 1.6 GB
Loading data to testtable2
Time taken: 45.588 seconds