Matthias Hochsmann - The Tree Alignment Model: Algorithms, Implementations and Applications For The Analysis of RNA Secondary Structures
$\Sigma_{RNA} = \{A, C, G, U\}$ is the RNA alphabet consisting of the bases adenine, cytosine, guanine and uracil. Sequences, or equivalently strings or words, are written by juxtaposition of characters. In particular, let $\varepsilon$ denote the empty character, also referred to as the gap character, which acts as the neutral element of the juxtaposition, i.e. $a\varepsilon = \varepsilon a = a$. The set $\Sigma^*$ of strings over $\Sigma$ is defined by

$$\Sigma^* = \bigcup_{i \geq 0} \Sigma^i,$$

where $\Sigma^0 = \{\varepsilon\}$ and $\Sigma^{i+1} = \{aw \mid a \in \Sigma, w \in \Sigma^i\}$. The empty sequence that contains no characters or only gap characters is denoted by $\varepsilon$. I define the tuple alphabet as $\Sigma^n = \{(a_1, a_2, \ldots, a_n) \mid a_1, a_2, \ldots, a_n \in \Sigma\}$. For some $\alpha \in \Sigma^n$, $\alpha_i$ identifies the $i$th component of $\alpha$. The symbols $a, b, c, d$ refer to characters and $S, S_1, S_2, \ldots, S_n$ to sequences, unless stated otherwise.
The length of a string $S$, denoted by $|S|$, is the number of characters in $S$. I make no distinction between a character and a string of length one. If $S = uvw$ for some (possibly empty) strings $u$, $v$ and $w$, then
$u$ is a prefix of $S$,
$v$ is a substring of $S$, and
$w$ is a suffix of $S$.
A prefix or suffix of $S$ is proper if it is different from $S$. $S[i]$ is the $i$-th character of $S$. $S[i, j]$ is the substring of $S$ beginning at $S[i]$ and ending at $S[j]$. If $i > j$, then $S[i, j]$ is the empty string.
2.1.3 Trees and Forests
Generally, a tree is an acyclic connected graph. I consider rooted, ordered,
node-labeled trees, called trees for short. A distinguished node, the root node,
imposes a partial ancestor-descendant relation on the tree nodes. Every path
that begins at the root node and visits each node at most once ends in a node
where it cannot be extended further, a leaf node.
A node $v$ is a descendant of a node $w$ if $v$ appears after $w$ on such a path.
Conversely, $w$ is an ancestor of $v$. If $v$ and $w$ are directly connected by an
edge, $w$ is the parent of $v$ and $v$ is a child of $w$. Two nodes are siblings if
they have the same parent node. The last common ancestor of $v$ and $w$,
denoted by $lca(v, w)$, is the node $p$ that is an ancestor of $v$ and $w$ such that
no descendant of $p$ is also an ancestor of both $v$ and $w$. A tree is ordered if
the order among sibling nodes matters, i.e. there exists an order relation for
each set of sibling nodes. An ordered forest is a sequence of trees, called
forest for short. A function $label$ assigns a character from some alphabet
$\Sigma$ to each node in a forest. I use $\mathcal{T}(\Sigma)$ and $\mathcal{F}(\Sigma)$ for
the set of $\Sigma$-labeled trees and forests, respectively. The empty tree and the
empty graph which contain no nodes are denoted by . Where convenient, I
identify a tree with the forest containing only this tree.
Since a tree is a special case of a forest, I give the following definitions in terms of forests: Let $F$ be a forest. $V(F)$ denotes the set of nodes in $F$. The size of $F$, denoted by $|F|$, is the number of its nodes. The number of leaf nodes is referred to as $leaves(F)$. The length of the longest path from a root to a leaf is the depth of $F$, denoted by $depth(F)$. The preorder index of a node in a tree is its position in the sequence of nodes that is obtained by the following procedure: First, visit the root node. Second, apply this procedure recursively to the trees induced by the child nodes according to their left-to-right order. For forests, the preorder index is defined by the same procedure assuming a virtual root node that is not counted in the indexing. $pre_F(v)$ denotes the preorder index of node $v$ in $F$.
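The forest notions above can be made concrete with a small sketch. The `Node` class and all function names are hypothetical and chosen only for illustration. Two conventions are assumptions, since the text leaves them open: depth is counted in edges, and preorder indices start at 1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


Forest = List[Node]  # an ordered forest is a sequence of trees


def size(forest: Forest) -> int:
    """Number of nodes in the forest."""
    return sum(1 + size(tree.children) for tree in forest)


def leaves(forest: Forest) -> int:
    """Number of leaf nodes in the forest."""
    return sum(leaves(tree.children) if tree.children else 1 for tree in forest)


def depth(forest: Forest) -> int:
    """Longest root-to-leaf path, counted in edges (assumed convention)."""
    return max(
        ((1 + depth(tree.children)) if tree.children else 0 for tree in forest),
        default=0,
    )


def preorder(forest: Forest) -> List[Node]:
    """Nodes in preorder: visit a root, then its child trees from left to right."""
    order: List[Node] = []
    for tree in forest:
        order.append(tree)
        order.extend(preorder(tree.children))
    return order


def preorder_index(forest: Forest, node: Node) -> int:
    """1-based preorder index of a node; the virtual root is not counted."""
    for position, candidate in enumerate(preorder(forest), start=1):
        if candidate is node:
            return position
    raise ValueError("node is not part of the forest")


# Example forest with two trees: a(b, c(d)) and e
example = [Node("a", [Node("b"), Node("c", [Node("d")])]), Node("e")]
print(size(example), leaves(example), depth(example))  # 5 3 2
print([node.label for node in preorder(example)])      # ['a', 'b', 'c', 'd', 'e']
```

Representing a forest as a plain list of root nodes mirrors the definition of a forest as a sequence of trees, so a single tree is simply a list with one element.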
I now give definitions of substructures in trees and forests: A subtree at
node $v$ of $F$ consists of node $v$ and all its descendants. Two subtrees are
siblings if their root nodes are siblings. A subforest is a sequence of sibling
subtrees. A tree pattern is a subtree T
can
be removed.
2.1.4 The Sequence Edit Distance
A fundamental model for approximate string comparison is the model of edit
distance [113, 171, 213]. It measures the distance between strings in terms
of edit operations, that is, deletions, insertions, and replacements of single
characters. Two strings are compared by determining a sequence of edit
operations that converts one string into the other and minimizes the sum of
the costs of edit operations. Nowadays, the edit distance between strings is
basic knowledge in computational biology and is an integral part of numerous
textbooks, lectures and seminars. I give a brief introduction based on [108].
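The notion of edit operations is the key to the edit distance model. I define the alignment alphabet $\Sigma^n_\varepsilon = (\Sigma \cup \{\varepsilon\})^n \setminus \{\varepsilon\}^n$. An edit operation is a pair $(\alpha, \beta) \in \Sigma^2_\varepsilon$, written $\alpha \to \beta$. An alignment $A$ of two strings $u$ and $v$ is a sequence $(\alpha_1 \to \beta_1, \ldots, \alpha_h \to \beta_h)$ of edit operations such that $u = \alpha_1 \ldots \alpha_h$ and $v = \beta_1 \ldots \beta_h$.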
Note that the unique alignment of $\varepsilon$ and $\varepsilon$ is the empty alignment, that is, the empty sequence of edit operations. An alignment is usually written by placing the characters of the two aligned strings on different lines, with inserted dashes - denoting $\varepsilon$. In such a representation, every column represents an edit operation.
The alignment $A = (\varepsilon \to d, b \to b, c \to a, \varepsilon \to d, a \to a, c \to \varepsilon, d \to d)$ of the sequences $u = bcacd$ and $v = dbadad$ is written as follows:

    - b c - a c d
    d b a d a - d

The notion of optimal alignment requires some scoring or optimization
criterion. This is given by a cost function.
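A cost function $\delta$ assigns to each edit operation $\alpha \to \beta$, $\alpha \neq \beta$, a positive real cost $\delta(\alpha \to \beta)$. The cost $\delta(\alpha \to \alpha)$ of an edit operation $\alpha \to \alpha$ is 0. If $\delta(\alpha \to \beta) = \delta(\beta \to \alpha)$ for all edit operations $\alpha \to \beta$ and $\beta \to \alpha$, then $\delta$ is symmetric. $\delta$ is extended to alignments in a straightforward way: The cost $\delta(A)$ of an alignment $A = (\alpha_1 \to \beta_1, \ldots, \alpha_h \to \beta_h)$ is the sum of the costs of the edit operations $A$ consists of. More precisely,

$$\delta(A) = \sum_{i=1}^{h} \delta(\alpha_i \to \beta_i).$$
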
The unit cost function scores zero for matches and one otherwise. The edit distance of $S_1$ and $S_2$, denoted by $\delta_{SE}(S_1, S_2)$, is the minimum possible cost of an alignment of $S_1$ and $S_2$. That is,

$$\delta_{SE}(S_1, S_2) = \min\{\delta(A) \mid A \text{ is an alignment of } S_1 \text{ and } S_2\}. \tag{2.1}$$

An alignment $A$ of $S_1$ and $S_2$ is optimal if $\delta(A) = \delta_{SE}(S_1, S_2)$. Note that there can be more than one optimal alignment. If $\delta$ satisfies the mathematical axioms of a metric, then $\delta_{SE}$ is a metric.
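The minimization in Equation (2.1) is usually solved by dynamic programming over prefixes of the two strings. The following is a minimal sketch of that standard technique, not code from this thesis. The names `edit_distance`, `unit_cost` and `GAP` are hypothetical, and the sketch assumes that the dash does not occur in the input strings.

```python
from typing import Callable

GAP = "-"  # placeholder for the empty character in an edit operation


def unit_cost(a: str, b: str) -> int:
    """Unit cost function: 0 for a match, 1 for every other edit operation."""
    return 0 if a == b else 1


def edit_distance(s1: str, s2: str,
                  cost: Callable[[str, str], float] = unit_cost) -> float:
    """Minimum cost of an alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    # d[i][j] = minimum alignment cost of the prefixes s1[:i] and s2[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(s1[i - 1], GAP)  # deletions only
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost(GAP, s2[j - 1])  # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + cost(s1[i - 1], s2[j - 1]),  # replacement or match
                d[i - 1][j] + cost(s1[i - 1], GAP),            # deletion
                d[i][j - 1] + cost(GAP, s2[j - 1]),            # insertion
            )
    return d[n][m]


print(edit_distance("bcacd", "dbadad"))  # 4 under unit costs
```

For the strings $u = bcacd$ and $v = dbadad$ the example alignment shown earlier has cost 4 under unit costs, so it is one of the optimal alignments.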
2.2 Primary, Secondary and Tertiary Structure of RNA
RNA molecules can be formally described on different levels of abstraction. In messenger RNA (mRNA), coding regions of RNA molecules determine the sequence of amino acids in proteins, which in turn determines the protein structure. This information, the primary structure of an RNA molecule, is carried as a sequence of nucleotides (bases) over the four-letter alphabet {A, C, G, U}. RNA molecules have the tendency to form a three-dimensional conformation, the tertiary structure. By folding back onto itself, an RNA molecule forms structure, stabilized by the forces of hydrogen bonds between certain pairs of bases and by dense stacking of neighboring base-pairs. The base-pairs GC, AU and GU, in order of their strength, are called canonical base-pairs. In fact, almost every other base-pair combination can exist and has been observed in nature, but their contribution to the stability of the molecule is minor in comparison with the canonical base-pairs. External factors like cellular RNAs and proteins also influence the structure. Crystallographic studies by X-ray diffraction and nuclear magnetic resonance (NMR) can reveal the tertiary structure of an RNA molecule with high accuracy [89, 100]. Although great progress has been made, crystallographic studies are still time-consuming and expensive. Moreover, tertiary structure eludes efficient algorithms for structure prediction and comparison. In particular, these problems are reported to be NP-hard for tertiary structures [94, 122]. From a biological viewpoint, RNA tertiary structure is likely formed hierarchically: first, stable stems are formed, and afterward tertiary interactions are built. The strength of additional tertiary interactions is thought to be too small to significantly change the secondary structure conformation [13, 152, 156, 202]. For economical, biological and computational reasons, a subset of tertiary structures, the RNA secondary structures [36, 50], draws researchers' attention.
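An RNA secondary structure $(S, P)$ consists of a sequence $S \in \Sigma_{RNA}^*$ and a set of base-pairs $P = \{(i, j)\}$ such that $i, j \in [1, \ldots, |S|]$ and $i < j$. For all $(i, j), (i', j') \in P$ with $i \leq i'$, the following conditions hold:
1. $i = i' \Leftrightarrow j = j'$, and
2. if $i < i'$, then either
(a) $i < j < i' < j'$, i.e. $(i, j)$ precedes $(i', j')$, or
(b) $i < i' < j' < j$, i.e. $(i, j)$ encloses $(i', j')$.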
$(S, P)$ is a tertiary structure if Condition 1 is satisfied. Figure 2.1 shows an example of the primary, secondary, and tertiary structure of an RNA molecule.
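The two conditions can be checked mechanically. The sketch below assumes the usual reading of the definition: Condition 1 requires that no two base-pairs share a 5' position or a 3' position, and Condition 2 requires that any two base-pairs are either disjoint or nested, which excludes crossing pairs. The function names are hypothetical, and base-pairs are 1-based tuples `(i, j)` with `i < j`.

```python
from typing import Set, Tuple

BasePair = Tuple[int, int]


def satisfies_condition_1(length: int, pairs: Set[BasePair]) -> bool:
    """Valid positions, i < j, and i = i' exactly when j = j'."""
    if any(not (1 <= i < j <= length) for i, j in pairs):
        return False
    five_prime = [i for i, _ in pairs]
    three_prime = [j for _, j in pairs]
    return (len(set(five_prime)) == len(five_prime)
            and len(set(three_prime)) == len(three_prime))


def is_secondary_structure(length: int, pairs: Set[BasePair]) -> bool:
    """Condition 1 plus: two base-pairs either follow each other or are nested."""
    if not satisfies_condition_1(length, pairs):
        return False
    ordered = sorted(pairs)
    for index, (i, j) in enumerate(ordered):
        for i2, j2 in ordered[index + 1:]:
            # i < i2 holds here because all 5' positions are distinct
            if not (j < i2 or j2 < j):
                return False
    return True


nested = {(1, 6), (2, 5)}    # corresponds to ((..))
crossing = {(1, 3), (2, 4)}  # a pseudo-knot-like crossing
print(is_secondary_structure(6, nested))    # True
print(satisfies_condition_1(4, crossing))   # True
print(is_secondary_structure(4, crossing))  # False
```

The pairwise check is quadratic in the number of base-pairs, which is adequate for a validity test. A stack-based scan would achieve the same in linear time.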
Pseudo-knotted structures, which consider certain kinds of tertiary interactions, are an intermediate between secondary and tertiary structures. This is an emerging field, but there is still a lack of algorithms and bioinformatics tools that handle pseudo-knotted structures efficiently.
2.3 Representation and Visualization of RNA Structures
Understanding the macromolecular structure of an RNA molecule and its relation to function still requires expert knowledge and intuition from biologists. Visualization of RNA structures is a prerequisite for this task. The topology of an RNA molecule is relevant to classify RNA structures or to search for structurally homologous RNA molecules. This typically involves the visualization of secondary and pseudo-knotted structures. A visualization of tertiary structures, based on the relative position of atoms obtained by NMR spectroscopy or X-ray diffraction, can give insights into macromolecular mechanisms.

The most common and biologically informative drawing of RNA secondary structures is a 2d-plot, sometimes referred to as squiggle-plot. Embedded in a plane, paired bases are drawn adjacent to each other. Base-pair bonds and the backbone of an RNA molecule are indicated as lines that ideally do not intersect. Several layout algorithms that generate 2d-plots have been proposed in [14, 109, 143, 179, 234]. The RNAViz [33, 34] software allows manual fine-tuning of drawings for producing publication-quality secondary structure drawings, e.g. the display of structural elements such as pseudo-knots or unformatted areas is possible. RNA_d2 [153], RNAdraw [132] and XRNA [233] are alternative tools within this scope. Recently, a layout algorithm for pseudo-knotted structures that produces non-overlapping drawings was proposed, which is implemented in the tool Pseudoviewer [72, 74]. The visualization of the three-dimensional structure of an RNA molecule belongs to the general field of three-dimensional macromolecule visualization. Besides freely distributed software like RasMol [175], there are many commercial tools that offer visualization of macromolecules in the framework of drug discovery.

[Figure 2.1, panel A (primary structure): 5' GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCCUGUGT.CGAUCCACAGAAUUCGCACCA 3'. Panels B (secondary structure) and C (tertiary structure) label the acceptor stem, D loop, anticodon loop, variable loop and TΨC loop.]

Figure 2.1: [54] Primary, secondary and tertiary structures of yeast phenylalanine tRNA. A: The sequence was obtained from The Genomic tRNA Database [116, 117]. B: The secondary structure was inferred from an alignment of yeast tRNA-PHE sequences by RNAalifold [82]; circled bases indicate neutral mutations with respect to the displayed secondary structure. Pseudo-knots and non-canonical base-pairs are indicated with a dashed line connecting squared bases [188]. C: A cartoon representation of tRNA tertiary structure, based upon tertiary structures obtained from the Protein Data Bank (IDs 6TNA, 1EHZ) [99, 182].

Although 2d-plots are pleasant to read, it is difficult to compare them or to extract topological information. The dome-, circle- and mountain-plot address this problem. In a dome-plot, base-pair bonds are drawn as arcs above the sequence, which is drawn as a straight line. In a circle-plot, the sequence is arranged as a circle and chords inside the circle connect base-pairs [151]. The mountain-plot draws the mountain-function of an RNA secondary structure, which intuitively assigns to each nucleotide the number of base-pairs that enclose it [87]. Formally, we define the mountain-function for an RNA secondary structure $(S, P)$ as follows:

$$
h(0) = 0, \qquad
h(i) = \begin{cases}
h(i-1) + 1 & \text{if } (i, j) \in P \text{ for some } j \in [i, |S|] \\
h(i-1) - 1 & \text{if } (j, i) \in P \text{ for some } j \in [1, |S|] \\
h(i-1) & \text{otherwise}
\end{cases}
\quad \text{where } i > 0 \tag{2.2}
$$

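Equation (2.2) translates directly into a single left-to-right pass. The sketch below assumes that the first case applies when position $i$ is the 5' base of a pair and the second case when $i$ is the 3' base of a pair. The function name `mountain_function` is hypothetical, and base-pairs are 1-based tuples `(i, j)` with `i < j`.

```python
from typing import List, Set, Tuple


def mountain_function(length: int, pairs: Set[Tuple[int, int]]) -> List[int]:
    """Return [h(0), h(1), ..., h(length)] for a structure with the given base-pairs."""
    opening = {i for i, _ in pairs}  # 5' bases of base-pairs
    closing = {j for _, j in pairs}  # 3' bases of base-pairs
    h = [0] * (length + 1)           # h[0] = 0
    for i in range(1, length + 1):
        if i in opening:
            h[i] = h[i - 1] + 1
        elif i in closing:
            h[i] = h[i - 1] - 1
        else:
            h[i] = h[i - 1]
    return h


# A hairpin of length 6 with base-pairs (1, 6) and (2, 5)
print(mountain_function(6, {(1, 6), (2, 5)}))  # [0, 1, 2, 2, 2, 1, 0]
```

The unpaired bases of the hairpin loop form the flat top of the mountain, and the function returns to zero once every opened base-pair has been closed.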
A more technical representation is given by RNA secondary structure strings, which are referred to as Vienna strings because of their extensive use in the Vienna RNA Package [84]. Vienna strings are sequences where, in order of the primary structure, the characters ( and ) denote the 5' and 3' bases of a base-pair, respectively, while . denotes an unpaired base. In addition, a second string can hold the primary structure information. Vienna strings and Zuker-CT files of the mfold software [244] are the most common formats to electronically store RNA secondary structures. In the era of web services, RNAML is a proposal for an XML-based standard designed for the transmission of information among the RNA community [227]. An example of RNA secondary structure drawings and representations is given in Figure 2.2.
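A Vienna string can be converted into a set of base-pairs with a stack of open positions. The following is a minimal sketch; the function name `parse_vienna` is hypothetical and is not part of the Vienna RNA Package. Positions are 1-based, and any character other than `(`, `)` and `.` is rejected.

```python
from typing import List, Set, Tuple


def parse_vienna(structure: str) -> Set[Tuple[int, int]]:
    """Return the set of base-pairs (i, j), i < j, encoded by a Vienna string."""
    open_positions: List[int] = []
    pairs: Set[Tuple[int, int]] = set()
    for position, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.add((open_positions.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return pairs


structure = ".(((.((((((.....))))))....((((.((((((.......)))))).))))..)))."
pairs = parse_vienna(structure)
print(len(structure), len(pairs))  # 61 19
```

Because every closing bracket is matched with the most recently opened one, the resulting base-pairs are always nested or disjoint, which is why this notation cannot express pseudo-knots.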
[Figure 2.2, panels:
(a) Vienna string:
UGGAAGAAGCUCUGGCAGCUUUUUAAGCGUUUAUAUAAGAGUUAUAUAUAUGCGCGUUCCA
.(((.((((((.....))))))....((((.((((((.......)))))).))))..))).
(b) 2d-plot generated by RNAplot from the Vienna RNA Package [84].
(c) dome-plot.
(d) circle-plot generated by the mfold server [244].
(e) mountain-plot. Hairpin loops appear as flat tops, interior loops and bulges as intermediate plateaus, helices as sloping hillsides, and branching regions as valleys.]

Figure 2.2: Visualization of a secondary structure for the Nanos 3' UTR translation control element taken from the Rfam database [67] (Id: RF00161, EMBL Id: U24695.1).

These visualizations show a single structure of an RNA sequence. Dot-plots visualize the structure space of an RNA sequence with the potential to reveal suboptimal structures that are biologically relevant. Arranged in a matrix, the probabilities of base-pairs are plotted as dots whose diameter is proportional to their probability in the structure space. The base-pair frequency information has subsequently been included in single-structure visualizations, and likely base-pairs can be distinguished from unlikely base-pairs by a color gradient or some other indicator [245]. See Figure 2.3 for an example of a dot-plot and an annotated 2d-plot. RNAmovies is an interactive software for the visualization of secondary structure spaces [57]. It automatically generates animated 2d-plots where structures are morphed to explore the structure space of an RNA molecule.

From the viewpoint of computer scientists, RNA secondary structures are often represented as trees or forests. The parent and sibling relationships of nodes are determined by the nesting of base-pair bonds. The 5' to 3' nature of an RNA molecule imposes the order among sibling nodes. This produces a forest structure in general, but a virtual root node can always turn a forest into a tree. Different tree representations that vary in their resolution have been proposed. A tree structure where base-pairs correspond to internal nodes while unpaired bases correspond to leaves was proposed in [173]. I refer to it as the natural tree representation. A coarse-grained tree representation where nodes correspond to the structural components (stacking regions, hairpins, bulges, internal loops and multiloops) was proposed in [110, 178, 180]. Parse trees of grammar-based prediction strategies for RNA secondary structures represent the structure such that the sequence information corresponds to the preorder sequence of leaves while the internal nodes correspond to productions of the grammar [167]. An example of tree representations of RNA structures is shown in Figure 2.4.
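The natural tree representation can be built in one pass over a sequence and its Vienna string. The sketch below is an illustration under stated assumptions rather than the construction used in [173]: the names `Node` and `natural_tree` are hypothetical, a base-pair node is labeled with its 5' base followed by its 3' base, and the virtual root is labeled `v`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def natural_tree(sequence: str, structure: str) -> Node:
    """Base-pairs become internal nodes, unpaired bases become leaves."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    root = Node("v")  # virtual root that turns the forest into a tree
    stack = [root]    # path from the root to the innermost open base-pair
    for base, symbol in zip(sequence, structure):
        if symbol == "(":
            node = Node(base)            # 5' base; the 3' base is added on ")"
            stack[-1].children.append(node)
            stack.append(node)
        elif symbol == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure string")
            stack.pop().label += base    # complete the base-pair label
        elif symbol == ".":
            stack[-1].children.append(Node(base))
        else:
            raise ValueError(f"unexpected symbol {symbol!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced structure string")
    return root


tree = natural_tree("GCAAAGCU", "((...)).")
print([child.label for child in tree.children])           # ['GC', 'U']
print(tree.children[0].children[0].label)                 # CG
print([leaf.label for leaf in tree.children[0].children[0].children])  # ['A', 'A', 'A']
```

Children are appended in 5' to 3' order, so the sibling order of the tree follows the order of the primary structure.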
[Figure 2.3, panels: (a) dot-plot of base-pair probabilities for the sequence U24695; (b) 2d-plot of the structure annotated with base-pair probabilities.]

Figure 2.3: (a) shows the base-pair probabilities as predicted by RNAfold [84]. The lower triangle shows only the bases included in the minimum free energy structure, and the upper triangle contains the full base-pair probabilities, where the diameter of a square is proportional to the probability of the corresponding base-pair. (b) shows the 2d-plot of the structure annotated with the probabilities of the base-pairs. The colors range from blue to red, corresponding to less frequent and highly frequent base-pairs.

[Figure 2.4, panels: (a) secondary structure with colored components; (b) natural tree representation with virtual root v; (c) coarse-grained tree representation; (d) simplified parse tree with virtual root v.]

Figure 2.4: (a) shows a secondary structure with colored components that indicate the relation between the representations. (b) shows the natural tree representation where internal nodes correspond to base-pairs and leaves correspond to unpaired bases. (c) shows the coarse-grained tree representation. The red and cyan parts are stacking regions (S), the green part is a multiloop (M), the yellow part is an internal loop (I), and the blue and magenta parts are hairpin loops (H). A bulge left (L) and a bulge right (R) are internal loops that have only a left and right unpaired region, respectively. Note that single-stranded regions at the root level of the tree and in multi-loops are omitted in this tree representation. (d) shows a simplified parse tree for some grammar describing RNA secondary structures. The internal nodes correspond to productions of the grammar and impose a structure on the sequence that resides at the leaves. A virtual root node v is added in (b) and (d) to guarantee a tree structure.

2.4 RNA Secondary Structure Prediction
The structure of an RNA molecule can be crucial for its function (see Section
1.1). Accordingly, the automatic prediction of RNA structures from sequence
information is an important problem. Today, there are two prediction strate-
gies:
Thermodynamic approaches: The conformation of paired and unpaired regions in an RNA structure can be associated with an energy value. Given some energy model, thermodynamic approaches find the energetically most stable structures among all possible secondary structures of an RNA sequence. Such a structure is called the minimum free energy (mfe) structure.
Comparative approaches: In functional non-coding RNA, the structure of an RNA is conserved during evolution. Since a base-pair can be formed by different combinations of nucleotides, different sequences can have the same or a similar structure. If a family of structurally homologous RNA molecules has a sufficient amount of sequence conservation, a multiple sequence alignment can emphasize regions of sequence variation. The regions containing structure-neutral mutations, denoted as compensatory base changes, give clues to the structure of an RNA molecule.
In 1978, Nussinov et al. introduced a first folding algorithm requiring a single sequence as input [151]. They determine the structure that maximizes the number of possible base-pairs for an RNA sequence. This problem is also known as the maximum circular matching problem. The incorporation of thermodynamics in this model assumes that the energy contribution of each base-pair is independent of adjacent base-pairs in the structure. This assumption is not realistic since the stability of RNA molecules is based on the stacking of base-pairs. Zuker & Stiegler proposed a dynamic programming algorithm that calculates the minimum free energy structure based on a model that considers base-pair stacking and destabilizing loops [247]. Their algorithm uses thermodynamic parameters of Tinoco et al. [201]. The energy model and parameters were refined in [129, 209]. McCaskill introduced a statistically motivated model based on Boltzmann's distribution and thermodynamic parameters, the partition function [133]. The most likely structure under this model is the mfe structure. The main contribution of McCaskill is the computation of probabilities for the individual base-pairs. Sakakibara et al. and Eddy & Durbin invented a generalization of hidden Markov models, the stochastic context-free grammars, and formulated the RNA secondary structure prediction problem in this context [42, 167, 168].
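The base-pair maximization idea can be sketched as a dynamic program over subsequences. The code below is a textbook-style formulation and not a reproduction of the algorithm in [151]. The function name `max_base_pairs` and the parameter `min_loop` (minimum number of unpaired bases enclosed by a base-pair, 0 by default) are hypothetical, and only the canonical base-pairs GC, AU and GU are allowed to pair.

```python
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def max_base_pairs(sequence: str, min_loop: int = 0) -> int:
    """Maximum number of nested canonical base-pairs that the sequence can form."""
    n = len(sequence)
    if n == 0:
        return 0
    # best[i][j] = maximum number of base-pairs within sequence[i..j] (0-based, inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: position j pairs with k
                if (sequence[k], sequence[j]) in CANONICAL:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    value = max(value, left + 1 + inner)
            best[i][j] = value
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))           # 3
print(max_base_pairs("GCGC"))                # 2
print(max_base_pairs("GC", min_loop=3))      # 0
```

The table has $O(n^2)$ entries and each entry considers $O(n)$ pairing partners, which gives a running time of $O(n^3)$. A traceback over the same table would recover one structure that attains the maximum.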
Thermodynamic folding relies on parameters that were measured in vitro under fixed conditions, which is a simplification of real conditions. The folding in vivo takes place in a dynamic and hence more complex environment. Because of the inaccuracy of energy parameters (and even of the model itself), it is possible that the mfe structure is not the biologically correct one. The biologically relevant prediction is often a suboptimal solution that has an energy close to the mfe structure. Thus, the generation of suboptimal structures is important for the practical impact of prediction algorithms based on thermodynamics [243, 246]. The assumption of equilibrium folding pathways is another common simplification of thermodynamic folding models. Studies of the folding of the Tetrahymena group I intron gave insights into the complexity of the folding process [8, 208]. It has been observed that RNA can fold during transcription, that the folding process happens on a wide range of time scales, and that ions and macromolecules guide the folding. Thus, the kinetics of RNA folding are important to understand the true folding pathway. Models and algorithms for kinetic folding prediction are provided in [49, 70, 138, 232]. A further challenge for mfe folding algorithms are RNA secondary structures that are known to have two conformations depending on some environmental parameters, known as RNA switches [126]. Recently, Giegerich et al. provided a structure prediction algorithm based on thermodynamics that compartmentalizes the suboptimal solution space into different shapes [59]. The different shapes of an RNA molecule give a compact overview of the structure space and are useful to find the biologically relevant prediction or to detect different conformational states.
The most popular tools for energy-based RNA secondary structure prediction from single sequences are mfold [243, 244] and RNAfold [79, 84]. The former implements the mfe algorithm and the latter additionally implements McCaskill's partition function algorithm. Recently, the energy-based prediction of pseudoknotted structures has received more attention [35, 158, 161].
Comparative approaches require a set of homologous RNA sequences that putatively have a similar structure. The general idea is to exploit the covariance that is expected to occur in aligned stem regions. Until the early 1980s, the structural inference from homologous RNA sequences had been hand-crafted. Noller & Woese described a procedure to detect compensatory changes in helical elements [147]. An algorithm building upon this strategy was provided by Waterman [224, 225]. Given a multiple sequence alignment, the mutual information content and sequence covariation are measures that help to automatically identify conserved stem regions [26, 229]. These purely phylogenetic approaches assume that the sequences, in fact, share a common structure, which requires a careful choice of sequences. A combination of phylogenetic information and thermodynamics can further improve the results. A multiple sequence alignment is used to validate predicted structures in [81, 112, 121]. Conversely, Han & Kim resolve ambiguities in the alignment by thermodynamics [73]. As an extension of the minimum free energy approach, RNAalifold [83] calculates the best folding using an objective function that combines energy contributions and covariance. Ruan et al.'s ILM (iterated loop matching) optimizes a similar objective function [166]. As the name suggests, the structure is iteratively constructed by adding non-conflicting stem regions. ILM is capable of returning pseudoknotted RNA structures. Knudsen & Hein predict a common RNA secondary structure by stochastic context-free grammars, implemented in the tool Pfold [104].
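To illustrate how covariation between two alignment columns can be quantified, the sketch below computes the mutual information of two columns using the standard information-theoretic definition $M_{ij} = \sum_{x,y} f_{ij}(x,y) \log_2 \frac{f_{ij}(x,y)}{f_i(x) f_j(y)}$. This formula is an assumption of the sketch and is not quoted from [26, 229]. The function name `mutual_information` is hypothetical, columns are 0-based, and gap characters are treated like ordinary symbols.

```python
from collections import Counter
from math import log2
from typing import Sequence


def mutual_information(alignment: Sequence[str], i: int, j: int) -> float:
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    column_i = Counter(row[i] for row in alignment)        # single-column counts
    column_j = Counter(row[j] for row in alignment)
    joint = Counter((row[i], row[j]) for row in alignment)  # pair counts
    total = 0.0
    for (x, y), count in joint.items():
        f_xy = count / n
        total += f_xy * log2(f_xy / ((column_i[x] / n) * (column_j[y] / n)))
    return total


# Columns 0 and 4 covary perfectly (GC, CG, AU, UA); columns 1 and 2 are conserved.
alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
print(mutual_information(alignment, 0, 4))  # 2.0
print(mutual_information(alignment, 1, 2))  # 0.0
```

Fully conserved columns carry no mutual information, which is one reason why such measures need sufficient sequence variation to detect stem regions.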
Sankoff opened a branch of comparative strategies considering the alignment and folding problem simultaneously [172]. The time complexity of Sankoff's algorithm is $O(n^6)$ where $n$ is the length of the RNA sequences. This is too high to be practical even for two sequences. Mathews & Turner's DYNALIGN restricts the maximum distance of possible base-pairs to bound the parameters that affect time complexity in Sankoff's algorithm [130, 131]. Gorodkin et al.'s FOLDALIGN implements a modification of Sankoff's algorithm that does not allow branching structures, which reduces the time complexity. Tabaska & Stormo used a graph-theoretic approach, maximum weight matching, to infer RNA secondary structures from different sources [189, 190]. They consider a set of base-pairing scores that can be derived from a range of sources, such as free energy considerations, mutual information, and experimental data. Hofacker et al. provide a strategy that is based on aligning base-pair probability matrices predicted by McCaskill's partition function algorithm [80]. Their algorithm is implemented in the tool pmmulti in the Vienna RNA package.
According to Gardner & Giegerich, approaches that use phylogenetic information yield significantly better predictions than purely thermodynamic approaches [55]. However, the quality of the multiple sequence alignment that should reveal the phylogeny depends on the degree of sequence homology of the RNA molecules. The minimum homology that is necessary depends on the particular prediction strategy, i.e. the sources of information that are used to predict structures. Moreover, phylogenetic approaches require a large number of sequences, which are rarely available.

For families of RNA molecules with low sequence conservation, a strategy that was proposed by Shapiro and Konings & Hogeweg more than a decade ago is currently being revitalized [105, 180]: First, structures are predicted based on thermodynamics, and then a structural alignment, instead of a sequence alignment, is done. Recent progress in structural comparison models and algorithms makes this strategy a promising candidate for RNA molecules with low sequence homology but (putative) structural homology. In particular, this strategy requires a model for structurally aligning multiple RNA secondary structures. I will provide a structure prediction strategy based on multiple structure alignment in Chapter 5.
2.5 RNA Structure Comparison
The field of RNA structure comparison emerged with the invention of RNA secondary structure prediction algorithms. Since then, the resulting pool of predicted structures, be they right or wrong, has been available for analyzing structural properties. The prediction of structural motifs, the inference of a taxonomy based on structural similarity instead of sequence similarity, and the prediction of consensus structures for a set of functionally related RNA molecules are active research topics that involve the comparison of RNA structures. I distinguish the following approaches to compare RNA structures:
Base-pair distances: Base-pair distances are classical mathematical metrics that operate on the base-pair sets of RNA structures.
Sequence alignment: RNA secondary structures are represented as strings that in turn are compared in the sequence alignment model.
Edit distances between ordered rooted trees: Since an RNA secondary structure can be represented as a tree, distances on trees can be applied to compare RNA secondary structures.
Arc annotated sequences: Approaches based purely on sequence alignment are extended to incorporate structural constraints that are induced by the structure of RNA. Constrained sequence edit models are generally studied in the context of arc annotated sequences.
Graphs: Graphs can express any sort of RNA structure. Algorithms for the classification of graphs are applied to RNA structure analysis.
Distance and Similarity. The result of a comparison of RNA structures can be quantified in two different ways: The first is distance and the second is similarity. Distance measures satisfy the mathematical axioms of a metric (or at least of a pseudo-metric). A similarity measure assigns a numeric value to some pairs of structures such that the larger the value, the more similar the structures are. Distances are non-negative and the distance between two structures is zero if and only if the structures are equal. In contrast, the similarity of equal structures is an arbitrary positive number. Accordingly, a small distance corresponds to a large similarity.
In the following sections, I consider distance versions of models for RNA structure comparison. The corresponding similarity versions can be derived easily for distances that are based on optimization problems. For distance problems, optimal means minimal, while for similarity problems optimal means maximal. Throughout this section, $(S_1, P_1), (S_2, P_2), \ldots, (S_n, P_n)$ denote secondary structures.
2.5.1 Base-pair Distances
Base-pair distances are distance measures that are defined on the base-pair sets of RNA structures. An analysis of some properties of base-pair distances and their comparison with the tree edit distance is provided in [142].
Symmetric Set Difference
One of the simplest measures is defined by the size of the symmetric set difference, that is:

$$\delta_{SD}(P_1, P_2) = |(P_1 \setminus P_2) \cup (P_2 \setminus P_1)| \tag{2.3}$$

Clearly, this simple measure is sensitive to the exact position of base-pairs
and is therefore not suitable to compare structures of dierent length. Also
if the structures have the same length, the measure is sensitive for shifted
structures. Consider the following structures:
26 Introductory Material
P
1
= ............(((.....))).
P
2
= ...........(((.....)))..
Intuitively, these structures should obtain a distance close to zero, but
SD
(P
1
, P
2
) = 6 since there is no common base-pair. This discrepancy gets the
larger the larger the shifted structures are. Still, for suboptimal structures
of the same sequence,
SD
can be a useful ad hoc distance.
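The shifted-helix example can be checked with a few lines of Python. This is a minimal sketch, assuming structures are given as dot-bracket strings with 1-based positions; the helper names `base_pairs` and `symmetric_difference_distance` are illustrative and not taken from the cited literature.

```python
def base_pairs(dot_bracket):
    """Return the set of base-pairs (i, j), 1-based, of a dot-bracket string."""
    stack, pairs = [], set()
    for pos, char in enumerate(dot_bracket, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def symmetric_difference_distance(p1, p2):
    """Count the base-pairs that occur in exactly one of the two structures."""
    return len(base_pairs(p1) ^ base_pairs(p2))


p1 = "............(((.....)))."
p2 = "...........(((.....))).."
print(symmetric_difference_distance(p1, p2))  # 6
```

The two helices share no base-pair, so all three pairs of each structure are counted.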
Hausdorff Distance
A more flexible metric is the Hausdorff distance which was applied by Zuker
to filter out similar suboptimal foldings in the original mfold program [243].
The Hausdorff distance measures the distance between non-empty point sets
of some metric space. For the problem of RNA structure comparison, these
are the sets of base-pairs. Intuitively, the Hausdorff distance between
structures $P_1$ and $P_2$ is the maximum of the distances between all nearest
base-pairs connecting $P_1$ and $P_2$. Formally, the distance between two base-pairs
$(i, j) \in P_1$ and $(i', j') \in P_2$ is defined as
$\delta((i, j), (i', j')) = \max\{|i - i'|, |j - j'|\}$.
The distance of a base-pair to a set of base-pairs is defined as
$\delta((i, j), P) = \inf_{(i', j') \in P} \delta((i, j), (i', j'))$. The Hausdorff distance is then
$\delta_H(P_1, P_2) = \max(\delta_{asym}(P_1, P_2), \delta_{asym}(P_2, P_1))$ where   (2.4)
$\delta_{asym}(P_1, P_2) = \sup_{(i, j) \in P_1} \delta((i, j), P_2)$
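A minimal Python sketch of this definition follows, assuming dot-bracket input with 1-based positions and the max-based base-pair distance stated above. The function names are illustrative. For finite base-pair sets, inf and sup reduce to min and max.

```python
def base_pairs(dot_bracket):
    """Return the set of base-pairs (i, j), 1-based, of a dot-bracket string."""
    stack, pairs = [], set()
    for pos, char in enumerate(dot_bracket, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def pair_distance(bp1, bp2):
    """Distance between two base-pairs: the larger of the two end-point shifts."""
    (i, j), (k, l) = bp1, bp2
    return max(abs(i - k), abs(j - l))


def asymmetric_distance(pairs1, pairs2):
    """Largest distance from a base-pair in pairs1 to its nearest pair in pairs2."""
    return max(min(pair_distance(bp, other) for other in pairs2) for bp in pairs1)


def hausdorff_distance(p1, p2):
    pairs1, pairs2 = base_pairs(p1), base_pairs(p2)
    if not pairs1 or not pairs2:
        raise ValueError("the Hausdorff distance needs non-empty base-pair sets")
    return max(asymmetric_distance(pairs1, pairs2),
               asymmetric_distance(pairs2, pairs1))


# The helix shifted by one position from the symmetric set difference example.
shifted_a = "............(((.....)))."
shifted_b = "...........(((.....))).."
print(hausdorff_distance(shifted_a, shifted_b))  # 1
```

Every base-pair of the shifted helix has a partner one position away, so the distance is 1, whereas the symmetric set difference of the same two structures is 6.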
Although this distance behaves reasonably for structure shifts, the distance
between structures that differ in only one base-pair depends on the position
of this base-pair. Consider the following structures:
$P_1$ = ...........(((.....)))..
$P_2$ = (...)......(((.....)))..
$P_3$ = ....(...)..(((.....)))..
$P_2$ and $P_3$ are both one base-pair apart from $P_1$, but their Hausdorff distances
differ, i.e. $\delta_H(P_1, P_2) = 11$ and $\delta_H(P_1, P_3) = 7$. Thus, isolated base-pairs
can lead to high distance values depending on the distance to the next base-pair.
Aware of this problem, Zuker et al. defined a variant of $\delta_H$ that ignores
up to d bases to obtain a distance d [246]. This variant is a pseudo-metric,
since the triangle inequality is not satisfied as exemplified in [142].
Mountain Metric
Another application of a classical mathematical metric to RNA structures
is the mountain metric, which is based on the $l_p$-norm of the difference of
two mountain functions $h_{P_1}$ and $h_{P_2}$ (see Equation (2.2)) of RNA secondary
structures of length n [142]:
$\delta^p_M(P_1, P_2) = \| h_{P_1} - h_{P_2} \|_p := \sqrt[p]{\sum_{i=1}^{n} |h_{P_1}(i) - h_{P_2}(i)|^p}$   (2.5)
For p = 2 this is the root mean square (RMS) distance between two functions
which is, followed by p = 1, the most frequent choice. This metric is more
flexible for shifted structures and isolated base-pairs and it can be computed
in linear time. A property of this distance that one must be aware of is that
the extension of stem regions does not have uniform costs. See the following
example:
$P_1$ = ..(((.....)))..
$P_2$ = .((((.....)))).
$P_3$ = ..((((...))))..
$P_1$ differs from $P_2$ and $P_3$ in just one base-pair but their mountain distances
(for simplicity I use p = 1) do not reflect that. In particular,
$\delta^1_M(P_1, P_2) = 13$ and $\delta^1_M(P_1, P_3) = 5$. See Figure 2.5 for an illustration. A variant of
the mountain distance that re-scales mountain functions for structures of
different length is proposed and applied in [44].
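The two values can be reproduced with a short Python sketch. It assumes that the mountain function at position i counts the base-pairs (k, l) with k <= i <= l. Equation (2.2) is not repeated here, so this convention is an assumption, chosen because it reproduces the numbers of the example. The function names are illustrative.

```python
def mountain(dot_bracket):
    """Mountain function: number of base-pairs enclosing each position.

    Both end points of a base-pair count as enclosed (assumed convention).
    """
    heights, depth = [], 0
    for char in dot_bracket:
        if char == "(":
            depth += 1
            heights.append(depth)
        elif char == ")":
            heights.append(depth)
            depth -= 1
        else:
            heights.append(depth)
    return heights


def mountain_distance(p1, p2, p=1):
    """l_p-norm of the difference of two mountain functions of equal length."""
    if len(p1) != len(p2):
        raise ValueError("structures must have the same length")
    total = sum(abs(a - b) ** p for a, b in zip(mountain(p1), mountain(p2)))
    return total ** (1.0 / p)


p1 = "..(((.....))).."
p2 = ".((((.....))))."
p3 = "..((((...)))).."
print(mountain_distance(p1, p2))  # 13.0
print(mountain_distance(p1, p3))  # 5.0
```

Extending the stem outwards raises the mountain over 13 positions, while extending it inwards affects only the 5 positions of the new innermost pair and its loop.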
[Figure 2.5 (plot over sequence position with curves for $P_2$, $P_1$ and $P_3$)]
Figure 2.5: The difference between $P_2$ and $P_1$ is larger than the difference between
$P_3$ and $P_1$, though both differ in exactly one base-pair.
2.5.2 Sequence Alignment
Shapiro and Konings & Hogeweg simultaneously proposed the idea of comparing
RNA secondary structures by well-established sequence alignment algorithms
[105, 180]. While Konings & Hogeweg focused on pairwise alignments,
Shapiro considered multiple sequence alignments. In both approaches, the
key idea is to use a string representation of RNA secondary structures, in the
flavor of the Vienna strings¹, which are the data structures that are further
analyzed.
Konings & Hogeweg's Encoding
Following Konings & Hogeweg, "A full linear representation is obtained by
transforming the mountain structure into a linear array of symbols representing
the direction of base-pairing at each of the single positions: upstream
pairing (>), downstream pairing (<) or single strandedness (+) . . . Extra
information in terms of secondary structure can be included in the linear
representation by distinct coding of hairpin loops (^) and other types of single
stranded positions (+)". In this representation, the secondary structure in
Figure 2.2 is written as:
+>>>+>>>>>>+++++<<<<<<^^^^>>>>+>>>>>>^^^^^^^<<<<<<+<<<<++<<<+
¹ The Vienna format was established later, building upon the results of Shapiro and
Hogeweg & Konings.
A potential disadvantage of this representation for a topological classification
is that basic secondary structure elements may be broken up in an alignment,
i.e. matching of individual parts of one helix to parts of two different helices,
not considering interruptions by internal loops and bulges.
Shapiro's Encoding
Shapiro introduced a different string representation that circumvents this
problem. The coarse grained tree representation of an RNA structure is
transformed to a string by a left-to-right preorder traversal of the tree,
putting subtrees into brackets. The components are encoded as single letters.
In this representation the structure in Figure 2.2 is:
(S(M((H)(S(I(H)))))
To simplify the notation brackets are removed for non-branching subtrees:
(S(M(H (S I H))))
For a topological classification, this coarse grained representation is suitable.
However, if the aim is an improved sequence alignment that incorporates
structural constraints, it should be possible to match individual parts of one
helix with two different parts of another helix. For instance, there could
exist a larger helix that was broken during evolution resulting in two smaller
helices that are separated by a bulge.
Beside these effects, both methods suffer from the same inherent problem:
A pair of brackets is not treated as a unit by a sequence alignment and
thus the tree nature of a secondary structure is not treated appropriately.
Consider the following structures:
$P_1$ = (((..(((....))))))
$P_2$ = (((......)))
The following alignment is among the optimal alignments given a scoring
scheme that favors matches over mismatches, insertions and deletions.
(((..(((....))))))
(((..---....)))---
The opening brackets ( are not aligned to their corresponding closing brackets
) and in terms of structure this alignment is not meaningful. Shapiro was
aware of such problems but appropriate, efficient algorithms for comparing
RNA secondary structures as trees were just about to emerge.
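The defect of the alignment above can be made explicit with a small check. The following Python sketch tests whether two aligned dot-bracket rows (with `-` as gap) agree on bracket partners. The consistency criterion and the names `column_pairs` and `respects_pairing` are assumptions made for illustration, not part of the cited methods.

```python
def column_pairs(row):
    """Bracket pairs of an aligned row, as pairs of alignment columns."""
    stack, pairs = [], set()
    for col, char in enumerate(row):
        if char == "(":
            stack.append(col)
        elif char == ")":
            pairs.add((stack.pop(), col))
    return pairs


def respects_pairing(row1, row2):
    """True if every bracket aligned to a bracket has its partner aligned too."""
    pairs1, pairs2 = column_pairs(row1), column_pairs(row2)

    def consistent(pairs, other_row, other_pairs):
        return all((i, j) in other_pairs
                   for i, j in pairs
                   if other_row[i] in "()" or other_row[j] in "()")

    return consistent(pairs1, row2, pairs2) and consistent(pairs2, row1, pairs1)


row1 = "(((..(((....))))))"
print(respects_pairing(row1, "(((..---....)))---"))  # False
print(respects_pairing(row1, "(((..---....---)))"))  # True
```

The second call uses an alternative alignment in which the three brackets of the shorter structure are aligned to the outer helix on both sides, so each base-pair is matched as a unit.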
2.5.3 Edit Distances between Rooted Ordered Trees
From the tree nature of RNA secondary structures, every distance measure
on trees can be applied to RNA secondary structures. Inspired by the sequence
edit distance [113, 171, 213], different edit models for trees have been
invented [95, 118, 177, 191, 198] which result in various algorithms. Beside
the fact that tree editing is a challenging theoretical problem dealing with a
fundamental data structure, this field was (and still is) driven by the need
for such algorithms in a broad spectrum of applications. This includes the
comparison of RNA secondary structures [25, 110, 111, 178], the analysis of
structured documents and text databases [18, 96, 127, 144, 159], script
recognition [22, 118], fingerprint recognition [139], image analysis [165, 169], the
analysis of parse trees [97, 235], the comparison of assembly rules [48], and
the identification of common structural fragments among chemical structures
[192]. The semantics of tree edit distances in the scope of RNA structure
comparison depend on the choice of the tree representation and the edit model.
A review of tree edit models that are particularly interesting for document
trees (but also for RNA secondary structures) was given in [7]. The
authors provide implementations of tree edit algorithms in the programming
language Turing [90]. A more recent survey on tree editing problems,
including unrooted, unordered variants, and different notions of tree editing,
was provided in [10, 11, 241]. The relation between tree-edit distances was
studied in [216] resulting in a hierarchy of edit-models.
In the world of sequences, the terms edit distance and alignment distance
are used synonymously. For each optimal sequence of edit operations,
an alignment achieving the same score can be constructed and vice versa.
However, on a conceptual level the models are different. While the edit
distance is an operational model of editing one sequence into another, an
alignment is a declarative model, a data structure rather than a process. In
the world of trees, these models turned out to be dual: The tree edit model
constructs a largest common subforest, while the tree alignment distance
constructs a smallest common supertree. Moreover, the higher complexity
of trees (in comparison to sequences) leads to a multitude of problems that
vary in the constraints that are imposed by the chosen model. The models
that are interesting for the comparison of RNA structures are introduced in
the following paragraphs, beginning with the most general model which is
successively restricted. Throughout this chapter, $T$, $T_1$, $T_2$ are trees unless
stated differently.
Tree Edit Distance
In the tree-to-tree correction problem [191], Tai introduced the generalization
of the string-to-string correction problem [213], which is also known as the edit
distance problem for strings. I refer to Tai's model as the tree edit model²,
following the mainstream of literature.
Edit Operations The edit operations relabel, delete and insert generalize
from strings to trees (and forests) as follows:
relabel: The label of a node v in T is changed. If a label is relabeled
by itself, this is denoted a match.
delete: Deleting node v in T means that the children of node v become
the children of the parent node of v. Moreover, if v has any siblings,
the deletion preserves the preorder relation of these nodes. Note that if v is
the root node, the result is the forest consisting of the children nodes
of v.
² The same model was also, independently, proposed by Lu [118]. However, Lu considered
an algorithm for a special case of the general tree edit distance.
[Figure 2.6 (tree drawings): $T_1 = a(b, c(d, e), f)$, $T_2 = a(b, x(d, e), f)$, $T_3 = a(b, d, e, f)$]
Figure 2.6: To simplify the illustration, a node and its label are identical. $T_1$
is transformed into $T_2$ by relabeling c with x, which in turn is transformed into
$T_3$ by deleting x. Note that the edit operations can be applied in both directions.
$T_2$ results from $T_3$ by inserting x as a child of node a whereas the nodes d and e
become the children of x.
insert: This operation is complementary to delete. Inserting a new
node $v$ as a child of some node $v'$ in $T$ results in a new tree $T'$, making $v$
the parent of a consecutive subsequence of the children of $v'$.
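The three operations can be sketched in Python on the trees of Figure 2.6. The representation is an assumption made for brevity: a tree is a nested tuple `(label, children)`, labels are unique so that a node can be addressed by its label, and only non-root nodes are deleted. The helper names are illustrative.

```python
def node(label, *children):
    """Build an ordered, labeled tree as a nested tuple (label, children)."""
    return (label, tuple(children))


def relabel(tree, old, new):
    """Change the label `old` to `new`."""
    label, children = tree
    return (new if label == old else label,
            tuple(relabel(child, old, new) for child in children))


def delete(tree, target):
    """Delete the non-root node `target`; its children move up in place."""
    label, children = tree
    new_children = []
    for child in children:
        if child[0] == target:
            new_children.extend(child[1])   # children keep their order
        else:
            new_children.append(delete(child, target))
    return (label, tuple(new_children))


def insert(tree, parent, new, first, count):
    """Insert `new` below `parent`, adopting `count` consecutive children
    of `parent` starting at child index `first`."""
    label, children = tree
    children = tuple(insert(c, parent, new, first, count) for c in children)
    if label == parent:
        adopted = children[first:first + count]
        children = children[:first] + ((new, adopted),) + children[first + count:]
    return (label, children)


t1 = node("a", node("b"), node("c", node("d"), node("e")), node("f"))
t2 = relabel(t1, "c", "x")          # a(b, x(d, e), f)
t3 = delete(t2, "x")                # a(b, d, e, f)
print(insert(t3, "a", "x", 1, 2) == t2)  # True
```

The last line illustrates that insert undoes delete when it adopts the same consecutive run of children.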
According to the sequence edit model, I represent edit operations by
[...] $T_n$, where $T_i$ results from the application of $e_i$ to $T_{i-1}$
for $i \in [1, n]$. Let $\delta$ be a metric defined on edit operations. The cost of
an edit-sequence E is the sum of the costs of its edit operations, that is:
$\delta(E) = \sum_{i=1}^{n} \delta(e_i)$, which is also a metric [240]. The edit distance $\delta_{TE}$
between trees $T_1$ and $T_2$ is the minimum cost that is necessary to transform $T_1$
into $T_2$:
$\delta_{TE}(T_1, T_2) = \min\{\delta(E) \mid E \text{ is an edit sequence transforming } T_1 \text{ into } T_2\}$.   (2.6)
Edit sequences are an intuitive, operational concept that accounts for the
differences between trees. However, the infinite number of edit sequences that
can transform one tree into another makes theoretical observations intricate.
Again inspired by the sequence edit model, Tai extended the concept of traces,
known from the sequence edit model [213], to trees, commonly referred to as
mappings.
Mappings A mapping establishes a one-to-one correspondence of nodes in
$T_1$ and $T_2$ which preserves the sibling and ancestor relation of nodes. Formally,
a mapping between trees $T_1$ and $T_2$ is defined by a triple $(M, T_1, T_2)$
where $M \subseteq V(T_1) \times V(T_2)$ such that for all $(v_1, w_1), (v_2, w_2) \in M$ the
following holds:
$v_1 = v_2$ iff $w_1 = w_2$ (one-to-one correspondence)
$v_1$ is ancestor of $v_2$ iff $w_1$ is ancestor of $w_2$ (ancestor preservation)
$pre_{T_1}(v_1) < pre_{T_1}(v_2)$ iff $pre_{T_2}(w_1) < pre_{T_2}(w_2)$ (sibling preservation)
Let $V(T_1) \setminus M$ and $V(T_2) \setminus M$ be the nodes in $T_1$ and $T_2$ that are not mapped
by M, respectively. The cost of a mapping is given by:
$\delta(M) = \sum_{(v, w) \in M} \delta(v \to w) + \sum_{v \in V(T_1) \setminus M} \delta(v \to \lambda) + \sum_{w \in V(T_2) \setminus M} \delta(\lambda \to w)$   (2.7)
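The three mapping conditions translate directly into a checker. The following Python sketch assumes trees as nested tuples `(label, children)` with unique labels, so that a mapping can be given as a set of label pairs. The names `index` and `is_mapping` are illustrative. The example trees are $a(x(b, c), d)$ and $a(b, y(c, d))$.

```python
from itertools import product


def node(label, *children):
    return (label, tuple(children))


def index(tree):
    """Return preorder numbers and ancestor sets, keyed by node label."""
    pre, anc = {}, {}

    def visit(subtree, ancestors):
        label, children = subtree
        pre[label] = len(pre)
        anc[label] = ancestors
        for child in children:
            visit(child, ancestors | {label})

    visit(tree, frozenset())
    return pre, anc


def is_mapping(m, t1, t2):
    """Check one-to-one correspondence, ancestor and sibling preservation."""
    pre1, anc1 = index(t1)
    pre2, anc2 = index(t2)
    for (v1, w1), (v2, w2) in product(m, repeat=2):
        if (v1 == v2) != (w1 == w2):
            return False
        if (v1 in anc1[v2]) != (w1 in anc2[w2]):
            return False
        if (pre1[v1] < pre1[v2]) != (pre2[w1] < pre2[w2]):
            return False
    return True


t1 = node("a", node("x", node("b"), node("c")), node("d"))
t2 = node("a", node("b"), node("y", node("c"), node("d")))
print(is_mapping({("a", "a"), ("b", "b"), ("c", "c"), ("d", "d")}, t1, t2))  # True
print(is_mapping({("x", "y"), ("b", "b")}, t1, t2))  # False
```

The second candidate fails ancestor preservation: x is an ancestor of b in the first tree, but y is not an ancestor of b in the second.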
The following lemma shows that mappings are equivalent to edit-sequences.
Lemma 2.1. Given an edit-sequence E transforming $T_1$ into $T_2$, there exists
a mapping M from $T_1$ to $T_2$ such that $\delta_{TE}(M) \leq \delta_{TE}(E)$. Conversely, for any
mapping M, there exists an edit-sequence E such that $\delta_{TE}(E) = \delta_{TE}(M)$.
Proof. See Proof of Lemma 2 in [240].
Hence, the edit distance between trees can be defined likewise by
$\delta_{TE}(T_1, T_2) = \min\{\delta(M) \mid M \text{ is a mapping from } T_1 \text{ to } T_2\}$.   (2.8)
Isomorphic Subforests A third definition of the edit distance between
trees is more related to graph theory. Forests $F_1$ and $F_2$ are isomorphic,
denoted by $F_1 \cong F_2$, if they can be transformed into each other simply by
applying the relabel-function. For isomorphic forests, there exists a
corresponding mapping $M_i$ including all nodes in $F_1$ and $F_2$. Such a mapping $M_i$
is denoted an isomorphism. For some $D \subseteq V(T)$, $T \setminus D$ denotes the forest
that results from applying the delete-function to all nodes in D to T. This
definition, allowing isomorphic subforests instead of isomorphic subtrees, is
important since a valid mapping between trees can correspond to an isomorphic
subforest. The edit distance between $T_1$ and $T_2$ can then be defined
as
$\delta_{TE}(T_1, T_2) = \min\{\delta_{TE}(M_i) + \sum_{v \in D_1} \delta(v \to \lambda) + \sum_{w \in D_2} \delta(\lambda \to w) \mid D_1 \subseteq V(T_1), D_2 \subseteq V(T_2) \text{ such that } T_1 \setminus D_1 \cong T_2 \setminus D_2\}$.   (2.9)
It is obvious that this definition is equivalent to the definition of a mapping
(2.8) and, consequently, to the edit sequence based definition. Figure
2.7 shows an example of a mapping and the correspondence to isomorphic
subforests.
[Figure 2.7 (tree drawings): $T_1 = a(x(b, c), d)$, $T_2 = a(b, y(c, d))$, $T_3 = a(b, c, d)$]
Figure 2.7: The dashed lines indicate the mapping $M = \{(a, a), (b, b), (c, c), (d, d)\}$
of $T_1$ and $T_2$. $T_3$ shows the maximum isomorphic
subforest (here a tree) that is obtained by deleting node x in $T_1$ and node y in
$T_2$. The edit sequence $x \to \lambda$, $\lambda \to y$ together with the sequence of trees $T_1, T_3, T_2$
determines the corresponding edit process.
Algorithms Algorithms that calculate the tree edit distance generally build
upon the mapping concept since the number of mappings for given trees
is finite. The first proposed algorithm is due to Tai and requires
$O(|T_1| \cdot |T_2| \cdot leaves(T_1)^2 \cdot leaves(T_2)^2)$ time and space. It follows the strategy of
extending mappings from the root of a tree to its leaves. A faster and much
simpler algorithm is due to Zhang & Shasha (Zhang-Shasha Algorithm) and
improves the time complexity to
$O(|T_1| \cdot |T_2| \cdot \min\{leaves(T_1), depth(T_1)\} \cdot \min\{leaves(T_2), depth(T_2)\})$
and the space complexity to $O(|T_1| \cdot |T_2|)$ [240].
In the worst case, which is a tree that grows linearly in the number of leaves
and its depth, the time complexity is in $O(|T_1|^2 \cdot |T_2|^2)$. Special algorithms
for the tree edit distance under a unit cost scheme are studied in [181]. The
parallelization of tree edit algorithms is considered in [237, 239]. The average
runtime of the Zhang-Shasha Algorithm for RNA secondary structure trees
turned out to be $O(|T_1|^{3/2} \cdot |T_2|^{3/2})$, which essentially means that it is cubic
[39]. Klein improved the worst case runtime of the tree edit algorithm to
$O(|T_1|^2 \cdot |T_2| \cdot \log |T_2|)$ by applying a divide and conquer strategy (Klein's
Algorithm) [102]. An analysis of the Zhang-Shasha Algorithm and Klein's
Algorithm in a general framework of cover strategies is given by Dulucq
& Touzet [40]. Moreover, they present an improvement of Klein's strategy
which can result in a better practical runtime. A different strategy is followed
by Chen: the tree edit problem is reduced to a matrix multiplication
problem and is solved by using results in this field [21]. This algorithm runs
in $O(|T_1| \cdot |T_2| + \min\{leaves(T_1)^2 \cdot |T_2| + leaves(T_1)^{2.5} \cdot |T_2|,\ leaves(T_2)^2 \cdot |T_1| + leaves(T_2)^{2.5} \cdot |T_1|\})$
and improves the time complexity for certain kinds of
trees in comparison to Klein's algorithm, e.g. if one of $T_1$ and $T_2$ is thin and
deep.
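For small trees, the tree edit distance can be computed with a plain memoized recursion over rightmost roots of forests. The sketch below is not one of the algorithms cited above and does not attain their complexity bounds. It assumes unit costs and nested-tuple trees `(label, children)`. The names are illustrative. The example uses the trees $a(x(b, c), d)$, $a(b, y(c, d))$ and $a(b, c, d)$.

```python
from functools import lru_cache


def node(label, *children):
    return (label, tuple(children))


def forest_size(forest):
    """Number of nodes in an ordered forest (a tuple of trees)."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return forest_size(f2)              # insert everything
    if not f2:
        return forest_size(f1)              # delete everything
    (a, kids1), (b, kids2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + kids1, f2) + 1,      # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + kids2) + 1,      # insert rightmost root of f2
        forest_distance(kids1, kids2)                  # map the two roots
        + forest_distance(f1[:-1], f2[:-1]) + (a != b),
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


t1 = node("a", node("x", node("b"), node("c")), node("d"))
t2 = node("a", node("b"), node("y", node("c"), node("d")))
t3 = node("a", node("b"), node("c"), node("d"))
print(tree_edit_distance(t1, t2))  # 2
print(tree_edit_distance(t1, t3))  # 1
print(tree_edit_distance(t3, t2))  # 1
```

The distance 2 between the first two trees corresponds to deleting x and inserting y.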
Variants Touzet gave a definition of gaps in a tree [207]. The idea is to
consider contiguous gaps as a single large gap where the term contiguous is
equivalent to our definition of a tree pattern. They study convex scoring
functions for gaps, that is: $gapscore(T_1 \circ T_2) \leq gapscore(T_1) + gapscore(T_2)$
where $T_1$ and $T_2$ are tree patterns and $T_1 \circ T_2$ means that $T_2$ is attached to
a leaf node of $T_1$. They proved that the calculation of the tree edit distance
with gaps for convex gap scores is an NP-hard problem.
Tree Alignment Distance
The tree alignment distance was introduced by Jiang et al. [95]. My central
notion is the following generic view of an alignment: An alignment
of two structures with labels from some alphabet $\Sigma$ is the same type of
structure with labels from the alignment alphabet [...]
$A \in T(\Sigma_\lambda^2)$ is an alignment of trees $T_1, T_2 \in T(\Sigma)$ iff
$T_1 = \pi(A|_1)$ and $T_2 = \pi(A|_2)$.   (2.11)
³ See the definition of $T \setminus D$ on Page 34.
[Figure 2.8 (tree drawings of $T_1$, $T_2$, the alignment $A$ and the projections $A|_1$ and $A|_2$)]
Figure 2.8: A is an alignment of $T_1$ and $T_2$.
Note that this definition forbids elements of $T(\Sigma_\lambda^2)$ [...]
$\sum_{v \in V(A)} \delta(label(v))$.   (2.12)
The alignment distance between $T_1$ and $T_2$ is the minimum cost that an
alignment of $T_1$ and $T_2$ can achieve. An alignment of $T_1$ and $T_2$ is optimal if
it achieves this score. Formally, the alignment distance $\delta_{TA}$ between trees $T_1$
and $T_2$ is defined as:
$\delta_{TA}(T_1, T_2) = \min\{\delta(A) \mid A \text{ is an alignment of } T_1 \text{ and } T_2\}$   (2.13)
For each alignment it is possible to construct a corresponding edit sequence
and a mapping. The converse does not hold in general: Consider the mapping
in Figure 2.7. In this mapping, nodes labeled with c are mapped to each
other. Thus, in a possible alignment there must exist a node labeled with
$(c, c)$. Then, this node must be the son of the nodes labeled with $(x, \lambda)$ and
$(\lambda, y)$. This is in contrast to the definition of a tree since a node can have at
most one parent node in a tree. From this observation, it is clear that tree
alignments form a subset of tree edit distance mappings. For trees $T_1$ and $T_2$,
$\delta_{TE}(T_1, T_2) \leq \delta_{TA}(T_1, T_2)$ holds.
Since the edit sequence definition is equivalent to the mapping definition,
it follows that not each edit sequence has a corresponding alignment. Jiang
et al. claimed that an alignment of trees actually corresponds to a restricted
tree edit in which all the insertions precede all the deletions [95]. This is
intuitive, but a formal proof is missing.
I now demonstrate that $\delta_{TA}$ does not satisfy the triangle inequality of
the metric axioms: An arbitrary edit sequence can be divided into two edit
sequences where one includes all insert-operations and the other all delete- and
relabel-operations. Assuming Jiang et al.'s claimed property of alignment
compatible edit sequences (see above), the divided edit sequences are
compatible with an alignment. From this and the fact that the tree edit distance
can be less than the tree alignment distance, it follows that $\delta_{TA}$ does not satisfy
the triangle inequality. Hence, the tree alignment distance is not a metric.
See Figure 2.9 for an example.
I am not aware of a constrained mapping definition in the literature that
corresponds to alignments.
Isomorphic Supertree A graph theoretical definition of the tree alignment
distance is based on tree isomorphisms. In this context, the minimum
possible distance between isomorphic trees that result from the insertion of
$\lambda$-labeled nodes in the original trees is sought. The forests that are
considered by this procedure are isomorphic supertrees. Nodes that are labeled
with $(\lambda, \lambda)$ should naturally score 0. Clearly, an overlay of such isomorphic
superforests and the deletion of possible $(\lambda, \lambda)$-labeled nodes produces an
alignment and, hence, the models define the same distance.
[Figure 2.9 (tree drawings): $T_1 = a(x(b, c), d)$, $T_2 = a(b, y(c, d))$, $T_3 = a(b, c, d)$; between $T_1$ and $T_2$: $\delta_{TA} = 4$, $\delta_{TE} = 2$; between $T_3$ and each of $T_1$ and $T_2$: $\delta_{TA} = 1$, $\delta_{TE} = 1$]
Figure 2.9: Under the unit cost function, the triangle inequality of the tree
alignment distance is not satisfied since
$\delta_{TA}(T_1, T_2) \nleq \delta_{TA}(T_1, T_3) + \delta_{TA}(T_3, T_2)$.
In the tree edit model the triangle inequality is satisfied.
Algorithms Together with the definition of the tree alignment distance,
Jiang et al. proposed an algorithm that computes this distance in
$O(|T_1| \cdot |T_2| \cdot (degree(T_1) + degree(T_2))^2)$ time, which is still the asymptotically best
algorithm [95]. For a fixed number d of possible deletions and insertions,
Jansson & Lingas presented an algorithm that calculates the tree alignment
distance⁴ in $O(n^2 \cdot \log n \cdot k^3 \cdot d^2)$ where $n = \max\{|T_1|, |T_2|\}$ and
$k = \max\{degree(T_1), degree(T_2)\}$ [92].
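The alignment distance of small trees can be explored with a direct memoized recursion on the leftmost trees of two forests. The sketch below is a naive formulation under unit costs and is not the algorithm of Jiang et al. with the bound stated above. The forest decomposition it uses (the leftmost alignment node is a relabel, deletion or insertion, and a deleted or inserted node spans a prefix of the other forest) is an assumption of this sketch. Trees are nested tuples `(label, children)`.

```python
from functools import lru_cache


def node(label, *children):
    return (label, tuple(children))


@lru_cache(maxsize=None)
def align(f1, f2):
    """Unit-cost alignment distance between two ordered forests."""
    if not f1 and not f2:
        return 0
    best = float("inf")
    if f1:
        (a, kids1), rest1 = f1[0], f1[1:]
        # Leftmost alignment node is (a, gap): below it, the children of a
        # are aligned with the first k trees of f2.
        for k in range(len(f2) + 1):
            best = min(best, 1 + align(kids1, f2[:k]) + align(rest1, f2[k:]))
    if f2:
        (b, kids2), rest2 = f2[0], f2[1:]
        # Leftmost alignment node is (gap, b), spanning the first k trees of f1.
        for k in range(len(f1) + 1):
            best = min(best, 1 + align(f1[:k], kids2) + align(f1[k:], rest2))
    if f1 and f2:
        # Leftmost alignment node is (a, b).
        best = min(best, (a != b) + align(kids1, kids2) + align(rest1, rest2))
    return best


def tree_alignment_distance(t1, t2):
    return align((t1,), (t2,))


t1 = node("a", node("x", node("b"), node("c")), node("d"))
t2 = node("a", node("b"), node("y", node("c"), node("d")))
t3 = node("a", node("b"), node("c"), node("d"))
print(tree_alignment_distance(t1, t2))  # 4
print(tree_alignment_distance(t1, t3))  # 1
print(tree_alignment_distance(t3, t2))  # 1
```

With these three trees, the direct distance 4 exceeds the sum 1 + 1 of the detour via the third tree, which illustrates the violated triangle inequality.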
⁴ Precisely, the similarity version.
Variants Wang & Zhao make three interesting contributions considering
the tree alignment distance for RNA structure comparison [221]:
1. They provide a model for the tree alignment distance including gaps
where the notion of gaps in a tree corresponds to tree patterns as
done in [207]. However, Wang & Zhao consider a simpler gap score
function where the score of a gap is a constant function. They derive
an algorithm from Jiang et al.'s algorithm that computes the alignment
distance, involving gap scores, in the same time complexity.
2. They present a modified version of Jiang's algorithm that improves the
space complexity to $O(degree(T_1) \cdot \log |T_1| \cdot |T_2| \cdot (degree(T_1) + degree(T_2)))$
while having the same time complexity as the Jiang algorithm. However,
an optimal alignment cannot be obtained by a straightforward
backtracking procedure. As space is crucial in their application they
use a naive algorithm that raises the time complexity to
$O(|T_1|^2 \cdot |T_2| \cdot (degree(T_1) \cdot degree(T_2))^2)$ while achieving their improved space
complexity.
3. They consider the problem of parametric tree alignment which was
studied earlier for sequences [71] and gives clues to the parameter space
of tree alignments. In particular, the scoring of edit operations is often
not deducible from the problem and therefore somewhat arbitrary.
Parametric alignment partitions the parameter space into regions such
that in each region any alignment that is optimal for some choice of
parameters inside the region is optimal throughout that entire region
and nowhere else. Software to visualize and explore the parameter
space is also provided.
Isolated Subtree Distance
The isolated subtree distance was first proposed in [198]⁵ and is also referred
to as the structure respecting edit distance or structure preserving mapping
distance. Intuitively, it restricts mappings such that two separate subtrees
in $T_1$ are mapped to two separate subtrees in $T_2$. Alternatively formulated,
trees can only be mapped to trees and not to forests.
⁵ In [198], Tanaka & Tanaka refer to an earlier publication that introduces this
distance [197]. As it is written in Japanese I was not able to validate this. Further early
contributions in the field of tree editing, again in Japanese, are given in [1, 193–196].
Mappings A mapping M between trees $T_1$ and $T_2$ is an isolated subtree
mapping if for all $(v_1, w_1), (v_2, w_2), (v_3, w_3) \in M$ the following holds:
$lca(v_1, v_2) = lca(v_1, v_3)$ iff $lca(w_1, w_2) = lca(w_1, w_3)$   (isolated subtree condition)
The isolated subtree distance $\delta_{TI}$ between $T_1$ and $T_2$ is the minimum cost that
an isolated subtree mapping between them can achieve. Formally,
$\delta_{TI}(T_1, T_2) = \min\{\delta(M) \mid M \text{ is an isolated subtree mapping between } T_1 \text{ and } T_2\}$.   (2.14)
Figure 2.10 shows an example of a mapping that is not an isolated subtree
mapping, but corresponds to an alignment. The metric properties of the
isolated subtree distance are proven in [236].
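The isolated subtree condition can be tested mechanically. The Python sketch below checks only this lca condition, not the general mapping conditions. It assumes nested-tuple trees `(label, children)` with unique labels, and its function names are illustrative. The example maps $a(x(b, c), d)$ to $a(b, c, d)$ by equal labels.

```python
from itertools import product


def node(label, *children):
    return (label, tuple(children))


def root_paths(tree):
    """Map each node label to the path of labels from the root to that node."""
    paths = {}

    def visit(subtree, path):
        label, children = subtree
        paths[label] = path + (label,)
        for child in children:
            visit(child, paths[label])

    visit(tree, ())
    return paths


def lca(paths, u, v):
    """Least common ancestor: last common node on the two root paths."""
    common = None
    for x, y in zip(paths[u], paths[v]):
        if x != y:
            break
        common = x
    return common


def satisfies_isolated_subtree_condition(m, t1, t2):
    p1, p2 = root_paths(t1), root_paths(t2)
    for (v1, w1), (v2, w2), (v3, w3) in product(m, repeat=3):
        same1 = lca(p1, v1, v2) == lca(p1, v1, v3)
        same2 = lca(p2, w1, w2) == lca(p2, w1, w3)
        if same1 != same2:
            return False
    return True


t1 = node("a", node("x", node("b"), node("c")), node("d"))
t2 = node("a", node("b"), node("c"), node("d"))
m = {("a", "a"), ("b", "b"), ("c", "c"), ("d", "d")}
print(satisfies_isolated_subtree_condition(m, t1, t2))  # False
```

The condition fails for the triple b, c, d: in the first tree lca(b, c) is x while lca(b, d) is a, but in the second tree both are a.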
Algorithms Tanaka & Tanaka proposed an algorithm that computes the
isolated subtree distance in $O(|T_1| \cdot |T_2| \cdot \min\{leaves(T_1), leaves(T_2)\})$ time
and $O(|T_1| \cdot |T_2|)$ space [198]. Zhang improved the worst case complexity to
$O(|T_1| \cdot |T_2|)$ time and space [236]. Later, Richter presented an algorithm that
computes the isolated subtree distance in $O(|T_1| \cdot |T_2| \cdot degree(T_1) \cdot degree(T_2))$
time and $O(|T_1| \cdot depth(T_2) \cdot degree(T_2))$ space. For balanced trees of bounded
degree k, i.e. each internal node has k children, this algorithm consumes less
space than Zhang's Algorithm.
Top-Down Distance
Although I introduce the top-down distance at the end of this survey, its
introduction by Selkow opened the discipline of tree edit distances in 1977
[177]. He considered a tree edit distance model where insertions and deletions
are restricted to the leaves of a tree: Only leaves may be deleted, and a node
may be inserted only as a son of a leaf.
[Figure 2.10 (tree drawings): $T_1 = a(x(b, c), d)$, $T_2 = a(b, c, d)$, $A = (a, a)((x, \lambda)((b, b), (c, c)), (d, d))$]
Figure 2.10: The mapping between $T_1$ and $T_2$ is not an isolated subtree mapping,
since it violates the isolated subtree condition. In particular, for $T_1$
$lca(b, c) \neq lca(b, d)$ holds, but for $T_2$ $lca(b, c) = lca(b, d)$ holds. Even though
this mapping is not a valid isolated subtree mapping, there exists a corresponding
alignment A.
Mappings In terms of mappings, this has the consequence that whenever,
w.l.o.g., a node v in $T_1$ is mapped to some node in $T_2$, all ancestor nodes of
v must be included in the mapping. Given some mapping M between $T_1$
and $T_2$, let $M|_1$ and $M|_2$ be the nodes in $T_1$ and $T_2$ that are touched by M,
respectively. Let $ancs_T(v)$ denote the set of all ancestor nodes of v. Formally,
a mapping M between trees $T_1$ and $T_2$ is a top-down mapping if the following
holds:
$(v, w) \in M \Rightarrow ancs_{T_1}(v) \subseteq M|_1$ and $ancs_{T_2}(w) \subseteq M|_2$   (2.15)
The top-down distance $\delta_{TD}$ between $T_1$ and $T_2$ is the minimum cost that a
top-down mapping between them can achieve:
$\delta_{TD}(T_1, T_2) = \min\{\delta(M) \mid M \text{ is a top-down mapping between } T_1 \text{ and } T_2\}$   (2.16)
Recently, Valiente proposed a dual model, a bottom-up distance between
trees, where deletions and insertions must begin at the root level [210].
Algorithms Selkow's algorithm computes the top-down distance in
$O(|T_1| \cdot |T_2|)$ time and space [170, 177]. The algorithm was implemented and applied
to the problem of identifying syntactic differences in [235].
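A recursion in the spirit of the top-down model can be sketched as a sequence edit distance over the child subtrees of two mapped roots, where deleting or inserting a child removes or adds its whole subtree. The Python sketch below assumes unit costs, nested-tuple trees `(label, children)`, and that the two roots are always mapped onto each other. It is an illustration under these assumptions, not a transcription of Selkow's algorithm.

```python
def node(label, *children):
    return (label, tuple(children))


def size(tree):
    """Number of nodes of a tree."""
    return 1 + sum(size(child) for child in tree[1])


def top_down_distance(t1, t2):
    """Unit-cost distance where nodes are deleted or inserted only as leaves."""
    (a, kids1), (b, kids2) = t1, t2
    m, n = len(kids1), len(kids2)
    # d[i][j]: cost of editing the first i child subtrees into the first j.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids1[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids1[i - 1]),      # delete whole subtree
                d[i][j - 1] + size(kids2[j - 1]),      # insert whole subtree
                d[i - 1][j - 1]                        # map the two child roots
                + top_down_distance(kids1[i - 1], kids2[j - 1]),
            )
    return (a != b) + d[m][n]


t1 = node("a", node("x", node("b"), node("c")), node("d"))
t3 = node("a", node("b"), node("c"), node("d"))
print(top_down_distance(t1, t3))  # 4
```

The inner node x cannot simply be removed in this model, so the result 4 is larger than the unrestricted tree edit distance 1 of the same two trees.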
2.5.4 Related Problems
Similar Consensus Problems
The similar consensus problem is the problem of finding a largest
approximately common substructure in trees. For strings, a substructure is a subword.
For graphs, a substructure can be defined as a connected subgraph
which for trees results in my definition of a tree pattern. Let d be an integer;
the similar consensus problem is to find pattern trees $T_1'$ of $T_1$ and $T_2'$ of
$T_2$ such that the distance between $T_1'$ and $T_2'$ is within distance d and there
do not exist any other substructures $T_1''$ of $T_1$ and $T_2''$ of $T_2$ that satisfy
the distance constraint and $|T_1''| + |T_2''| \geq |T_1'| + |T_2'|$. The similar consensus
problem was studied for the different distances that were presented in this
section:
distance                    time complexity                                                      studied in
tree edit distance          $O(d^2 \cdot |T_1| \cdot |T_2| \cdot C(T_1) \cdot C(T_2))$,          [214]
                            where $C(T_i) = \min\{leaves(T_i), depth(T_i)\}$
tree alignment distance     $O(|T_1| \cdot |T_2| \cdot (degree(T_1) + degree(T_2))^2)$           [215]
isolated subtree distance   $O(d^2 \cdot |T_1| \cdot |T_2|)$                                     [216]
top-down distance           $O(d^2 \cdot |T_1| \cdot |T_2|)$                                     [217]
Tree Inclusion Problems
The tree inclusion problem is a variant of the general tree edit distance. In
terms of a maximum isomorphic subtree, a tree pattern $T_p$ is included in a
target tree $T$ if $T_p$ can be obtained from $T$ by node deletions. This
corresponds to an edit model that only supports the functions relabel and insert
where $T_p$ is the first and $T$ the second tree. Kilpeläinen & Mannila presented
an algorithm that solves this problem in $O(|T_p| \cdot |T|)$ [98]. Improvements and
variations of their algorithm are proposed in [3, 20, 160]. The classic problem
of tree pattern matching is a restricted version of the tree inclusion problem.
The deletion of nodes in the target tree is only allowed for leaf nodes in $T$
(and the trees that result from such deletions), which is equivalent to subtree
removals in $T$. This corresponds to the tree inclusion problem in the domain
of Selkow's top-down distance. Among others, substantial contributions are
reported in [28, 38, 86, 106, 119, 125, 157].
Zhang et al. considered approximate tree matching in the presence of
variable length don't cares (VLDC) [219]. The query tree can contain wildcards
that may match multiple nodes. For example, symbol | substitutes
for a part of a path from the root to a leaf in the target tree. Symbol ^
matches a path and all subtrees emanating from the nodes on that path.
Building upon these wildcards, the authors introduced a querying language
for inexact matching of trees.
2.5.5 Arc Annotated Sequences
The pure sequence based approaches to compare RNA secondary structures
are known to have the problem of violating the tree structure (see Section
2.5.2). On the other hand, tree edit based approaches are so far limited to
comparing RNA secondary structures. Moreover, in the coarse grained tree
representation the meaning of tree edit operations in the process of editing
RNA structures is difficult to motivate biologically. In the natural tree
representation, the tree edit model cannot account adequately for a deletion
of a base-pair bond. This gave rise to the idea of incorporating structural
constraints into sequence alignment strategies.
The first structurally refined sequence alignment algorithm was proposed
by Sankoff [172], although for the more sophisticated problem of folding and
aligning simultaneously. Bafna et al. introduced the concept of RNA strings
which include both the primary sequence and the secondary structure
information [4]. Beside matching problems on RNA strings, they introduced
an alignment model for RNA strings. Evans generally studied annotation
schemes that add auxiliary information to a sequence. These can be taken
into account when the sequences are analyzed [45]. Evans introduced the
general notion of arc-annotated sequences. An arc is a link joining two
different symbols of a sequence and can be used to represent a binary relation
between them. The definition of an arc-annotated sequence complies with the
definition of a tertiary structure⁶ (see Section 2.2). As a natural extension
of the longest common subsequence problem, Evans introduced the longest
arc-preserving common subsequence problem [45]. This problem is not only
studied extensively due to its potential application for RNA structure
comparison, but also because it has a compact definition, is easy to understand
and turned out to be NP-hard even for RNA secondary structures [114].
Zhang et al. introduced a further edit model for RNA structures including
tertiary interactions [242]. For RNA secondary structures, their model
corresponds to the tree edit model in conjunction with the natural tree
representation. Finally, Jiang et al. suggested a set of edit operations for RNA
structures that are biologically motivated and form a superset of the edit
operations of the formerly mentioned models [94]. I introduce this general edit
model for RNA structures first and use its terminology to give a uniform
description of the other models.
⁶ A general arc-annotated structure additionally allows a connection of one to many
characters. I neglect this case since complex interactions like base-triplets are beyond the
scope of this thesis.
[Figure 2.11 (drawing): two aligned RNA sequences with arcs, annotated with the operations base-pair match, base-pair mismatch, base-pair deletion, base-pair altering and base-pair breaking]
AAAGAAUAAUAUUACGGGACCCUAUAAACGAAAACCG
AGAGAAUAACAUU-CGGGACCCUAUAAAC-AAAAC-G
Figure 2.11: Structural edit operations of Jiang et al.'s general edit model for
RNA structures. Sequence edit operations that do not involve base-pairs are omitted
in this figure.
A General Edit Model for RNA Structures
Jiang et al. proposed a set of edit operations for RNA structures that are
motivated by the evolution of structural RNA [94].
Edit operations An edit operation that affects the primary and the secondary
structure transforms an RNA structure $(S_1, P_1)$ into a structure
$(S_2, P_2)$ by modifying both $S_1$ and $P_1$. Since a deletion or insertion of
a base in $S_1$ requires adjusting the indices of the base-pairs in $P_1$, the
definition of edit operations is intricate on that level. I introduce a
terminology for structural edit operations that is consistent with the
terminology of the sequence and tree edit model. To uniquely define structural
edit operations, the positions that are affected by the operation must be
specified, as well as the new base for base-replacements. For convenience, I
define the rules in terms of their effect on sequence and structure. The
parameterized edit operations can be derived from this description. Let
$u, v, w \in \Sigma_{RNA}^{*}$ and $a, b, c, d \in \Sigma_{RNA}$.
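Let $u', v', w'$ denote the parts of the structure string that belong to $u$,
$v$ and $w$, respectively. In each rule, the upper row is the structure and the
lower row is the sequence.
\[
\begin{array}{c} u'\,(\,v'\,)\,w' \\ u\;a\;v\;b\;w \end{array}
\;\longrightarrow\;
\begin{array}{c} u'\,(\,v'\,)\,w' \\ u\;c\;v\;d\;w \end{array}
\qquad \text{(base-pair replacement)}
\]
This notation is read as follows: $(S_1, P_1)$ is edited to $(S_2, P_2)$ where
$S_1 = uavbw$, $P_1 = u'(v')w'$, $S_2 = ucvdw$, and $P_2 = u'(v')w'$.
\[
\begin{array}{c} u'\,(\,v'\,)\,w' \\ u\;a\;v\;b\;w \end{array}
\;\longrightarrow\;
\begin{array}{c} u'\,v'\,w' \\ u\;v\;w \end{array}
\qquad \text{(base-pair deletion)}
\]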
During the evolution of an RNA structure, it can happen that the bond
between two bases becomes too weak due to mutations in other regions of
the structure. Accordingly, the disappearance of a base-pair bond is among
the structural edit operations:
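\[
\begin{array}{c} u'\,(\,v'\,)\,w' \\ u\;a\;v\;b\;w \end{array}
\;\longrightarrow\;
\begin{array}{c} u'\,.\,v'\,.\,w' \\ u\;a\;v\;b\;w \end{array}
\qquad \text{(base-pair breaking)}
\]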
The scenario where a base-pair bond disappears because one of the pairing
bases is deleted is modeled by either of the following two edit-operations.
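\[
\begin{array}{c} u'\,(\,v'\,)\,w' \\ u\;a\;v\;b\;w \end{array}
\;\longrightarrow\;
\begin{array}{c} u'\,v'\,.\,w' \\ u\;v\;b\;w \end{array}
\qquad \text{(base-pair altering right)}
\]
\[
\begin{array}{c} u'\,(\,v'\,)\,w' \\ u\;a\;v\;b\;w \end{array}
\;\longrightarrow\;
\begin{array}{c} u'\,.\,v'\,w' \\ u\;a\;v\;w \end{array}
\qquad \text{(base-pair altering left)}
\]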
Bases that are not paired undergo the classical sequence edit operations:
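\[
\begin{array}{c} u'\,.\,v' \\ u\;a\;v \end{array}
\;\longrightarrow\;
\begin{array}{c} u'\,.\,v' \\ u\;c\;v \end{array}
\qquad \text{(base-replacement)}
\]
\[
\begin{array}{c} u'\,.\,v' \\ u\;a\;v \end{array}
\;\longrightarrow\;
\begin{array}{c} u'\,v' \\ u\;v \end{array}
\qquad \text{(base-deletion)}
\]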
Each of the edit operations can also be read and applied from right to left.
For edit operations that involve the deletion of bases or base-pairs, this defines
the corresponding insert versions. Figure 2.11 shows the edit operations in
an alignment on the sequence and structure level.
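The effect of the structural edit operations on a sequence and its structure can
be sketched in a few lines of Python. This is only an illustration under
assumptions that are not part of the text: the structure is given as a
dot-bracket string without pseudoknots, positions are 0-based, and all function
names are hypothetical.

```python
def pair_table(struct):
    """Map every paired position to its partner (0-based) in a dot-bracket string."""
    stack, partner = [], {}
    for k, ch in enumerate(struct):
        if ch == "(":
            stack.append(k)
        elif ch == ")":
            i = stack.pop()
            partner[i], partner[k] = k, i
    return partner


def _check(seq, struct, i, j):
    # The operations below are only defined for an existing base-pair (i, j).
    if len(seq) != len(struct) or not i < j or pair_table(struct).get(i) != j:
        raise ValueError("positions i < j must form a base-pair")


def _drop(text, positions):
    """Remove the given positions from a string."""
    return "".join(ch for k, ch in enumerate(text) if k not in positions)


def _unpair(struct, positions):
    """Mark the given positions as unpaired."""
    return "".join("." if k in positions else ch for k, ch in enumerate(struct))


def base_pair_breaking(seq, struct, i, j):
    """The bond disappears, both bases stay."""
    _check(seq, struct, i, j)
    return seq, _unpair(struct, {i, j})


def base_pair_deletion(seq, struct, i, j):
    """Both bases and the bond are removed."""
    _check(seq, struct, i, j)
    return _drop(seq, {i, j}), _drop(struct, {i, j})


def base_pair_altering(seq, struct, i, j, keep_right):
    """One partner is removed, the other one stays as an unpaired base.

    keep_right=True removes the left base (the rule labeled 'altering right'
    in the text), keep_right=False removes the right base ('altering left').
    """
    _check(seq, struct, i, j)
    gone, kept = (i, j) if keep_right else (j, i)
    return _drop(seq, {gone}), _drop(_unpair(struct, {kept}), {gone})
```

Dropping a position from both strings at once sidesteps the index adjustment of
$P_1$ that makes the set-based definition intricate.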
The concept of edit-sequences can be naturally applied: Let $E$ be an
edit-sequence $e_1, e_2, \ldots, e_n$. $E$ transforms $(S, P)$ into $(S', P')$
if there is a sequence of structures $(S_0, P_0), (S_1, P_1), \ldots, (S_n, P_n)$
such that $(S, P) = (S_0, P_0)$, $(S', P') = (S_n, P_n)$ and $(S_i, P_i)$
results from the application of $e_i$ to $(S_{i-1}, P_{i-1})$ for $i \in [1, n]$.
Let $\delta$ be a cost function defined on edit operations. The cost of an
edit-sequence $E$ is the sum of the costs of its edit operations, that is:
$\delta(E) = \sum_{i=1}^{n} \delta(e_i)$. The general edit distance $\delta_{GE}$
between structures $(S_1, P_1)$ and $(S_2, P_2)$ is the minimum cost that is
necessary to transform $(S_1, P_1)$ into $(S_2, P_2)$. Formally,
\[
\delta_{GE}((S_1, P_1), (S_2, P_2)) = \min\{\delta(E) \mid E \text{ is an edit sequence transforming } (S_1, P_1) \text{ into } (S_2, P_2)\}. \tag{2.17}
\]
Algorithms Jiang et al. provided algorithms and complexity results for
a fixed scoring scheme, i.e. the cost of an edit operation does not account
for the involved bases, or equivalently, it is a constant [94]. Computing
$\delta_{GE}$ between $(S_1, P_1)$ and $(S_2, P_2)$ where $P_1$ is a tertiary
structure and $P_2 = \emptyset$ is MAX SNP-hard. For a restricted model that
omits the base-pair altering and base-pair deletion edit operations, they
propose an algorithm that requires $O(|S_1|^2 \cdot |S_2|^2)$ time. If $P_1$ is a
secondary structure and $P_2 = \emptyset$, the general (unrestricted) problem is
solvable in $O(|S_1| \cdot |S_2|)$ time. The case when both $P_1$ and $P_2$ are
secondary structures is not considered in [94]. I will show in Section 2.5.5
that the general edit model with a certain scoring function is NP-hard.
Bafna et al.s Model
Bafna et al. introduced a sequence alignment problem for RNA secondary
structures that maximizes both base and base-pair replacement scores [4].
Let $\gamma(a, b)$ be the score for replacing base $a$ by base $b$ and let
$\beta(ab, cd)$ be the score for relabeling a base-pair $ab$ by base-pair $cd$.
Given an alignment $A$ of sequences $S_1$ and $S_2$, I define $A_{S_i}$ to be the
$i$th row in $A$. Let $gap_{S_i}[j]$ be the number of gaps that are inserted in
$S_i$ up to the $j$th position in $A$. Formally:
\[
gap_{S_i}[j] =
\begin{cases}
j & \text{if } A_{S_i}[j] = -, \\
|\{\, l \mid A_{S_i}[l] = - \text{ and } l \le j \,\}| & \text{otherwise.}
\end{cases}
\]
Bafna et al. use the following trick to obtain a compact definition of their
model: They define $S_i[0] = -$. If there is a gap in $S_1$ at position $i$,
$S_1[i - gap_{S_1}[i]]$ evaluates to $-$, which corresponds to an insertion. The
corresponding holds for $S_2$. Let $m$ be the number of columns in an alignment
$A$. The score of $A$ is the sum of the scores of the aligned bases, be they
paired or unpaired, and the scores of the aligned base-pairs. The sequence score
is defined as
\[
\gamma(A) = \sum_{1 \le i \le m} \gamma\bigl(S_1[i - gap_{S_1}[i]],\; S_2[i - gap_{S_2}[i]]\bigr).
\]
The base-pair scoring is defined as:
\[
\beta(A) = \sum_{1 \le i < j \le m} \beta\bigl(S_1[i - gap_{S_1}[i]]\,S_1[j - gap_{S_1}[j]],\; S_2[i - gap_{S_2}[i]]\,S_2[j - gap_{S_2}[j]]\bigr)
\]
where $(i - gap_{S_1}[i],\, j - gap_{S_1}[j]) \in P_1$ and
$(i - gap_{S_2}[i],\, j - gap_{S_2}[j]) \in P_2$.
Bafna et al.'s score $\sigma_{BAF}$ is the sum of these scores:
\[
\sigma_{BAF}(A) = \gamma(A) + \beta(A) \tag{2.18}
\]
The similarity score of secondary structures $(S_1, P_1)$ and $(S_2, P_2)$ is
then given by:
\[
\sigma_{BAF}((S_1, P_1), (S_2, P_2)) = \max\{\sigma_{BAF}(A) \mid A \text{ is an alignment of } S_1 \text{ and } S_2\} \tag{2.19}
\]
Note that $S_1$ and $S_2$ are sequences and, thus, $A$ is a sequence alignment.
Algorithms Bafna et al. provide an algorithm that computes
$\sigma_{BAF}((S_1, P_1), (S_2, P_2))$ in $O(|S_1|^2 \cdot |S_2|^2)$.
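To make the gap bookkeeping concrete, the following Python sketch scores one
given alignment according to Equation (2.18). It does not search for the optimal
alignment. The function names, the representation of an alignment as two equally
long strings with '-' as gap character, and the representation of $P_1$ and
$P_2$ as sets of 1-based position pairs are assumptions made for this
illustration; the two scoring functions are passed in by the caller.

```python
def gap_table(row):
    """gap[j] for columns j = 1..m of one alignment row (index 0 is unused).

    A gap column j gets the value j, so that j - gap[j] = 0 addresses the
    artificial gap character stored at position 0 of the sequence.
    """
    gaps, table = 0, [0]
    for j, ch in enumerate(row, start=1):
        if ch == "-":
            gaps += 1
            table.append(j)
        else:
            table.append(gaps)
    return table


def bafna_score(row1, row2, p1, p2, base_score, pair_score):
    """Score of one alignment: aligned bases plus aligned base-pairs."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    m = len(row1)
    s1 = "-" + row1.replace("-", "")  # S_1[0] is the gap character
    s2 = "-" + row2.replace("-", "")
    g1, g2 = gap_table(row1), gap_table(row2)
    pos1 = [j - g1[j] for j in range(m + 1)]  # column -> position in S_1
    pos2 = [j - g2[j] for j in range(m + 1)]  # column -> position in S_2

    sequence_part = sum(base_score(s1[pos1[i]], s2[pos2[i]])
                        for i in range(1, m + 1))
    pair_part = sum(
        pair_score(s1[pos1[i]] + s1[pos1[j]], s2[pos2[i]] + s2[pos2[j]])
        for i in range(1, m + 1)
        for j in range(i + 1, m + 1)
        if (pos1[i], pos1[j]) in p1 and (pos2[i], pos2[j]) in p2
    )
    return sequence_part + pair_part
```

A gap column maps to position 0, which never occurs in a base-pair set, so
base-pairs with a deleted partner contribute nothing to the base-pair part.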
Bafna et al.'s Model Revisited Bafna et al.'s model has been criticized
for not systematically treating base-pairs as basic units [45, 94]. I show
that their model can be expressed in the general edit model with a special
scoring scheme: Function $\gamma$ scores base replacements, base-insertions and
base-deletions. The scoring contributions are $\gamma(a, b)$, $\gamma(-, b)$ and
$\gamma(a, -)$, respectively. Clearly, function $\beta$ in Equation (2.18) only
accounts for base-pair replacements. In this case, the function $\gamma$
contributes additionally to the overall score for the aligned base-pairs. Thus,
the score for a base-pair replacement of $ab$ with $cd$ is
$\beta(ab, cd) + \gamma(a, c) + \gamma(b, d)$. Otherwise, a base, be it paired or
unpaired, can be aligned with any other base and the scoring contribution for
aligning a base $a$ with a base $b$ is $\gamma(a, b)$. A scoring contribution of
0 for the base-pair breaking operation allows paired bases to be aligned to
unpaired bases without a penalty. The deletion of a base-pair is composed of a
base-pair breaking and two base-deletions. The corresponding holds for the
base-pair insertion. A base-pair altering is composed of a base-pair breaking, a
base-match and a base-indel. Summarizing these observations, $\sigma_{BAF}$ can
be calculated by employing the following scoring scheme for Jiang et al.'s
general edit model:

    edit operation          score
    base replacement        $\gamma(a, b)$
    base indel              $\gamma(a, -)$ and $\gamma(-, b)$
    base-pair replacement   $\beta(ab, cd) + \gamma(a, c) + \gamma(b, d)$
    base-pair breaking      0

I conclude that Bafna et al.'s model is a proper structural alignment model,
which means that it can be expressed in Jiang et al.'s general edit model.
Whether the scoring of edit operations is a good choice or not remains to be
analyzed.
The Longest Arc-Preserving Common Subsequence Problem
The longest arc-preserving common subsequence problem is an extension of
the classic longest common subsequence problem. A sequence $S'$ is a
subsequence of a sequence $S$ if $S'$ can be obtained from $S$ by deleting
characters. The longest common subsequence problem is to find a longest sequence
$S'$ that is a subsequence of $S_1, S_2, \ldots, S_n$.
Mostly driven by the application of RNA structure comparison, including
tertiary structures, Evans generalized the problem for arc-annotated
sequences [45]. Let $(S_1, P_1)$ and $(S_2, P_2)$ be arc-annotated sequences,
which means that $P_1$ and $P_2$ can be tertiary structures throughout this
section. A longest common subsequence $S'$ of $S_1$ and $S_2$ induces a mapping
between characters in $S_1$ and $S_2$ by associating the characters $i_k$ in
$S_1$ and $j_k$ in $S_2$ that correspond to the $k$th position of $S'$. Suppose
$M = \{(i_1, j_1), (i_2, j_2), \ldots, (i_{|S'|}, j_{|S'|})\}$ is such a mapping.
The longest common subsequence $S'$ is arc-preserving if the arcs touched by the
mapping are preserved. That is, for any $(i_k, j_k), (i_l, j_l) \in M$ holds:
\[
(i_k, i_l) \in P_1 \iff (j_k, j_l) \in P_2. \tag{2.20}
\]
The longest arc-preserving common subsequence (LAPCS) problem is to find
a longest common subsequence $S'$ that is arc-preserving.
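Condition (2.20) is easy to check for a given mapping. The following Python
sketch does so; the function name and the representation of $M$, $P_1$ and $P_2$
as collections of integer pairs are assumptions for illustration. It only
verifies a candidate mapping and does not solve the LAPCS problem.

```python
def is_arc_preserving(mapping, p1, p2):
    """Check condition (2.20) for a mapping M between two arc-annotated sequences.

    mapping: iterable of pairs (i_k, j_k), positions in S_1 and S_2
    p1, p2:  sets of arcs (i, j) on S_1 and S_2, respectively
    """
    pairs = list(mapping)
    return all(
        ((ik, il) in p1) == ((jk, jl) in p2)
        for ik, jk in pairs
        for il, jl in pairs
    )
```

For example, mapping both ends of an arc in $S_1$ to two positions that are not
connected in $S_2$ violates the condition, whereas mapping only one end of the
arc does not.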
Different instances of the problem, depending on the complexity of the arc set
(here the complexity of RNA structures), are studied in the literature. The
relevant instances in the context of RNA sequence and structure comparison
are LAPCS($P_1$, $P_2$) where $P_i$ belongs to one of the following classes:
- PLAIN: no structure, i.e. $P_i = \emptyset$
- NESTED: $P_i$ is a secondary structure
- CROSSING: $P_i$ is a tertiary structure
I follow this terminology since it is established in the literature concerning
LAPCS problems [2, 45, 93, 114]. In the following, I review the most important
results and comment on the LAPCS(NESTED,NESTED) problem, which is particularly
interesting for comparing RNA secondary structures.
Algorithms LAPCS(PLAIN,PLAIN) is the well-known longest common
subsequence problem, which can be solved in $O(|S_1| \cdot |S_2|)$ [76]. If the
number of sequences is unrestricted, this problem is NP-complete [124].
Otherwise, if at least one structure is CROSSING, the problem is NP-hard
[45]. A maximization problem, such as the LAPCS problem, is
$\rho$-approximable if there exists a polynomial time algorithm $\mathcal{A}$ and
a positive number $\rho$ such that the output of $\mathcal{A}$ is within a factor
$\frac{1}{\rho}$ of the optimum. If at least one structure is CROSSING, the LAPCS
problem is also MAX SNP-hard, which has the consequence that it is not
approximable within $\rho = 1 + \epsilon$ for some positive $\epsilon$ [93]. A
2-approximation algorithm for these problems is proposed in [93].
Probably the most relevant problem in the context of RNA structures is
the LAPCS(NESTED,NESTED) problem to compare RNA secondary structures.
The NP-hardness of this problem was shown in [114].
A LAPCS(NESTED,NESTED) that can be obtained by at most $k_1$ and $k_2$
character deletions (together with the corresponding arcs) can be calculated
in $O(3.31^{k_1 + k_2})$ [2]. A polynomial time algorithm for the
LAPCS(NESTED,PLAIN) problem, running in $O(|S_1| \cdot |S_2|^3)$ time, is
presented in [93].
LAPCS(NESTED,NESTED) Revisited A longest arc-preserving common
subsequence of secondary structures $(S_1, P_1)$ and $(S_2, P_2)$ maps
characters from $S_1$ to $S_2$. In the following, I examine which edit
operations of the general edit model are compatible with such a mapping,
resulting in an equivalent edit-based description of the LAPCS(NESTED,NESTED)
problem. The arc-preserving property (2.20) of a longest arc-preserving common
subsequence guarantees that if both bases of a base-pair are mapped, then
they must be mapped to bases that are also paired. In terms of the general
edit model for RNA structures, this means that there must exist a base-pair
match operation but no base-pair breaking. The base-pair match adds
two new characters to the longest arc-preserving common subsequence. The
base-pair breaking operation can be excluded by assigning an infinite negative
score to it. If only one base of a base-pair is mapped, then the other base
must not exist in the mapping. This adds one new character to the longest
arc-preserving common subsequence. The base-pair altering operations model
exactly this scenario. Clearly, a base-pair deletion, i.e. both partners and the
connecting arc are deleted, is also compatible with a LAPCS mapping. If a
character is not paired, it can be mapped (matched) to another unpaired base
(the mapping to a paired base is treated by the base-pair altering operation)
or not appear in the mapping. The sequence edit operations base-match and
base-indel handle these cases. Clearly, a longest arc-preserving common
subsequence does not allow any mismatches and, hence, the scoring contribution
for those cases must be $-\infty$. Summarizing these observations, the length of
a LAPCS can be calculated in Jiang et al.'s general edit model using the
following scoring scheme:

    edit operation        score
    base match            1
    base mismatch         $-\infty$
    base indel            0
    base-pair match       2
    base-pair mismatch    $-\infty$
    base-pair indel       0
    base-pair breaking    $-\infty$
    base-pair altering    1
The LAPCS can be derived from the resulting alignment. The complexity of
the LAPCS(NESTED,NESTED) problem was an important open question until
Lin et al. proved it to be NP-hard [114]. Since the computation of the general
edit distance using the above scores solves the LAPCS problem, I conclude
that the computation of the general edit distance for RNA secondary
structures is an NP-hard problem for the above scoring scheme. I assume that the
complexity results from the presence of the base-pair altering operations. If
those must be considered explicitly, i.e. the score is not built from simpler
edit operations, the number of resulting subproblems grows exponentially.
This remains to be further analyzed.
Zhang et al.s Model
Zhang et al. considered RNA secondary structure trees in the natural
representation that are compared under the tree edit and alignment model in
[238]. The entities of the tree nodes are bases and base-pairs (see Section 2.3).
Thus, the classic edit operations replace, insert and delete can be applied to
either an unpaired base or a base-pair. A replacement of a base by a
base-pair is prohibited. Ma et al. extended this model to general RNA structures
by extending the mapping concept of the tree edit model, which is the central
definition of this line of work [123, 222, 242].
The essential extension of the mapping is a new condition for crossing
base-pairs. Intuitively, the crossing pattern of tertiary interactions should
be conserved. I do not go into the details of their mapping definitions, since
their model was constructed on the assumption of certain edit operations on
structures. I will revisit their models in terms of Jiang et al.'s general model.
Algorithms Computing $\delta_{ZHA}((S_1, P_1), (S_2, P_2))$ where $P_1$ and
$P_2$ are tertiary structures is MAX SNP-hard [123]. Ma et al. considered a
simpler edit model for tertiary structures which restricts mappings between
tertiary structures to preserve secondary structure. Essentially, their
algorithm deletes tertiary structure interactions such that the resulting
secondary structure alignment is optimal. Let $stem(P)$ be the number of
stacking regions (stems) in an RNA structure $(S, P)$. Their algorithm requires
$O(stem(P_1) \cdot stem(P_2) \cdot |S_1| \cdot |S_2|)$ time and
$O(stem(P_1) \cdot stem(P_2))$ space.
Collins et al. presented a variant of $\delta_{ZHA}$ with the constraint that
bases and base-pairs can be specified that must be replaced by each other. They
do not improve the complexity, but their technique reduces the search space and
consequently the runtime [29]. Moreover, they propose a two-step strategy
for tertiary structures: In the first step, tertiary structures are ignored,
resulting in a secondary structure alignment. In the second step, the secondary
structure alignment is used to restrict the tertiary structure alignment.
Zhang et al.'s Model Revisited The edit operations in Zhang et al.'s
edit model can be applied to either unpaired bases or base-pairs. According to
Jiang et al.'s model, the structural edit operations are base-pair replace and
base-pair indel. The sequence counterparts are the operations base-replace
and base-indel. An edit operation that works on both an unpaired base and a
base-pair is not defined in their model. Thus, there is no base-pair altering
and no base-pair breaking operation. An infinite negative score for these edit
operations is sufficient to calculate Zhang et al.'s model under the general
edit model for RNA structures:

    edit operation        score
    base match            $\sigma_{m}$
    base mismatch         $\sigma_{mm}$
    base indel            $\sigma_{id}$
    base-pair match       $\pi_{m}$
    base-pair mismatch    $\pi_{mm}$
    base-pair indel       $\pi_{id}$
    base-pair breaking    $-\infty$
    base-pair altering    $-\infty$
2.5.6 Graphs
The most general mathematical construct to model relations between certain
objects is a graph. Clearly, an RNA tertiary structure can be modeled as
a graph where the vertices are bases and the edges are interactions between
them. Note, this concerns topological rather than geometric aspects of RNA
molecules. For example, such a graph abstracts from the relative angles
between stems.
Edit Models
Wan et al. considered the generalization of the tree edit model to graphs,
namely approximate graph isomorphism and subgraph isomorphism [218].
Both are known to be NP-complete. They outline an application where RNA
structures (not restricted to secondary structures) are compared under this
model.
Eigenvalue Spectrum of the Laplacian Matrix
In Schlick's group, two simpler types of graphs are considered [47, 52, 53].
The first are tree graphs, corresponding to a collapsed form of the natural
tree representation (see Section 2.3), where collapsed means that connected
non-branching nodes are merged into one node (ignoring labels). The second
are dual graphs. A dual graph can represent all tree-like RNA structures as
well as pseudoknotted structures. They focus on the problem of quantitatively
characterizing known structural motifs to identify missing or favored
motif topologies. For the topological classification of structures, they consider
the eigenvalue spectrum of the Laplacian matrix obtained from the graph's
adjacency matrix. In particular, the second eigenvalue reflects the overall
pattern of connectivity for a graph. Barash used the second eigenvalue to
detect structural changes in RNA that are caused by single point mutations
[6].
2.6 Discussion
The multitude of structure comparison models presented in Section 2.5 gives
rise to the question of why this thesis presents another RNA structure
comparison model. On the other hand, it shows that RNA structure comparison is
an active research field and that the problem is not sufficiently solved.
Nowadays, the detection of motifs with locally conserved structure in RNA
molecules is a hot topic in molecular biology. On the algorithmic side, the
problem of finding locally similar structures, given RNA secondary structures,
has not been studied thoroughly. The similar consensus problem for trees is
the only contribution I am aware of that detects locally similar regions in RNA
secondary structures (see Section 2.5.4). However, this model calculates
distance instead of similarity. As the distance between equal substructures is
always zero, the size of substructures must be considered additionally. Hence,
in the similar consensus problem, the largest subtree within some distance
threshold is sought. A similarity version in the spirit of the Smith-Waterman
algorithm [184] for trees would be more convenient to calculate locally similar
structures. Moreover, the similar consensus problem considers subtrees. This is
too restrictive since neighboring subtrees should be considered as local
structures as well, i.e. two adjacent stems in a multiloop could be the most
similar substructure, which corresponds to two different subtrees.
Another problem that has not been addressed thoroughly is the problem of
comparing multiple RNA secondary structures. As multiple sequence
alignments emphasize sequence conserved regions, multiple structure
alignments emphasize structurally conserved regions. A multiple structural
alignment is useful for phylogenetic analyses, identification of conserved
motifs, and domain and structure prediction.
In the following, I motivate my choice of the tree alignment model to
address the above problems. The model that I consider should have the
following properties:
- a biologically reasonable edit model,
- suitable for a generalization to multiple structures,
- built upon an adequate data structure for local similarity problems,
- allow algorithms with a low computational complexity.
Base-pair distances are suitable to compare structures that have the same
length, i.e. the same number of nucleotides. If the structures to be compared
have a different length, edit-based approaches provide a better distance
measure.
The approach of applying classical sequence alignments to string
representations of RNA secondary structures is more of a historical remark. At
the time these were invented, structural alignment strategies were just about to
emerge. More elaborate models are edit and alignment models for trees and
arc-annotated sequences.
Unlike sequences, trees are convenient to express substructures of RNA
secondary structures as coherent parts of the data structure. To explain,
adjacent base-pairs are neighbored in a tree while in a sequence they are
split in the 5' and 3' [...]

[...]
\[
\sigma_{FA}(F, G) = \max\{\sigma(A) \mid A \text{ is an alignment of } F \text{ and } G\}. \tag{3.2}
\]
62 Algorithms for Global and Local Forest Similarity
3.3 A Global Forest Alignment Algorithm
Jiang et al. presented an algorithm for the calculation of the tree alignment
distance which has the best known worst case complexity
$O(|T_1| \cdot |T_2| \cdot (degree(T_1) + degree(T_2))^2)$ [95].
The recursive nature of forests leads naturally to dynamic programming
algorithms. Algorithms of this sort are structurally recursive and avoid
recalculation of the same subproblem by tabulating intermediate results.
Adhering to the principles of algebraic dynamic programming [56, 58], I consider
the search space of a problem (all possible alignments) and its evaluation (e.g.
scoring, counting) separately. To derive a dynamic programming algorithm
from the search space observations, two questions must be answered:
1. Which subproblems arise in the recursion scheme?
2. What is the order of calculation?
The answer to the first question identifies the relevant subforests of the
problem. From these, an index-based notation that is necessary for an
implementation based on matrix recurrences can be derived. The answer to the
second question is important to formulate an imperative algorithm. Clearly,
everything must be calculated before it is used.
3.3.1 The Search Space of Forest Alignments
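To calculate the similarity of forests, all their alignments must be considered.
This set is the search space. The enumeration of all possible alignments of
two forests can be done in a structurally recursive fashion.

Lemma 3.1. Suppose $A$ is an alignment of $F$ and $G$. Then $label(A)$ is of the
form $(a, b)$, $(-, b)$ or $(a, -)$ for some $a, b \in \Sigma$. This leads to the
following case distinction:
1. If $label(A) = (a, b)$, then $a = label(F)$ and $b = label(G)$,
   $A^{\downarrow}$ is an alignment of $F^{\downarrow}$ and $G^{\downarrow}$, and
   $A^{\rightarrow}$ is an alignment of $F^{\rightarrow}$ and $G^{\rightarrow}$.
2. If $label(A) = (a, -)$, then $a = label(F)$ and,
   for some $r \in [0, len(G)]$, $A^{\downarrow}$ is an alignment of
   $F^{\downarrow}$ and $r{:}G$, and $A^{\rightarrow}$ is an alignment of
   $F^{\rightarrow}$ and $G{:}(len(G) - r)$.
3. If $label(A) = (-, b)$, then $b = label(G)$ and,
   for some $r \in [0, len(F)]$, $A^{\downarrow}$ is an alignment of $r{:}F$ and
   $G^{\downarrow}$, and $A^{\rightarrow}$ is an alignment of
   $F{:}(len(F) - r)$ and $G^{\rightarrow}$.
Proof. Follows directly from Definition (3.1) and the definition of the function
in Equation (2.10).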
Figure 3.1 gives a graphical view of Lemma 3.1. The search space of all possible
alignments of $F$ and $G$ is determined by Cases 1-3, and by all possible
choices of the split position $r$ in Cases 2 and 3. Scoring the alignments of the
search space follows the same structurally recursive pattern. The similarity of
$F$ and $G$ is the maximum of the scores $\sigma(a, b)$, $\sigma(a, -)$ and
$\sigma(-, b)$, each added to the similarity scores of the appropriate
subforests. Figure 3.2 shows the recursive formula for the calculation of the
forest alignment similarity that [...]

[Figure 3.1: Illustration of Case 1 (panel a) and Case 2 (panel b) of Lemma 3.1.
The shaded triangle symbolizes [...]]

[...] of $F$ such that $F'$ is a csf of $F$ and $F''$ is a csf of $F'$, then
$F''$ is a csf of $F$ (closed subforest transitivity). The pairs of subforests
that actually arise in the recursive calculation of the tree alignment
similarity, the relevant subforests, are the subject of the following lemma.
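\[
\begin{aligned}
relabel(F, G) &= \sigma(label(F), label(G)) + \sigma_{FA}(F^{\downarrow}, G^{\downarrow}) + \sigma_{FA}(F^{\rightarrow}, G^{\rightarrow}) \\
delete(F, G) &= \sigma(label(F), -) + \max_{r \in [0, len(G)]} \bigl\{ \sigma_{FA}(F^{\downarrow}, r{:}G) + \sigma_{FA}(F^{\rightarrow}, G{:}(len(G) - r)) \bigr\} \\
insert(F, G) &= \sigma(-, label(G)) + \max_{r \in [0, len(F)]} \bigl\{ \sigma_{FA}(r{:}F, G^{\downarrow}) + \sigma_{FA}(F{:}(len(F) - r), G^{\rightarrow}) \bigr\} \\
\sigma_{FA}(F, G) &=
\begin{cases}
0 & \text{if } F = \emptyset \text{ and } G = \emptyset \\
\sigma(label(F), -) + \sigma_{FA}(F^{\downarrow}, \emptyset) + \sigma_{FA}(F^{\rightarrow}, \emptyset) & \text{if } F \neq \emptyset \text{ and } G = \emptyset \\
\sigma(-, label(G)) + \sigma_{FA}(\emptyset, G^{\downarrow}) + \sigma_{FA}(\emptyset, G^{\rightarrow}) & \text{if } F = \emptyset \text{ and } G \neq \emptyset \\
\max \{ relabel(F, G),\; delete(F, G),\; insert(F, G) \} & \text{otherwise}
\end{cases}
\end{aligned}
\]
Figure 3.2: Recursive function to calculate the forest alignment similarity
$\sigma_{FA}$ of forests $F$ and $G$.

Lemma 3.2. Let $\bar{F}$ and $\bar{G}$ be maximal closed subforests of $F$ and
$G$, respectively. The pairs of subforests that are relevant for the calculation
of $\sigma_{FA}(F, G)$ due to Figure 3.2 have the form
$(\bar{F}{:}j,\; {}_{k}]\bar{G}[_{l})$ and $({}_{i}]\bar{F}[_{j},\; \bar{G}{:}l)$.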
Proof. Both pairs of csfs $(\bar{F}{:}j,\; {}_{k}]\bar{G}[_{l})$ and
$({}_{i}]\bar{F}[_{j},\; \bar{G}{:}l)$ where each of $i, j, k, l$ equals 0
represent the csf pair $(\bar{F}, \bar{G})$. I consider all possible transitions
to subforests due to Lemma 3.1. These are:
Case 1: $(F^{\downarrow}, G^{\downarrow})$, $(F^{\rightarrow}, G^{\rightarrow})$,
Case 2: $(F^{\downarrow}, k{:}G)$, $(F^{\rightarrow}, G{:}l)$,
Case 3: $(i{:}F, G^{\downarrow})$, $(F{:}j, G^{\rightarrow})$.
The following graph shows the pairs of csfs
$(\bar{F}{:}j,\; {}_{k}]\bar{G}[_{l})$ and $({}_{i}]\bar{F}[_{j},\; \bar{G}{:}l)$
surrounded by boxes. The arrows indicate possible transitions, each targeting
the index pair that is sufficient to represent the result of the transition.

[Transition graph: two boxed nodes, $(\bar{F}{:}j,\; {}_{k}]\bar{G}[_{l})$ and
$({}_{i}]\bar{F}[_{j},\; \bar{G}{:}l)$, connected by arrows labeled with the
transitions $(F^{\downarrow}, G^{\downarrow})$, $(F^{\rightarrow}, G^{\rightarrow})$,
$(F^{\downarrow}, r{:}G)$, $(F^{\rightarrow}, G{:}r)$, $(r{:}F, G^{\downarrow})$
and $(F{:}r, G^{\rightarrow})$.]

Each transition results in pairs of closed subforests that can be expressed in
one of the two forms. Thus, the subforests
$(\bar{F}{:}j,\; {}_{k}]\bar{G}[_{l})$ and $({}_{i}]\bar{F}[_{j},\; \bar{G}{:}l)$
are sufficient to describe the relevant subforests. For all combinations of
$i, j, k, l$, the subforests $(\bar{F}{:}j,\; {}_{k}]\bar{G}[_{l})$ and
$({}_{i}]\bar{F}[_{j},\; \bar{G}{:}l)$ can be reached from the pair $(F, G)$
by a series of transitions. Hence, Lemma 3.2 describes exactly the relevant
subforests.
It is obvious that each relevant subforest is a closed subforest. Note that
the converse does not hold, i.e. the pair of csfs
$({}_{i}]F[_{j},\; {}_{k}]G[_{l})$ where each of $i, j, k, l$ is greater than 0
is not a relevant pair of subforests for the calculation of $\sigma_{FA}(F, G)$.
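The recursion of Figure 3.2 can be written down almost literally. The following
Python sketch is meant as an illustration, not as the tabulating algorithm
developed below: it memoizes on whole subforests instead of index pairs, it
represents a forest as a tuple of (label, children) tuples, and its label scoring
(match 1, mismatch -1, indel -1) is a hypothetical choice. The slices `G[:r]` and
`G[r:]` play the roles of $r{:}G$ and $G{:}(len(G) - r)$.

```python
from functools import lru_cache

GAP = "-"


def score(a, b):
    """Hypothetical label scoring: match 1, mismatch -1, indel -1."""
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else -1


@lru_cache(maxsize=None)
def sigma_fa(F, G):
    """Forest alignment similarity of forests F and G (tuples of trees)."""
    if not F and not G:
        return 0
    if not G:  # only deletions remain
        (a, down), right = F[0], F[1:]
        return score(a, GAP) + sigma_fa(down, ()) + sigma_fa(right, ())
    if not F:  # only insertions remain
        (b, down), right = G[0], G[1:]
        return score(GAP, b) + sigma_fa((), down) + sigma_fa((), right)

    (a, f_down), f_right = F[0], F[1:]
    (b, g_down), g_right = G[0], G[1:]

    relabel = score(a, b) + sigma_fa(f_down, g_down) + sigma_fa(f_right, g_right)
    # Deletion of the first root of F: its children align with the first r
    # trees of G, its right siblings with the remaining trees of G.
    delete = score(a, GAP) + max(
        sigma_fa(f_down, G[:r]) + sigma_fa(f_right, G[r:])
        for r in range(len(G) + 1)
    )
    # Insertion of the first root of G: symmetric to the deletion case.
    insert = score(GAP, b) + max(
        sigma_fa(F[:r], g_down) + sigma_fa(F[r:], g_right)
        for r in range(len(F) + 1)
    )
    return max(relabel, delete, insert)
```

Every recursive call strictly reduces the total number of nodes, so the
recursion terminates; the memoization table grows with the number of distinct
subforest pairs that are reached.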
Tabulation
For a transparent description of my algorithms, I use a two-stage mapping
$\phi_F \circ \iota_F$. The function $\iota_F$ provides a mapping from csfs of
$F$ to index pairs and allows for efficient transitions from a csf to its
relevant subforests. The function $\phi_F$ maps these index pairs to linear
table indices. In this way, I reduce table dimension and space consumption in
practice. For any non-empty csf $F'$ of $F$, I define
\[
\iota_F(F') = (pre_F(F'), len(F')). \tag{3.4}
\]
The empty forest is represented ambiguously by any index pair $(i, 0)$. If
$(i, j)$ is an index pair representing a csf, then $i$ is called the node index
and $j$ the length index.
[Figure 3.3: This figure illustrates the closed subforest index pair
representation on a forest with nodes 1 to 10 in preorder. Labels in the figure:
$\iota_F(F') = (3, 3)$, $\iota_F(F'^{\downarrow}) = (4, 3) = (3 + 1, noc_F[3])$,
$\iota_F(F'^{\rightarrow}) = (7, 2) = (rb_F[3], 3 - 1)$. The transitions of the
csf $F'$ to $F'^{\downarrow}$ and $F'^{\rightarrow}$ [...]]

- $\iota_F(F'^{\downarrow}) = (i + 1, noc_F[i])$. If $F[i]$ is not a leaf,
  $i + 1$ is the index of the leftmost child of $F[i]$ and
  $noc_F[i] = len(F'^{\downarrow})$. If $F[i]$ is a leaf, the resulting csf is
  the empty forest represented by $(i + 1, noc_F[i]) = (i + 1, 0)$.
- $\iota_F(F'^{\rightarrow}) = (rb_F[i], j - 1)$. If $F[i]$ has a right brother,
  this is quite obvious. Otherwise $j - 1 = 0$ and the resulting forest is empty.
Clearly, $\iota_F(F'^{\downarrow})$ and $\iota_F(F'^{\rightarrow})$ [...]
$\iota_F(F')$. Splitting $F'$ into $r{:}F'$ and $F'$ [...] such that
$S_4(\iota_F(F'), \iota_G(G'))$ is the similarity of csfs $F'$ and $G'$ [...]
\[
\phi_F(i, j) =
\begin{cases}
0 & \text{if } j = 0 \\
offset_F[i] + j & \text{otherwise}
\end{cases} \tag{3.5}
\]
Table $offset_F$ can be precomputed in $O(|F|)$ time and space. I define the
right inverse $\phi_F^{-1}$ of $\phi_F$ by $\phi_F^{-1}(0) = (1, 0)$ and
$\phi_F^{-1}(\phi_F(i, j)) = (i, j)$ for $\phi_F(i, j) \neq 0$.
I combine the previous ideas to give a dense tabulating algorithm calculating
global forest alignment similarity. I compute matrix $S$ defined by
\[
S(\phi_F(\iota_F(F')), \phi_G(\iota_G(G'))) = \sigma_{FA}(F', G') \tag{3.6}
\]
for all csfs $F'$ of $F$ and $G'$ of $G$. Thus,
$S(\phi_F(1, len(F)), \phi_G(1, len(G)))$ gives the similarity of $F$ and $G$.

Footnote 2: It cannot eliminate entries that correspond to pairs of csfs that
are not relevant due to Lemma 3.1.

\[
S(x, y) =
\begin{cases}
0 & \text{if } x = 0 \text{ and } y = 0 \quad (1) \\
\sigma(lb_F[i], -) + S(\phi_F(i + 1, noc_F[i]), 0) + S(\phi_F(rb_F[i], j - 1), 0) & \text{if } x > 0 \text{ and } y = 0 \quad (2) \\
\sigma(-, lb_G[k]) + S(0, \phi_G(k + 1, noc_G[k])) + S(0, \phi_G(rb_G[k], l - 1)) & \text{if } x = 0 \text{ and } y > 0 \quad (3) \\
\max \{ relabel(x, y),\; delete(x, y),\; insert(x, y) \} & \text{otherwise} \quad (4)
\end{cases}
\]
where $(i, j) = \phi_F^{-1}(x)$ and $(k, l) = \phi_G^{-1}(y)$, and
\[
\begin{aligned}
relabel(x, y) ={}& \sigma(lb_F[i], lb_G[k]) \\
 &+ S(\phi_F(i + 1, noc_F[i]), \phi_G(k + 1, noc_G[k])) \\
 &+ S(\phi_F(rb_F[i], j - 1), \phi_G(rb_G[k], l - 1)) \\
delete(x, y) ={}& \sigma(lb_F[i], -) \\
 &+ \max_{0 \le r \le l} \bigl\{ S(\phi_F(i + 1, noc_F[i]), \phi_G(k, r)) + S(\phi_F(rb_F[i], j - 1), \phi_G(rb_G^{r}[k], l - r)) \bigr\} \\
insert(x, y) ={}& \sigma(-, lb_G[k]) \\
 &+ \max_{0 \le r \le j} \bigl\{ S(\phi_F(i, r), \phi_G(k + 1, noc_G[k])) + S(\phi_F(rb_F^{r}[i], j - r), \phi_G(rb_G[k], l - 1)) \bigr\}
\end{aligned}
\]
Figure 3.4: The recurrences for $S$. The Cases (1)-(3) of the recurrences
involving empty forests are obvious. The similarity of two non-empty forests is
determined by the maximum score for alignments $A$ that have a relabeling, or a
deletion, or an insertion at the root. The functions relabel, delete, and insert
reflect the case distinction in Lemma 3.1.

The recurrences for $S(\phi_F(i, j), 0)$ depend on entries
$S(\phi_F(i + 1, noc_F[i]), 0)$ and $S(\phi_F(rb_F[i], j - 1), 0)$. That is,
either the node index strictly increases, or if $rb_F[i] = 0$, then $j = 1$ and
hence $\phi_F(rb_F[i], j - 1) = 0$. If $\phi_G(k, l) > 0$, then the corresponding
holds for $S(0, \phi_G(k, l))$. If $\phi_F(i, j) > 0$ and $\phi_G(k, l) > 0$ in
the delete case (Lemma 3.1 Case 2), $S(\phi_F(i, j), \phi_G(k, l))$ depends on
$S(\phi_F(i + 1, noc_F[i]), \phi_G(k, r))$ and
$S(\phi_F(rb_F[i], j - 1), \phi_G(rb_G^{r}[k], l - r))$ for some $r \in [0, l]$.
Thus, either the node index increases strictly, or the length index decreases.
The corresponding holds for the insert case (Case 3). Thus, $S$ [...] of $F$ and
$G$, while fulfilling the data dependencies that were just discussed.
The iteration over the length index makes use of a table $maxcsflen_F$, defined
by:
\[
maxcsflen_F[i] = \max\{ j \mid (i, j) \text{ is a csf of } F \}. \tag{3.7}
\]
The table $maxcsflen_F$ can be precomputed in $O(|F|)$ time and space, since
\[
maxcsflen_F[i] =
\begin{cases}
1 & \text{if } rb_F[i] = 0 \\
1 + maxcsflen_F[rb_F[i]] & \text{otherwise.}
\end{cases}
\]
The corresponding holds for table $maxcsflen_G$. An optimal alignment can be
obtained by backtracking. To facilitate this, the split position $r$ should be
stored with each optimal value resulting from a deletion or an insertion.
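The precomputation of the tables can be sketched as follows. The Python code
derives the preorder tables lb, noc and rb from a nested forest, computes
maxcsflen with the recurrence given above, and linearizes index pairs as in
Equation (3.5). How the offset table is filled is not spelled out in this
section; the running sum of maxcsflen used here is an assumption that yields
distinct indices, and all function names are hypothetical. Tables are 1-based,
index 0 is unused, and a right brother index of 0 means that there is none.

```python
def forest_tables(forest):
    """Preorder tables lb, noc, rb for a forest of (label, children) tuples."""
    lb, noc, rb = [None], [0], [0]

    def visit(siblings):
        prev = 0
        for label, children in siblings:
            lb.append(label)
            noc.append(len(children))
            rb.append(0)
            idx = len(lb) - 1
            if prev:
                rb[prev] = idx  # idx is the right brother of prev
            prev = idx
            visit(children)

    visit(forest)
    return lb, noc, rb


def max_csf_len(rb):
    """maxcsflen[i]: length of the longest csf whose node index is i."""
    n = len(rb) - 1
    table = [0] * (n + 1)
    for i in range(n, 0, -1):  # a right brother always has a larger index
        table[i] = 1 if rb[i] == 0 else 1 + table[rb[i]]
    return table


def offsets(maxlen):
    """Assumed construction of the offset table: running sum of maxcsflen."""
    off, total = [0] * len(maxlen), 0
    for i in range(1, len(maxlen)):
        off[i] = total
        total += maxlen[i]
    return off


def linear_index(off, i, j):
    """Equation (3.5): index 0 is reserved for the empty forest."""
    return 0 if j == 0 else off[i] + j
```

With this choice of offsets, the non-empty csfs of a forest are numbered
consecutively starting at 1, so a table row or column needs exactly one entry
per csf.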
Input: Forests $F$ and $G$, given by tables
       $lb_F$, $lb_G$, $noc_F$, $noc_G$, $rb_F$, $rb_G$, $offset_F$, $offset_G$,
       $maxcsflen_F$, $maxcsflen_G$
Output: $\sigma_{FA}(F, G)$ stored at
        $S(\phi_F(i, maxcsflen_F[i]), \phi_G(k, maxcsflen_G[k]))$

 1  $S(0, 0) \leftarrow 0$
 2  for $i \leftarrow |F|$ to 1 do
 3      for $j \leftarrow 1$ to $maxcsflen_F[i]$ do Calculate $S(\phi_F(i, j), 0)$
 4  end
 5  for $k \leftarrow |G|$ to 1 do
 6      for $l \leftarrow 1$ to $maxcsflen_G[k]$ do Calculate $S(0, \phi_G(k, l))$
 7  end
 8  for $i \leftarrow |F|$ to 1 do
 9      for $k \leftarrow |G|$ to 1 do
10          for $j \leftarrow 1$ to $maxcsflen_F[i]$ do
11              Calculate $S(\phi_F(i, j), \phi_G(k, maxcsflen_G[k]))$ as in Figure 3.4
12          end
13          for $l \leftarrow 1$ to $maxcsflen_G[k]$ do
14              Calculate $S(\phi_F(i, maxcsflen_F[i]), \phi_G(k, l))$ as in Figure 3.4
15          end
16      end
17  end

Algorithm 3.1: Algorithm for the calculation of global forest alignment
similarity $\sigma_{FA}$.
Efficiency Analysis
Time Efficiency According to the recurrences in Figure 3.4, each S(x, y)
is calculated in O(degree(F) + degree(G)) time. Elements of table maxcsflen_F
and table maxcsflen_G are bounded by degree(F) and degree(G), respectively.
Thus, the initialization steps in Lines 2-4 and Lines 5-7 require
O(|F| · degree(F)^2) and O(|G| · degree(G)^2) time. The main loop structure,
ranging from Line 8 to 17, requires
O(|F| · |G| · (degree(F) · (degree(F) + degree(G)) + degree(G) · (degree(F) + degree(G))))
time. After rearrangement of the terms, the overall time complexity of
Algorithm 3.1 is O(|F| · |G| · (degree(F) + degree(G))^2).
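The rearrangement in the last step can be written out explicitly. The abbreviations d_F = degree(F) and d_G = degree(G) are introduced here only for brevity; the second line additionally assumes that both forests are non-empty.

```latex
\begin{align*}
d_F\,(d_F + d_G) + d_G\,(d_F + d_G) &= (d_F + d_G)^2, \\
|F|\,d_F^2 + |G|\,d_G^2 &\le 2\,|F|\,|G|\,(d_F + d_G)^2
  \qquad\text{for } |F|, |G| \ge 1 .
\end{align*}
```

The first line gives the cost per pair of nodes (i, k) in the main loop; the second shows that the two initialization phases are dominated by the main loop, so the overall bound is O(|F| · |G| · (d_F + d_G)^2).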
72 Algorithms for Global and Local Forest Similarity
Space Efficiency The size of the table S
$\sum_{i=1}^{k} i = \frac{k(k+1)}{2}$ different, non-empty csfs where the node
index is an element of the sequence. For all $p = \frac{n}{k}$ sequences of
sibling nodes the total number of csfs is $\frac{n(k+1)}{2}$. Adding the empty
forest gives $\frac{n(k+1)}{2} + 1$. Assume there exist non-empty sequences of
sibling nodes S and S' with |S| + |S'| = k. Since
$\sum_{i=1}^{|S|} i + \sum_{i=1}^{|S'|} i < \sum_{i=1}^{|S|+|S'|} i$,
the number of csfs is less than $\frac{n(k+1)}{2}$. This argument holds
recursively if there exist more than two sets of siblings. Clearly, if
$n \neq p \cdot k$ there exists a sequence of sibling nodes of size less than k
and the number of csfs is less than $\frac{n(k+1)}{2} + 1$. Thus,
$\frac{n(k+1)}{2} + 1$ is an upper bound.
Now, I consider the lower bound. If the forest is a tree that has exactly
one leaf (there is no branching node) then there are n sequences of sibling
nodes, each of length 1. Hence, the number of closed subforests is n. Adding
the empty forest results in n + 1. Every other tree contains at least one set
of siblings with more than one element which increases the number of closed
subforests. Thus, n + 1 is a lower bound.
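The counting argument can be checked on small examples. The following sketch assumes a hypothetical nested-list encoding in which a forest is a list of trees and each tree is the list of its child trees (labels are irrelevant for counting). It uses the fact from the proof that a sibling sequence of length m contributes m(m+1)/2 non-empty csfs.

```python
def count_csfs(forest):
    """Number of closed subforests of `forest`, including the empty forest.

    A forest is a list of trees; a tree is the list of its child trees.
    """
    def nonempty(siblings):
        m = len(siblings)
        total = m * (m + 1) // 2          # csfs starting within this sequence
        for children in siblings:          # recurse into each node's children
            total += nonempty(children)
        return total

    return 1 + nonempty(forest)            # +1 for the empty forest


def size(forest):
    """Number of nodes in the forest."""
    return sum(1 + size(children) for children in forest)


chain = [[[[]]]]            # a tree with three nodes and exactly one leaf
leaves = [[], [], [], []]   # a forest of four leaves (one sibling sequence, k = 4)
```

For the chain, the count equals n + 1, the lower bound; for the forest of four leaves, it equals n(k+1)/2 + 1 with n = k = 4, the upper bound.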
From Theorem 3.1, it follows that Algorithm 3.1 runs in
O(|F| · degree(F) · |G| · degree(G)) space. Note that the asymptotic space
complexity is not reduced by using dense two-dimensional tables. However, the
space reduction can be huge in practice, as measured in Section 4.8.2.
Reducing the Search Space of Forest Alignments
The number of forest alignments grows exponentially with the number of
nodes in the tree, which is clear since the forest alignment model is a
generalization of the sequence alignment model. In fact, the number of forest
alignments exceeds the number of sequence alignments, because a forest
alignment is constructed from more combinations of problems. There are
two embeddings of a sequence in a forest. In the first, the vertical
embedding, the parent-child relation of a forest corresponds to the sequence.
In the second, the horizontal embedding, the sibling order reflects the
sequence, i.e. a sequence corresponds to a forest consisting solely of leaves.
I now observe how the forest alignment model treats the two embeddings.
Assume a forest F, that begins as a sequence at F
(horizontally or
vertically), and some forest G. The general definition of the search space
(see Lemma 3.1) considers in Case 2 and Case 3 alignments that differ only
in the split position of G, where both parts are aligned with the empty forest
(because F begins as a sequence). In particular, the score of an alignment is
built additively from the scores of the edit operations relabel, delete, insert.
If F
FA
(F
, G)
FA
(F
, r:G) +
FA
(F
FA
(F
, G)
FA
(F
, r:G) +
FA
(F
), )
+
_
FA
(F
, G) if F
has no child
FA
(F
, G) if F
, r:G)
+
FA
(F
, G:(len(G) k))
otherwise
insert(F, G) = (, label (G
))
+
_
FA
(F, G
) if G
has no child
FA
(F, G
) if G
)
+
FA
(F:(len(F) k), G
)
otherwise
(a)
delete(x, y) = (lb
F
[i], )
+
_
_
S
(
F
(i + 1, noc
F
[i]),
G
(k, l)) if rb
F
[i] = 0
S
(
F
(rb
F
[i], j 1),
G
(k, l) if noc
F
[i] = 0
max
0rl
_
S
(
F
(i + 1, noc
F
[i]),
G
(k, r))
+S
(
F
(rb
F
[i], j 1),
G
(rb
r
G
[k], l r))
otherwise
insert(x, y) = (, lb
G
[k])
+
_
_
S
(
F
(i, j),
G
(k + 1, noc
G
[k])) if rb
G
[k] = 0
S
(
F
(i, j),
G
(rb
G
[k], l 1) if noc
G
[k] = 0
max
0rj
_
S
(
F
(i, r),
G
(k + 1, noc
G
[k]))
+S
(
F
(rb
r
F
[i], j r),
G
(rb
G
[k], l 1))
otherwise
where (i, j) =
1
F
(x) and (k, l) =
1
G
(y)
(b)
Figure 3.5: (a) shows improved recurrence relations for the functions delete and
insert in search space notation. (b) shows the matrix recurrences. The tables
rb_F, noc_F, rb_G and noc_G can be utilized to check whether a node has a child or
brother node or not.
have interesting applications in the comparison of RNA structures (see
Section 4.6 and Section 7.1).
Local similarity means finding the maximal similarity between two
substructures. If these substructures are extended, the score decreases. This
requires a scoring scheme that balances positive and negative scoring
contributions. Otherwise, the similarity of the complete structures would always
achieve the maximum score. It is generally assumed that an alignment of two
empty structures scores zero. A localized variant of distance makes no sense,
as empty forests always have the lowest possible distance of zero. A similar
problem studied for distance based alignments is the similar consensus
problem (see Section 2.5.4).
The question is, what are the substructures in a forest? A substring
of a string is a prefix of a suffix, and local similarity on strings means the
highest similarity over all pairs of substrings. The problem of finding most
similar (complete) suffixes is not of great interest in the domain of strings.
Moving from strings to forests, local similarity problems come in a greater
variety: On trees, the counterpart of a suffix is a subtree. Finding the most
similar subtrees is an interesting problem, and for forests it generalizes to
the problem of finding the most similar closed subforests. Continuing the
analogy, the prefix of a (sub)tree T is a tree T
and G
\[
\sigma_{CSF}(F, G) = \max\{\, \sigma_{FA}(F', G') \mid F' \text{ is a csf of } F \text{ and } G' \text{ is a csf of } G \,\}. \tag{3.10}
\]
3.4 Local Similarity in Forests 77
For the calculation of global similarity, Algorithm 3.1 calculates the global
similarity between all relevant pairs of closed subforests due to Lemma 3.2.
This algorithm can be easily modified to calculate the similarity between all
combinations of closed subforests by modifying the loop structure.
Iterating over all possible combinations of closed subforests as in
Algorithm 3.2 is consistent with the dependencies of the recurrences in
Figure 3.4 and Figure 3.5.
Input: Forests F and G, given by tables
       lb_F, lb_G, noc_F, noc_G, rb_F, rb_G, offset_F, offset_G,
       maxcsflen_F, maxcsflen_G
Output: σ_CSF(F, G) is the maximum value stored in S

 1  S(0, 0) ← 0
 2  for i ← |F| to 1 do
 3      for j ← 1 to maxcsflen_F[i] do Calculate S(F(i, j), 0)
 4  end
 5  for k ← |G| to 1 do
 6      for l ← 1 to maxcsflen_G[k] do Calculate S(0, G(k, l))
 7  end
 8  for i ← |F| to 1 do
 9      for k ← |G| to 1 do
10          for j ← 1 to maxcsflen_F[i] do
11              for l ← 1 to maxcsflen_G[k] do
12                  Calculate S(F(i, j), G(k, l))
13              end
14          end
15      end
16  end

Algorithm 3.2: Algorithm for the calculation of local closed subforest
alignment similarity σ_CSF.
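To illustrate how the two loop structures differ, the following sketch counts the "Calculate" steps that the main loops of Algorithm 3.1 and Algorithm 3.2 perform for given tables of maximal csf lengths. The function names and the list layout (slot 0 unused) are assumptions made for this illustration; the counts are loop iterations, not distinct table entries, since Algorithm 3.1 visits the entry with j and l both maximal twice.

```python
def main_loop_steps_global(max_len_f, max_len_g):
    """Calculate steps in the main loop of Algorithm 3.1.

    For each node pair (i, k): one step per length j of F, plus one step
    per length l of G.
    """
    return sum(a + b for a in max_len_f[1:] for b in max_len_g[1:])


def main_loop_steps_local(max_len_f, max_len_g):
    """Calculate steps in the main loop of Algorithm 3.2.

    For each node pair (i, k): one step per combination of lengths (j, l).
    """
    return sum(a * b for a in max_len_f[1:] for b in max_len_g[1:])


# Hypothetical input: both forests consist of five leaves, so the maximal
# csf lengths of nodes 1..5 are 5, 4, 3, 2, 1.
lengths = [0, 5, 4, 3, 2, 1]
steps_global = main_loop_steps_global(lengths, lengths)
steps_local = main_loop_steps_local(lengths, lengths)
```

The gap between the two counts widens as sibling sequences get longer, which reflects the additive versus multiplicative dependence on the node degrees.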
Algorithm 3.2 tabulates σ_FA(F', G') for all closed subforests F' and G' of
F and G. Thus, scanning the matrix S for its maximum value yields
σ_CSF(F, G). An optimal alignment can be obtained by backtracking.
Space and Time Complexity I apply the same tabulation technique as
for the calculation of global similarity. Hence, the space complexity is the
same as for Algorithm 3.1, O(|F| · degree(F) · |G| · degree(G)). However, now
each element of S is calculated in O(degree(F) + degree(G)) time. Since each
entry in S is calculated exactly once, the overall time complexity of
Algorithm 3.2 depends on the size of S
of G. That is,
\[
\sigma_{SIL\,CSF}(F, G) = \max\{\, \sigma_{FA}(F, G') \mid G' \text{ is a csf of } G \,\}. \tag{3.11}
\]
Algorithm 3.2 calculates the similarity between all closed subforests and,
since F is also a csf , also between F and all csfs of G. Thus, scanning the
matrix S
is the only
difference. The time complexity could be further reduced since, in analogy to
the global similarity, not each pair of closed subforests is a relevant subforest.
A combination of Algorithm 3.1 and Algorithm 3.2 that has a loop structure
as in Algorithm 3.1 for forest F and as in Algorithm 3.2 for forest G would
reduce the time complexity. I do not provide a formal analysis since the time
improvement relies on the smaller of both forests and Algorithm 3.2 yields
good practical runtime (see Section 4.8.2).
3.4.3 Suboptimal Solutions
There may be more than one highly similar region in two forests, or a
smaller forest may appear more than once in a larger one. If so, all solutions
above a certain threshold of similarity are of interest. To avoid redundancy,
alignments that intersect with previously reported solutions are
excluded. I define csfs F
and F
to be the root
node of the rightmost tree in a forest F. The forest [F[ denotes the forest
F omitting the leftmost and the rightmost tree.
4.2 The Extended Forest Representation of RNA Secondary Structures
The extended forest representation extends the natural tree representation
of RNA secondary structures² such that base-pairs are represented by three
connected nodes: The pair-node, for short P-node, stands for a base-pair
bond and is labeled with P. Its children nodes are ordered according to the
5′ to 3′ ordering of bases and the leftmost and the rightmost child are the
bases that pair. A node that is not a P-node is a base-node, for short B-node,
and is labeled with one of the bases A, C, G, U. Note that the children nodes
of a P-node can be P-nodes except for the leftmost and the rightmost child.
Hence, a P-node is always an internal node, whereas a B-node is always
a leaf. In this sense, the structure given by the P-nodes is imposed on the
sequence of B-nodes. Figure 4.2 gives an example of the extended RNA forest
representation, which is in the flavor of parse trees for context-free grammars
that describe RNA secondary structures [37, 168].
² Precisely, the forest counterpart.
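The construction can be sketched as a small parser that turns a sequence and its dot-bracket (Vienna) string into the extended forest representation. This is an illustrative sketch under stated assumptions: a node is encoded as a hypothetical (label, children) tuple, the structure is free of pseudoknots, and the function name is not taken from the thesis.

```python
def extended_forest(sequence, structure):
    """Build the extended forest representation from a dot-bracket string.

    Each node is a (label, children) tuple. A base-pair becomes a P-node
    whose leftmost and rightmost children are the two pairing bases.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure differ in length")
    stack = [[]]                            # stack[0] is the top-level forest
    for base, symbol in zip(sequence, structure):
        if symbol == "(":
            stack.append([(base, [])])      # leftmost child: 5' pairing base
        elif symbol == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure")
            children = stack.pop()
            children.append((base, []))     # rightmost child: 3' pairing base
            stack[-1].append(("P", children))
        elif symbol == ".":
            stack[-1].append((base, []))    # unpaired base: B-node leaf
        else:
            raise ValueError("unexpected symbol: " + symbol)
    if len(stack) != 1:
        raise ValueError("unbalanced structure")
    return stack[0]


hairpin = extended_forest("GAAAC", "(...)")
```

In the result, every P-node is an internal node with at least two children and every B-node is a leaf, as stated above.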
Figure 4.2: (a) shows the natural tree representation of structure (b), and (c)
shows the extended forest representation of structure (b) (in this case a tree).
86
Pairwise Comparison of RNA Secondary Structures in the Forest Alignment
Model
Figure 4.3: (a) and (c) show the extended forest representations of the
structures shown in Figure 4.1(a) and 4.1(c), respectively. The alignment (b),
which is optimal for a scoring scheme that favors base matches, accounts for the
base-pair breaking by the deletion of a P-node, a match of the conserved base,
and a mismatch of the mutated base. Thus, the scoring contribution is the sum of
the scores for the deletion of the P-node, the match (C, C), and the mismatch
(G, U).
4.3 The Welformed Alignment Model
An alignment of forests in the extended forest representation is a
simultaneous alignment of sequence and structure. Sequence and structure
alignment mutually improve the quality of the whole sequence-structure
alignment. Figure 4.3 shows how a forest alignment using the extended forest
representation can account for the differences of the structures studied in
Figure 4.1.
However, a straightforward application of the classical forest alignment
model (see Section 3.2) to the extended forest representation of RNA
secondary structures results in the following problems:
1. A match of a P-node with a B-node cannot be interpreted as an edit
operation on RNA structures.
2. It is not guaranteed by the model that, once a P-node is matched, the
corresponding paired bases are relabeled by each other. See Figure 4.4
for an example of an alignment that should be avoided.
3. The score of a base-pair replacement is built from the independent
scores of a P-node match and the matches (or replacements) of the
paired bases. Thus, an empirically derived scoring scheme based on
base-pair substitution frequencies as proposed by Klein & Eddy cannot be
used [103].

Figure 4.4: Consider a feasible scoring scheme that gives better scores to base
matches than to mismatches and indels. (c) shows an alignment of (a) and (b) that
is not welformed, i.e. the base-pairs GU and GC are not aligned on the sequence
level, though the corresponding P-nodes are matched. (d) shows a welformed tree
alignment of the same structures where paired bases are aligned to each other.
Case 1 can be easily avoided by assigning an infinite negative score to a
relabeling of a P-node with a B-node. The limitations explained in Case 2
and Case 3 result from the following fact: A base-pair replacement is not an
elementary edit operation in the tree edit model but affects three nodes in the
extended forest representation. In analogy to the base-pair replace operation
in the general edit model for RNA secondary structures (see Section 2.5.5), I
introduce a base-pair replace operation for the extended forest representation.
I extend Tai's edit model (see Section 2.5.3) by introducing a new edit operation
for the P-nodes in the extended forest representation. Intuitively, this means
that whenever a P-node is matched, the corresponding paired bases are relabeled
by each other. I refine the relabel function by introducing two new edit
operations that can be applied to either P-nodes or B-nodes:
basepair relabel: Two P-nodes are matched and their leftmost and
rightmost children are relabeled by each other.
base relabel: The label of a B-node is replaced by another, possibly the
same, B-node label.
An alignment that results from this extended edit model is called a welformed
alignment. It is obvious that the following holds:
An alignment A of forests F and G is a welformed forest alignment iff
for all relabeled P-nodes in A the corresponding B-nodes are relabeled in A.
(4.1)
The welformed tree alignment similarity σ_WFA is defined as:
\[
\sigma_{WFA}(F, G) = \max\{\, \sigma(A) \mid A \text{ is a welformed forest alignment of } F \text{ and } G \,\}. \tag{4.2}
\]
4.4 A Global RNA Secondary Structure Alignment Algorithm
4.4.1 The Search Space of RNA Secondary Structure Alignments
Remember the observations concerning the search space of forest alignments
in Lemma 3.1. Clearly, the search space definitions of Case 2 (delete) and
Case 3 (insert) are not affected by the constraints that make an alignment a
welformed alignment. Case 1 treats the relabeling of nodes and I refine this
case to be consistent with the definition of welformed forest alignments. The
following case analysis replaces Case 1 in Lemma 3.1, resulting in a definition
of the search space for welformed forest alignments.
Lemma 4.1. Let A be a welformed alignment of forests F and G in the
extended forest representation such that (a, b) = label (A
) and a, b , . Let
(c, d) = A
and (e, f) = A
.
1. if label (A
[F
[ [G
[
[A
[
P P
(P, P)
(c, d) (e, f)
c
d
e f
Figure 4.5: Search space implications of Lemma 4.1 are indicated by the lines
connecting alignment A and forests F and G. The lines connecting [A
[, [F
[
and [G
[ are omitted.
label (F
) = P and label (G
) = P,
c = F
, d = F
, e = G
and f = G
,
[A
[ is an alignment of [F
[ and [G
[.
2. if label (A
) ,= (P, P) then A
is an alignment of F
and G
.
Proof. Case 1 follows directly from the denition of welformed alignments.
Case 2 handles the relabeling of B-nodes and is a special case of Case 1 in
Lemma 3.1. In this case, F
and G
F[, [
G[)
which are not relevant due to Lemma 3.2 and, hence, are not calculated by
Algorithm 3.1. The following lemma identifies the relevant subforests for the
calculation of σ_WFA(F, G).
Lemma 4.2. The pairs of subforests that are relevant for the calculation of
WFA
(F, G) due to the combined recurrences in the Figures 3.2 and 4.6 have
the form (
F:j,
k
]
G[
l
), (
i
]
F[
j
,
G:l), ([
F[:j,
k
]
G[
l
), and (
i
]
F[
j
, [
G[:l) where
F
and
G are maximal closed subforests of F and G, respectively.
Proof. I consider all possible transitions to subforests resulting from the def-
inition of the search space of welformed forest alignments. These are:
Case 1 (from Lemma 4.1): ([F
[, [G
[), (F
, G
)
Case 2 (from Lemma 3.1): (F
, k:G), (F
, G:l)
Case 3 (from Lemma 3.1): (i:F, G
), (F:j, G
)
Referring to the transition graph in the proof of Lemma 3.2, all pairs of sub-
forests (
F:j,
k
]
G[
l
) and (
i
]
F[
j
,
G:l) can be reached by transitions to subforests,
even though there is no (F
, G
[, [G
[)
results in pairs of forests of the form ([
F[, [
F[:j,
k
]
G[
l
) and (
i
]
F[
j
, [
_
(label (F
), label (G
))
+(label (F
), label (G
))
+(label (F
), label (G
))
+
WFA
([F
[, [G
[)
+
WFA
(F
, G
)
if label (F
) = P and label (G
) = P
(label (F
), label (G
))
+
WFA
(F
, G
)
otherwise
(a)
\[
relabel(x, y) =
\begin{cases}
\sigma(lb_F[i], lb_G[k]) \\
\quad + \sigma(lb_F[i+1], lb_G[k+1]) \\
\quad + \sigma(lb_F[rmb_F[i]], lb_G[rmb_G[k]]) \\
\quad + S(F(rb_F[i+1], noc_F[i] - 2), G(rb_G[k+1], noc_G[k] - 2)) \\
\quad + S(F(rb_F[i], j - 1), G(rb_G[k], l - 1))
  & \text{if } lb_F[i] = P \text{ and } lb_G[k] = P \\[1ex]
\sigma(lb_F[i], lb_G[k]) \\
\quad + S(F(rb_F[i], j - 1), G(rb_G[k], l - 1))
  & \text{otherwise}
\end{cases}
\]
where (i, j) and (k, l) are the csfs of F and G that are stored at the table
indices x and y, respectively.
(b)
Figure 4.6: Recurrences for the refined relabel function for welformed forest
alignments resulting from the search space observations in Lemma 4.1. (a) shows
the recursive formula in search space notation. (b) shows the corresponding
matrix recurrences using the tabulation technique introduced in Section 3.3.2.
The Order of Calculation
A straightforward solution would be to use Algorithm 3.2 with the refined
relabel recurrences in Figure 4.6 (b). However, for the calculation of global
similarity this would increase the time complexity unnecessarily since
Algorithm 3.2 calculates the similarity between all pairs of closed subforests.
Algorithm 4.1 calculates only the closed subforests that are relevant for the
calculation of the global welformed forest alignment similarity. This
algorithm is obtained by including the missing calculations for the welformed
alignment model in Algorithm 3.1.
Efficiency Analysis
The space and time efficiency for the calculation of σ_WFA(F, G) is the same
as for the corresponding version of the classical forest alignment
similarity (see Section 3.3.2). This is obvious since the refined relabel
operation can be calculated in constant time, the loop structure of
Algorithm 3.1 and Algorithm 4.1 is the same, and the same tabulation technique
is used. Restating the complexity results, the time complexity for the
calculation of σ_WFA(F, G) is O(|F| · |G| · (degree(F) + degree(G))^2) and the
space complexity is O(|F| · degree(F) · |G| · degree(G)). In Section 4.8.2, I
will observe how the critical parameters, size and degree, scale for the
extended forest representation of RNA secondary structures.
4.5 Scoring Schemes
The extended forest representation is suitable to score both structure and
sequence similarity. The structural edit operations affect the P-nodes and
sequence edit operations affect the B-nodes. I will present two scoring schemes:
First, a pure structure based scoring where sequence information is neglected
or only contributes marginally to the overall score, such that structure is
dominating. The scoring contribution of a base-pair is built from the
independent scores of the aligned P-node and B-nodes. Second, I employ a
scoring scheme based on empirically derived substitution scores for aligned
bases and base-pairs, the RIBOSUM score. In this scoring scheme, the
aligned bases in a base-pair replacement are considered simultaneously.

Input: Forests F and G, stored as tables
       lb_F, lb_G, noc_F, noc_G, rb_F, rb_G, offset_F, offset_G,
       maxcsflen_F, maxcsflen_G, rmb_F, rmb_G
Output: σ_WFA(F, G) stored at
       S(F(i, maxcsflen_F[i]), G(k, maxcsflen_G[k]))

 1  S(0, 0) ← 0
 2  for i ← |F| to 1 do
 3      for j ← 1 to maxcsflen_F[i] do Calculate S(F(i, j), 0)
 4  end
 5  for k ← |G| to 1 do
 6      for l ← 1 to maxcsflen_G[k] do Calculate S(0, G(k, l))
 7  end
 8  for i ← |F| to 1 do
 9      for k ← |G| to 1 do
10          for j ← 1 to maxcsflen_F[i] do
11              Calculate S(F(i, j), G(rb_G[k], maxcsflen_G[k] − 1))
12              Calculate S(F(i, j), G(k, maxcsflen_G[k]))
13          end
14          for l ← 1 to maxcsflen_G[k] do
15              Calculate S(F(rb_F[i], maxcsflen_F[i] − 1), G(k, l))
16              Calculate S(F(i, maxcsflen_F[i]), G(k, l))
17          end
18      end
19  end

Algorithm 4.1: Algorithm for the calculation of welformed global forest
alignment similarity σ_WFA. The calculation of elements of S includes the
recurrences in Figure 4.6 (b). In comparison to Algorithm 3.1, there are
additional computations in Line 11 and Line 15 for the similarity calculations
of the pairs ([F[:j, k]G[l) and (i]F[j, [G[:l). Note that if, e.g. in Line 11,
k has no right brother, the term G(rb_G[k], maxcsflen_G[k] − 1) evaluates to
G(0, 0), which is the empty forest. This calculation is not necessary since the
alignments involving the empty forest are already calculated in Lines 1-7.
However, the alternative would be a case distinction in the loop structure,
which makes the algorithm more complicated.

         A     C     G     U     P   gap
A        1
C        0     1
G        0     0     1
U        0     0     0     1
P                               10
gap    -10   -10   -10   -10    -5  n.d.

Figure 4.7: Scoring values for the scoring function σ_P. Since scoring functions
are generally considered to be symmetric, a triangle matrix is sufficient to
define the scoring function. The substitution of a gap by a gap is not defined
(n.d.) since it never happens in an alignment model.
4.5.1 Pure Structure Alignment Score
A pure structure alignment of RNA secondary structures is an alignment due
to a scoring scheme that is guided by the structure and not by the sequence.
However, it makes sense that, especially in aligned loop regions, sequence
information can be used to improve the results. Therefore, I give a positive
score to a base-match which is much smaller than the score for a base-pair
match, or precisely, the match of a base-pair bond represented by a P-node.
The score of a base-pair replacement is built from the match of two P-nodes
plus the replacement scores of the involved bases (refer to the recurrences
in Figure 4.6). Clearly, the deletion of base-pair bonds (P-nodes) and the
deletion of bases (B-nodes) should be penalized, but the deletion of a base-
pair bond should not cost as much as the deletion of a base. Based on these
considerations, I define the scoring function σ_P given by the scoring matrix
in Figure 4.7.
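The scoring matrix of Figure 4.7 can be written as a small lookup function. This is a sketch under stated assumptions: the gap is encoded as the character "-", the names are illustrative, and a relabeling between a P-node and a B-node (left blank in the matrix) is given an infinite negative score, following the remedy for Case 1 in Section 4.3.

```python
GAP = "-"                      # assumed encoding of the gap character
NEG_INF = float("-inf")


def sigma_p(a, b):
    """Pure structure score for a pair of labels from {A, C, G, U, P, gap}."""
    if a == GAP and b == GAP:
        raise ValueError("gap/gap is not defined")
    if a == GAP or b == GAP:
        other = b if a == GAP else a
        return -5 if other == "P" else -10   # P-node vs. base deletion
    if a == "P" and b == "P":
        return 10                            # match of two base-pair bonds
    if a == "P" or b == "P":
        return NEG_INF                       # P-node vs. B-node: forbidden
    return 1 if a == b else 0                # base match vs. mismatch


def basepair_replacement(left1, right1, left2, right2):
    """Score of replacing base-pair (left1, right1) by (left2, right2)."""
    return sigma_p("P", "P") + sigma_p(left1, left2) + sigma_p(right1, right2)


gc_by_gu = basepair_replacement("G", "C", "G", "U")
```

With these values a base-pair replacement is dominated by the P-node match, while the identities of the paired bases only shift the score by at most 2.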
4.5.2 RIBOSUM Scores
The scoring of sequence alignments has received much attention, and a good
scoring scheme is a prerequisite for producing biologically meaningful
alignments, especially for protein sequences. For sequences, log-odds position
independent substitution matrices were successfully applied to compute the
alignment scores. Most prominent are BLOSUM³ and PAM⁴ matrices [31, 75]. The
former is generally acknowledged to produce better results for evolutionarily
distantly related sequences.
Recently, Klein & Eddy generalized Henikoff & Henikoff's BLOSUM idea
to structural RNA, resulting in two substitution matrices: one for unpaired
bases and one for base-pairs. In analogy to the BLOSUM matrices, they
called their scoring matrices RIBOSUM (RIBOsomal RNA SUbstitution
Matrix) [103]. I now summarize the idea behind the calculation of RIBOSUM
matrices and how they can be used in my structure comparison algorithms.
The substitution scores are empirically derived from hand-crafted
high-quality alignments of the small subunit RNA from the European Ribosomal
RNA Database [32]. The scoring matrices give the log-odds ratio for
observing a given substitution relative to background nucleotide frequencies.
For single base substitutions this is a 4 × 4 matrix S given by
\[ s_{ij} = \log_2 \frac{f_{ij}}{g_i\, g_j}, \tag{4.3} \]
where i and j are the two aligned nucleotides, f_ij is the empirically observed
frequency of i aligned to j in homologous RNAs, and g_i and g_j are the
background frequencies of the individual nucleotides. For base-pair
substitutions this is a 16 × 16 matrix S' given by
\[ s'_{ijkl} = \log_2 \frac{f_{ijkl}}{g_i\, g_j\, g_k\, g_l}, \tag{4.4} \]
3
BLOck SUbstitution Matrix
4
Percent Accepted Mutations
where i is base-paired to j, k is base-paired to l, i is aligned with k, and j
is aligned with l. In this case, f_ijkl is the observed frequency of the two
base-pairs ij and kl aligned to each other in homologous RNAs. g again is the
background frequency of the individual nucleotides.
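Equations (4.3) and (4.4) translate directly into code. The sketch below uses hypothetical function names and dictionary-based inputs, and the frequencies in the example are made-up toy numbers chosen so that the ratios are powers of two; they are not values from the RIBOSUM training data.

```python
import math


def log_odds_single(f, g):
    """Scores s_ij = log2(f_ij / (g_i * g_j)) for aligned single bases.

    f maps (i, j) to the observed pair frequency, g maps a base to its
    background frequency.
    """
    return {(i, j): math.log2(f_ij / (g[i] * g[j])) for (i, j), f_ij in f.items()}


def log_odds_basepair(f, g):
    """Scores s'_ijkl = log2(f_ijkl / (g_i * g_j * g_k * g_l)).

    f maps (i, j, k, l) to the observed frequency of base-pair ij aligned
    with base-pair kl.
    """
    return {
        (i, j, k, l): math.log2(f_ijkl / (g[i] * g[j] * g[k] * g[l]))
        for (i, j, k, l), f_ijkl in f.items()
    }


# Toy input (not real data): uniform background frequencies.
g = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
single = log_odds_single({("A", "A"): 0.125, ("A", "C"): 0.03125}, g)
paired = log_odds_basepair({("G", "C", "G", "C"): 0.0078125}, g)
```

A positive score means the substitution is observed more often than expected from the background frequencies alone, a negative score means it is observed less often.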
A naive counting of the frequencies f_ij and f_ijkl could bias the substitution
scores towards overrepresented sub-families in the alignment. To eliminate
this risk, clusters of similar sequences are formed that weight the individual
sequences in the alignment, i.e. a member of a large cluster has a small
weight. A single linkage clustering technique groups sequences with a
percentage identity above some threshold x. To allow shorter evolutionary
distances than the original BLOSUM matrices, Klein & Eddy added a second
sequence identity cutoff y. Only pairs of sequences that exceed y percent
identity are counted at all. By adjusting x and y, a specific RIBOSUMx-y
matrix can be constructed. Klein & Eddy observed that the RIBOSUM85-60
matrix (see Figure 4.8) is a good ab initio choice.
The recurrences in Figure 4.6 can be easily adapted to score base-pair
substitutions. Instead of adding the scores for a P-node and the aligned
bases, a scoring function σ_BP accepts the whole base-pairs as its parameters.
In particular in Figure 4.6 (a) the terms
(label (F
), label (G
)) + (label (F
), label (G
)) + (label (F
), label (G
))
are substituted by
BP
(label (F
)label (F
), label (G
)label (G
)).
The corresponding holds for the table based recurrences in Figure 4.6 (b). A
problem that remains is to set the score for P-node and B-node deletions. I
set both to 2. However, these parameters should be empirically adjusted
for the particular application.
           A       C       G       U
A       2.22
C      -1.86    1.16
G      -1.46   -2.48    1.03
U      -1.39   -1.05   -1.74    1.65
(a)

        AA      AC      AG      AU      CA      CC      CG      CU      GA      GC      GG      GU      UA      UC      UG      UU
AA   -2.49
AC   -7.04   -2.11
AG   -8.24   -8.89   -0.80
AU   -4.32   -2.04   -5.13    4.49
CA   -8.84   -9.37  -10.41   -5.56   -5.13
CC  -14.37   -9.08  -14.53   -6.71  -10.45   -3.59
CG   -4.68   -5.86   -4.57    1.67   -3.57   -5.71    5.36
CU  -12.64  -10.45  -10.14   -5.17   -8.49   -5.77   -4.96   -2.28
GA   -6.86   -9.73   -8.61   -5.33   -7.98  -12.43   -6.00   -7.71   -1.05
GC   -5.03   -3.81   -5.77    2.70   -5.95   -3.70    2.11   -5.84   -4.88    5.62
GG   -8.39  -11.05   -5.38   -5.61  -11.36  -12.58   -4.66  -13.69   -8.67   -4.13   -1.98
GU   -5.84   -4.72   -6.60    0.59   -7.93   -7.88   -0.27   -5.61   -6.10    1.21   -5.77    3.47
UA   -4.01   -5.33   -5.43    1.61   -2.42   -6.88    2.75   -4.72   -5.85    1.60   -5.75   -0.57    4.97
UC  -11.32   -8.67   -8.87   -4.81   -7.08   -7.40   -4.91   -3.83   -6.63   -4.49  -12.01   -5.30   -2.98   -3.21
UG   -6.16   -6.93   -5.94   -0.51   -5.63   -8.41    1.32   -7.36   -7.55   -0.08   -4.27   -2.09    1.14   -4.76    3.36
UU   -9.05   -7.83  -11.07   -2.98   -8.39   -5.41   -3.67   -5.21  -11.54   -3.90  -10.79   -4.45   -3.39   -5.97   -4.28   -0.02
(b)

Figure 4.8: RIBOSUM85-60 matrix. Watson-Crick base-pair substitutions are
emphasized.
4.6 Local Similarity in RNA Structures
In Section 3.4, two local similarity algorithms for forests were presented: The
closed subforest similarity σ_CSF and the small-in-large similarity σ_SIL CSF.
Since I represent RNA secondary structures as forests, these local similarity
algorithms can be used directly to find local similarities in RNA secondary
structures. The local substructures that are considered are closed subforests
of the forest representation. Figure 4.9 explains what kind of substructures
of RNA secondary structures corresponds to closed subforests.
Algorithm 3.2 calculates both the local similarities σ_CSF and σ_SIL CSF in
RNA secondary structures in the extended forest representation. The only
difference is that the recurrences for the relabel function in welformed forest
alignment (Figure 4.6) replace the relabel function in the classic forest
alignment model (Figure 3.4). The control structure of Algorithm 3.2 remains the
same and so does the efficiency.
Figure 4.9: Closed subforests correspond to closed substructures in RNA
secondary structures. Intuitively, this means that the substructures are
contiguous and closed by hairpin loops. Closed subforests are sequences of
consecutive subtrees. In my definition, a subtree contains all edges and nodes
emanating from its root. The blue region shows a substructure in (a) that
corresponds to a closed subforest in (b). The green part of the structure is not
a closed subforest because the descending nodes are not included, i.e. it is not
closed. The red part shows a substructure that is not considered as a local
structure for the same reason. However, this is less obvious since only the U,
which is a child of the root of this subtree, is not included. If the top-level
P-node were not included in the red substructure, this part would correspond to
a closed subforest. The yellow part does not correspond to a closed subforest
since the subtrees are not consecutive siblings.
4.7 Visualization of RNA Secondary Structure Alignments
4.7.1 ASCII Representation
The ASCII representation of an RNA secondary structure alignment extends
the sequence alignment representation that arranges the aligned sequences
on top of each other. Essentially, it is an alignment of Vienna strings (see
Section 2.3) and there is a gap in the structure line iff there is a gap in the
sequence line. See the following example where a * character highlights
sequence or structure conservation:
Input
> alanine
ggggcuauagcucagcugggagagcgcuugcauggcaugcaagaggucagcgguucgaucccgcuuagcuccacca
(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
> leucine
gccgaaguggcgaaaucgguagacgcaguugauucaaaaucaaccguagaaauacgugccgguucgaguccggccu
(((((((..(((...........))).(((((.......))))).(((....)))..(((((.......)))))))
ucggcacca
)))))....
Output
alanine ggggcuauagcucagcugggag-agcgcuugcauggcaugcaagag--g---u-c
leucine gccgaaguggcgaaaucgguagacgcaguugauucaaaaucaaccguagaaauac
* * * ** * ** ** ** *** * * *** * * * *
alanine --agcgguucgaucccgcuuagcuccacca
leucine gugccgguucgaguccggccuucggcacca
******** *** * *****
alanine (((((((..((((........)-))).(((((.......)))))..--.---.-.
leucine (((((((..(((...........))).(((((.......))))).(((....)))
************ ******** ********************** *
alanine --(((((.......))))))))))))....
leucine ..(((((.......))))))))))))....
****************************
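The conservation line of such an output can be derived column by column. The sketch below assumes that a column is marked with * when both rows carry the same non-gap character; this is consistent with the example above but is an assumption about the marking rule, and the function name is hypothetical.

```python
def conservation_line(row_a, row_b, gap="-"):
    """Return a line with '*' under every conserved column of two aligned rows.

    The rows may be aligned sequences or aligned Vienna strings; they must
    have the same length.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return "".join(
        "*" if a == b and a != gap else " "
        for a, b in zip(row_a, row_b)
    )


# First 13 structure columns of the alanine/leucine example above.
marks = conservation_line("(((((((..((((", "(((((((..(((.")
```

The same function serves both the sequence block and the structure block, since the gap positions coincide in the two representations.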
4.7.2 2d-Plot
RNA secondary structures are represented graphically as circle plots, dot
plots, mountain plots or 2d-plots (refer to Section 2.3). I present a 2d-plot
variant for RNA secondary structure alignments that emphasizes both
sequence and structure similarity. I follow the strategy of using well
established layout algorithms for 2d-plots of RNA secondary structure [14, 109,
143, 179, 234]⁵. Therefore, I derive a secondary structure from a structure
alignment which is drawn and annotated further. Since bases paired in a
structure S_1 can be aligned to bases unpaired in a structure S_2, the
presentation of a common secondary structure leaves some choice. For an
alignment A of structures S_1 and S_2, I draw an RNA secondary structure
S_2-at-S_1
) is an alignment of forests F
1
, F
2
, . . . , F
n
T() i
F
i
= (A[
i
) for i [1, n]. (5.1)
Note that the labels of a multiple forest alignment are n-tuples. Figure
5.1 shows an example of an alignment of four RNA secondary structures in
the extended forest representation. As an optimization criterion, a scoring
function for multiple forest alignments is required. The sum-of-pairs score
introduced by Carrillo & Lipman [15] defines the score of each column of
a multiple sequence alignment as the sum of the pairwise scores over all
pairs of entries in the column. Let n be the number of components of a node
label, which corresponds to the number of aligned forests in a forest
alignment A. The sum-of-pairs (SP) score is defined formally as:
\[ \sigma_{SP}(A) = \sum_{v \text{ node in } A} \; \sum_{1 \le p < q \le n}
   \sigma(\mathrm{label}(v)_p, \mathrm{label}(v)_q). \tag{5.2} \]
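As a minimal sketch of Equation (5.2), the following Python snippet sums the
pairwise scores over all label components of each alignment node. The scoring
function `sigma`, its toy values and the gap symbol "-" are assumptions made
for illustration only; they are not the scoring scheme of Section 4.5.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol for this sketch


def sigma(a, b):
    """Toy pairwise score (hypothetical values): match 1, mismatch 0,
    symbol against gap -1; two aligned gaps contribute nothing."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else 0


def sp_score(node_labels):
    """Sum-of-pairs score: for every node label (an n-tuple), add the
    scores of all component pairs (p, q) with p < q."""
    return sum(sigma(a, b)
               for label in node_labels
               for a, b in combinations(label, 2))


# Three node labels of an alignment of four forests.
labels = [("G", GAP, "G", "G"), ("A", "A", "A", "A"), ("U", "A", "U", "U")]
print(sp_score(labels))
```

Each label contributes n(n-1)/2 pairwise terms, so the cost of evaluating the
score grows quadratically in the number of aligned forests.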
As I represent RNA secondary structures in the extended forest
representation, I concentrate on the well-formed alignment similarity
$\sigma_{WFA}$ (see Section 4.3). The definition of well-formed forest
alignments that was given in the context of pairwise alignments (see
Definition (4.1)) applies to the multiple case as well. I define the alignment
similarity $\sigma_{WFA}$ between forests $F_1, F_2, \dots, F_n$ as the
maximum SP-score that a well-formed alignment of $F_1, F_2, \dots, F_n$ can
achieve$^1$.
\[ \sigma_{WFA}(F_1, F_2, \dots, F_n) = \max\{\sigma_{SP}(A) \mid A \text{ is a
   well-formed alignment of } F_1, F_2, \dots, F_n\}. \tag{5.3} \]
5.3 A Forest Profile for RNA Secondary Structures
Multiple alignments of protein sequences are useful to group proteins of
similar function into protein families. The identification of further proteins
that belong to a certain family naturally gives rise to the question of
finding a kind of representative sequence for a protein family. Such
representations, which are well known for multiple sequence alignments, are
the profile and the consensus sequence. Here I restrict my attention to the
profile representation [66]. A profile for a multiple sequence alignment
consists of the frequencies of characters in each column and is also known as
a weight matrix.

1 The generalization of the classic forest alignment model follows analogously.

In analogy to viewing sequence alignments as sequences of edit operations,
122 Multiple Alignment of RNA Secondary Structures
[Figure 5.1: the forests F1 = π(A|1), F2 = π(A|2), F3 = π(A|3), F4 = π(A|4),
each with its 2d-plot, and their alignment A with 4-tuple node labels]
Figure 5.1: A is an alignment of $F_1, F_2, F_3, F_4$. The 2d-plot of the
secondary structure is shown at the top right corner of the forests. The
2d-plot for the alignment A will be explained in Section 5.3.1. Intuitively,
it shows an overlay of the single structures.
and tree alignments as trees labeled with edit operations (see Section 2.5.3), I
consider a profile for a sequence alignment as a sequence of relative frequency
vectors. Consequently, a profile for a forest alignment is a forest labeled with
relative frequency vectors. Let $k$ be the number of symbols of the label
alphabet, including the gap symbol; I give the following definition
of a profile for a forest alignment:
Given a multiple forest alignment A T(
n
$\sigma_{SP}$ of two relative frequency vectors $p, q \in \mathbb{R}^k$ is
defined as follows:
\[ \sigma_{SP}(p, q) = \sum_{(a,b) \in \Sigma^2} p_a \, q_b \, \sigma(a, b).
   \tag{5.5} \]
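Equation (5.5) can be illustrated with a short Python sketch. The row order
A, C, G, U, P, gap follows the caption of Figure 5.2; the function name
`sp_profile_score`, the gap symbol "-" and the toy scoring function `sigma`
are assumptions for illustration, not the scoring scheme used in this thesis.

```python
# Row order of the frequency vectors as in Figure 5.2; "-" stands for the gap.
ALPHABET = ["A", "C", "G", "U", "P", "-"]


def sigma(a, b):
    """Toy pairwise score (hypothetical values): match 1, mismatch 0,
    symbol against gap -1; two aligned gaps contribute nothing."""
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return -1.0
    return 1.0 if a == b else 0.0


def sp_profile_score(p, q, score=sigma):
    """Expected pairwise score of two relative frequency vectors:
    sum over all symbol pairs (a, b) of p_a * q_b * score(a, b)."""
    return sum(p[i] * q[j] * score(a, b)
               for i, a in enumerate(ALPHABET)
               for j, b in enumerate(ALPHABET))


p = [0.75, 0.0, 0.0, 0.25, 0.0, 0.0]   # 75% A, 25% U
q = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]     # 100% A
print(sp_profile_score(p, q))
```

For vectors that come from single structures (one entry equal to 1), the
expression reduces to the plain score of the two symbols.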
Unlike for distances, where the score of two equal forests is zero, the
similarity value can be an arbitrary positive value. The similarity score of
two equal forests of size n can be the same as for two different forests of
size m where m > n. Therefore, I introduce relative scores that are upper
bounded by 1. The relative similarity score $\sigma_{SP\,REL}$ of forests
$F_1$ and $F_2$ is defined as:
\[ \sigma_{SP\,REL}(F_1, F_2) =
   \frac{2 \cdot \sigma_{SP}(F_1, F_2)}
        {\sigma_{SP}(F_1, F_1) + \sigma_{SP}(F_2, F_2)} \tag{5.6} \]
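A small worked sketch of Equation (5.6) in Python follows. The function name
and the numeric scores are hypothetical; the self-similarity scores are
assumed to be precomputed.

```python
def relative_similarity(score_12, score_11, score_22):
    """Relative similarity: twice the pairwise score divided by the sum of
    the two self-similarity scores."""
    denominator = score_11 + score_22
    if denominator == 0:
        raise ValueError("self-similarity scores must not sum to zero")
    return 2.0 * score_12 / denominator


# Two forests with self-similarities 40 and 60 and pairwise similarity 35.
print(relative_similarity(35, 40, 60))
# Identical forests reach the upper bound.
print(relative_similarity(40, 40, 40))
```

The normalization makes scores of small and large structure pairs comparable,
which matters when the scores decide which profiles are merged next.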
The self-similarity score of a forest results from a perfect matching alignment
for reasonable scoring schemes. This score can be computed, without
self-aligning the forests, in $O(|F|)$. It is simply the sum of the self-relabeling
[Figure 5.2: panel (a), a multiple forest alignment with 4-tuple node labels;
panel (b), the same forest with each node labeled by a frequency vector]
Figure 5.2: (a) shows a multiple tree alignment for the extended forest
representation of RNA secondary structures and (b) its corresponding profile.
The rows of the frequency vectors stand, from top to bottom, for the
frequencies of the symbols A, C, G, U, P and the gap symbol. Note that the
frequency of a base is zero iff the frequency of a base-pair bond is greater
than zero.
scores for each node. A new profile forest can be constructed in $O(|F|)$
from an alignment of profiles using the weighted mean values of the aligned
frequency vectors as the frequency vector of the profile. The number of
forests in the aligned profiles determines the weight. Formally, let n and m
be the number of aligned forests for the profiles $P_1$ and $P_2$,
respectively. For each relabeled node in the alignment of $P_1$ and $P_2$, I
define $p_1$ and $p_2$ as the aligned frequency vectors in $P_1$ and $P_2$,
respectively. For each insertion and deletion node, I define the frequency
vector of the absent side ($p_1$ or $p_2$) to be the frequency vector of a gap
node, the vector $(0, 0, 0, 0, 0, 1)^T$. The combined frequency vector $p_a$
is calculated as follows:
\[ p_a = \frac{n}{n + m} \, p_1 + \frac{m}{n + m} \, p_2 \tag{5.7} \]
Note that single structures in the extended forest representation can also be
converted to a corresponding profile.
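The merge of two aligned frequency vectors in Equation (5.7) can be sketched
as follows. The function name `merge_vectors` and the convention of passing
`None` for the side that is absent at an insertion or deletion node are
assumptions of this sketch.

```python
# Frequency vector of a gap node; rows are A, C, G, U, P, gap.
GAP_VECTOR = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]


def merge_vectors(p1, p2, n, m):
    """Weighted mean of two frequency vectors, where n and m are the numbers
    of forests already contained in the two profiles. A missing vector
    (None) is treated as the gap vector."""
    p1 = GAP_VECTOR if p1 is None else p1
    p2 = GAP_VECTOR if p2 is None else p2
    return [(n * x + m * y) / (n + m) for x, y in zip(p1, p2)]


# Relabeled node: three structures with A are merged with one structure with U.
print(merge_vectors([1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], n=3, m=1))
# Insertion node: the first profile (three structures) has no node here.
print(merge_vectors(None, [0, 0, 1, 0, 0, 0], n=3, m=1))
```

Because the weights n/(n+m) and m/(n+m) sum to one, the entries of the merged
vector again sum to one whenever both inputs do.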
5.3.1 Visualization of RNA Secondary Structure Profiles
A profile for a forest alignment can represent multiple RNA secondary
structures, which raises the question of how to find a consensus structure.
Since bases paired in one structure can be aligned to bases unpaired in
another structure, this leaves some choice.
An RNA profile is compatible with a single structure if the leftmost and
rightmost child of each P-node is a B-node. A compatible profile can be
obtained from an arbitrary profile by deleting P-nodes. An optimal consensus
structure corresponds to a compatible profile that maximizes the sum of
P-node frequencies.
I provide a console output and a 2d-plot visualization for consensus
sequence and structure.
ASCII Representation
In the ASCII representation, I draw the consensus sequence on top of the
consensus structure. The height of the columns of * symbols above and below
the consensus sequence-structure gives the frequency of bases and base-pairs
in the consensus. Each * means 10% frequency. Sequence and structure
conservation and the relation between them can be read from this arrangement.
An example is given in Figure 5.3.
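The star columns can be sketched in a few lines of Python. This is an
illustration, not the code of RNAforester: the function names, the rounding
(frequencies are floored to full 10% steps) and the choice to draw base
frequencies above and base-pair frequencies below the consensus are
assumptions.

```python
def star_rows(frequencies):
    """Return ten text rows, lowest level first. A column gets one '*' per
    full 10% of its frequency (assumed rounding: floor)."""
    heights = [int(f * 10 + 1e-9) for f in frequencies]
    return ["".join("*" if h > level else " " for h in heights)
            for level in range(10)]


def render_consensus(sequence, structure, base_freq, pair_freq):
    """Stack the base-frequency stars above and the base-pair-frequency stars
    below the consensus sequence and structure."""
    top = star_rows(base_freq)[::-1]   # tallest level is printed first
    bottom = star_rows(pair_freq)      # grows downwards from the structure
    return "\n".join(top + [sequence, structure] + bottom)


print(render_consensus("gcaagc", "((..))",
                       [1.0, 0.9, 0.5, 0.5, 0.9, 1.0],
                       [1.0, 0.7, 0.0, 0.0, 0.7, 1.0]))
```

A fully conserved position therefore shows a column of ten stars, while a
position that is a gap in most structures shows only a short column.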
2d-plot
The 2d-plot visualization for RNA secondary structures was refined to
visualize consensus structures by adding reliability information to the drawing
* * **** ****
* * **** ****
** * **** **** *
** * **** * **** *
** * **** ******** **** **** *
**************** ** * **** ******** **** **** *
**************** ** * **** ******** **** **** ****
**************** ** * **************** ****************
**************** ** **************************************
**************** ** **************************************
ggggcuauagcucagcugggggagcuauagcucagcugggagcggggauagcuuaacc
.((((....))))....((.(.(((((..((((........))))...))))))..))
**********************************************************
**************** ** **************************************
**************** ** ** ***********************************
**************** ** * *************** *********** *****
** * **** ******** ***** **** * **
** * **** ******** ***** **** * **
** * **** ******* *** * **** * **
** * **** * **** * **
* * **** **** * *
* * **** **** * *
Figure 5.3: ASCII representation of a consensus sequence and structure.
using some color schemes [24, 105]. I draw the consensus structure in two
forms that differ only in the way sequence information is included. Both
express the frequency of base-pairs and the presence (or absence) of gaps as
a gradient from light grey to black. A base-pair is only drawn if it is present
in at least fifty percent$^2$ of the structures. With respect to sequence
information, I provide either the most frequent base at each residue, or
indicate the base and gap frequencies in an arrangement of colored dots$^3$.
In contrast to others, my visualization includes the full sequence information
of the consensus structure. This information is useful to relate structure and
sequence

2 This parameter is adjustable.
3 This arrangement was the result of a discussion with Peter Stadler, whom I
gratefully acknowledge here.
[Figure 5.4, panels (a) and (b): 2d-plots of the consensus structure]
Figure 5.4: 2d-plots of a multiple alignment of 20 secondary structures of
E. coli tRNAs. In both plots, the consensus structure is shown. The lighter a
base-pair bond is drawn, the less frequently it occurs in the structures.
Bases or base-pair bonds that have a frequency of one are drawn in red. The
darkness of the lines connecting adjacent bases (the backbone, not base-pairs)
is proportional to the product of the frequencies that there is no gap at the
residues. Again, if there is no gap, the connecting line is drawn in red.
(a) The most frequent base at each residue is printed, with the base frequency
indicated by greyscale. (b) The frequencies of the bases a, c, g, u are
proportional to the radius of circles that are arranged clockwise on the
corners of a square, starting at the upper left corner. Additionally, these
circles are colored red, green, blue, magenta for the bases a, c, g, u,
respectively. The frequency of a gap is proportional to a black circle growing
at the center of the square.
conservation. Figure 5.4 shows an example of these profile drawings.
5.4 A Progressive Profile Algorithm
The alignment-based comparison of RNA secondary structures makes it almost
straightforward to harness multiple sequence alignment strategies. Here, I
introduce an algorithm that is inspired by the progressive calculation of
multiple sequence alignments as in ClustalW [200]. In contrast to ClustalW,
my algorithm does not calculate a guide tree solely based on initial pairwise
similarities. It has been observed (A. Dress, personal communication) that
any such phylogeny tends to reproduce the guide tree, no matter how well
this tree really suits the data.
My strategy is as follows: As in ClustalW, I start with the computation
of all pairwise profile similarities. From these comparisons, the profiles
with the highest similarity are merged and (unlike ClustalW) the similarity of
the new combined profile to all other profiles is calculated. This procedure
is repeated until only one profile is left.
A well-known problem of the progressive strategy is that errors made
early in an alignment cannot be rectified when further sequences are added.
To reduce the greediness of my strategy, I do not simply merge the pair of
profiles that obtains the highest similarity, but also consider the similarity
to, and between, other profiles. Therefore, a maximum weighted matching
between profiles is calculated in each step: Consider profiles as vertices in
a graph. Each pair of vertices is connected by an edge that has a weight
corresponding to the similarity of the profiles. A maximum weighted matching
is a subset of edges such that no two edges share a common endpoint and the
sum of edge weights is maximal. I use Gabow's N-cubed weighted matching
algorithm to find the best matching pairs of profiles [51]. The pair of
profiles with the highest similarity according to this matching is the one
that is merged. Algorithm 5.1 computes a multiple forest alignment according
to the proposed strategy.
To facilitate joining multiple pairs of profiles in Step 4, the algorithm could
join the best n pairs or all pairs that exceed a certain similarity threshold.
Figure 5.5 shows an example of a progressive profile alignment of RNA
secondary structures.

Input: Forests $F_1, F_2, \dots, F_n$ in the extended forest representation.
Output: A profile forest P for the multiple alignment of $F_1, F_2, \dots, F_n$.
1 Convert $F_1, \dots, F_n$ into single structure profiles $P_1, \dots, P_n$.
2 Construct all $n(n-1)/2$ pairwise relative similarity scores
  $\sigma_{SP\,REL}$ of $P_1, \dots, P_n$.
3 Compute a maximum weighted matching M for the pairwise similarities.
4 Choose $P_i$ and $P_j$ of maximal similarity according to M, compute
  their alignment $P_{ij}$, and replace both by $P_{ij}$.
5 Compute the relative similarity score of $P_{ij}$ with all others.
6 Iterate Steps 3 to 5, until only a single profile alignment $P_{1 \dots n}$
  is left.
Algorithm 5.1: Progressive profile alignment of forests.
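The control flow of Algorithm 5.1 can be sketched in Python as follows. This
is a sketch under stated assumptions, not the RNAforester implementation:
profiles are opaque objects, `similarity` and `merge` are supplied by the
caller, and the matching is found by exhaustive search (exponential, usable
only for a handful of profiles) as a stand-in for Gabow's algorithm. If the
matching is empty because all weights are negative, the sketch falls back to
the best pair overall, which is also an assumption.

```python
from itertools import combinations


def max_weight_matching(items, weight):
    """Exhaustive maximum weighted matching over a list of vertices.
    Returns (total weight, list of matched pairs)."""
    if len(items) < 2:
        return 0.0, []
    first, rest = items[0], items[1:]
    # Option 1: leave `first` unmatched.
    best_w, best_m = max_weight_matching(rest, weight)
    # Option 2: match `first` with each possible partner.
    for k, partner in enumerate(rest):
        w, m = max_weight_matching(rest[:k] + rest[k + 1:], weight)
        w += weight(first, partner)
        if w > best_w:
            best_w, best_m = w, [(first, partner)] + m
    return best_w, best_m


def progressive_alignment(profiles, similarity, merge):
    """Merge profiles progressively until a single profile is left."""
    active = dict(enumerate(profiles))            # Step 1 is assumed done
    next_id = len(active)
    score = {frozenset((i, j)): similarity(active[i], active[j])
             for i, j in combinations(active, 2)}  # Step 2
    while len(active) > 1:
        keys = sorted(active)
        _, matching = max_weight_matching(          # Step 3
            keys, lambda i, j: score[frozenset((i, j))])
        candidates = matching or list(combinations(keys, 2))
        i, j = max(candidates, key=lambda e: score[frozenset(e)])  # Step 4
        merged = merge(active.pop(i), active.pop(j))
        for k in active:                            # Step 5
            score[frozenset((k, next_id))] = similarity(active[k], merged)
        active[next_id] = merged
        next_id += 1
    return next(iter(active.values()))


# Toy profiles (name, value, number of members); hypothetical stand-ins.
def toy_similarity(p, q):
    return 1.0 / (1.0 + abs(p[1] - q[1]))


def toy_merge(p, q):
    n, m = p[2], q[2]
    return (f"({p[0]}+{q[0]})", (n * p[1] + m * q[1]) / (n + m), n + m)


start = [("a", 1.0, 1), ("b", 1.2, 1), ("c", 5.0, 1), ("d", 5.5, 1)]
print(progressive_alignment(start, toy_similarity, toy_merge))
```

In the toy run, the matching pairs a with b and c with d; only the better of
the two pairs is merged per iteration, as in Step 4.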
A related strategy has been suggested by Torsello et al. [204, 205]. In con-
trast to the forest alignment similarity, they compute the tree edit distance
between trees. In the progressive calculation, they merge trees based on the
edit distance mapping. As was shown in Section 2.5.3, such a mapping is
not always consistent with a consensus tree structure. Torsello et al. simply
reject the merge and search for another pair of trees to be merged.
[Figure 5.5: 2d-plots of the predicted structures and of the profiles that
result from joining them progressively]
Figure 5.5: A progressive profile alignment of predicted structures for ROSE
elements. These genes encode RNA molecules that have a regulatory function
triggered by environmental heat, so-called RNA thermometers [27, 146]. The
structures were predicted by RNAfold using the default parameters. In the
progressive alignment shown, a single structure profile is joined to a cluster
of profiles in each step. Note that this is the optimal joining procedure in
this example, but it is also possible to merge clusters of profiles: if the
score between the two rightmost structures were slightly better, these
structures would be joined and the resulting cluster would be merged with the
profile of the two leftmost structures.
5.4.1 Efficiency Analysis
Time Complexity The asymptotic time complexity of Algorithm 5.1 is
as follows: Let there be n structures of average size s, measured in terms of
nodes in the corresponding forest. Let d be the average degree of a forest
node. The pairwise algorithm has time efficiency $O(s^2 d^2)$ and space
efficiency $O(s^2 d^2)$, see Section 3.3.2. In Step 2, this algorithm is
called for all pairs of tree profiles, yielding time efficiency
$O(s^2 d^2 n^2)$. Steps 3 to 5 are repeated $n - 1$ times. In Step 3, a
maximum weighted matching is calculated in $O(n^3)$ using Gabow's algorithm
[51]. In the ith iteration, Step 5 computes $n - i$ pairwise alignment scores.
Consequently, the overall runtime of Steps 3 to 5 is in
$O(s^2 d^2 n^2 + n^4)$. Thus, the runtime of Algorithm 5.1 is in
$O(s^2 d^2 n^2 + n^4)$.
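The bound can be retraced term by term. The following derivation is a sketch
that assumes every pairwise profile comparison costs $O(s^2 d^2)$, regardless
of how many structures a profile already contains:

```latex
\begin{align*}
\text{initial pairwise scores:}\quad
  & \binom{n}{2} \cdot O(s^2 d^2) = O(s^2 d^2 n^2), \\
\text{matchings over all iterations:}\quad
  & (n-1) \cdot O(n^3) = O(n^4), \\
\text{alignments of the chosen pairs:}\quad
  & (n-1) \cdot O(s^2 d^2) = O(s^2 d^2 n), \\
\text{re-scoring of the merged profiles:}\quad
  & \sum_{i=1}^{n-1} (n-i) \cdot O(s^2 d^2)
    = \frac{n(n-1)}{2} \cdot O(s^2 d^2) = O(s^2 d^2 n^2).
\end{align*}
```

Adding the four terms and dropping the dominated one gives
$O(s^2 d^2 n^2 + n^4)$.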
Space Complexity In Step 1, n forests are stored, requiring $O(n \cdot s)$
space. The allocated space of a pairwise alignment can be freed after the
alignment score is calculated. The scores are stored in a table of size $n^2$.
In Step 4, the optimal alignment is obtained by recalculating the alignment
and a backtracking procedure in $O(s^2 d^2)$ time and $O(s^2 d)$ space. Thus,
the overall space requirement of Algorithm 5.1 is
$O(n \cdot s + n^2 + s^2 d^2)$.
From the observations in Section 4.8, the degree of an RNA secondary
structure can be considered a constant. Hence, multiple RNA structure
alignments under the tree alignment model can be calculated with the same
asymptotic efficiency as multiple sequence alignments.
5.5 A Structure Prediction Strategy based
on Multiple RNA Secondary Structure
Alignment
A multiple sequence alignment is often the first step in determining a
consensus structure (see Section 2.4). Homologous sequence regions are matched
in the alignment and force regions of sequence diversity between them to be
aligned to each other. Among a family of structural RNAs, base-pairs are
expected to be less sequence-conserved in an alignment than unpaired bases.
Thus, regions of diversity in an alignment are subject to structural
observations. It is obvious that such a strategy requires a considerable
amount of sequence conservation to be successful.
I use the multiple structure alignment to predict consensus structures
the other way around. First, the structure of each RNA is predicted
thermodynamically. Second, a pure multiple structure alignment is computed
by the algorithm presented in the previous section. A pure structure
alignment means that sequence conservation is not favored by the scoring
scheme (see Section 4.5). In contrast to the sequence-based approach,
structurally conserved regions are the anchor regions of the alignment.
The success of this strategy depends largely on the accuracy of secondary
structure prediction from single sequences. Unfortunately, for a set of RNA
sequences that belong to the same family, the predicted structures are often
diverse and not always compatible with a similar consensus structure. For
instance, in the multiple alignment example in Figure 5.5 the rightmost
structure does not fit well to the others. Looking at the suboptimal
structures reveals a structure that is in better correspondence to the others.
Furthermore, for a successful application of Algorithm 5.1 to the proposed
structure prediction strategy the following must be guaranteed: First, all
sequences share a similar structure, i.e. they belong to the same family.
Second, there is only one structural conformation for the sequences.
The remedy is a clustering of predicted structures in the progressive
calculation of the structural alignment. To facilitate clustering of forests,
joining alignments in Step 4 is restricted to a minimal cutoff value c. If the
best alignment score of $P_i$ and $P_j$ in Step 3 is below c, these profiles
are put into the result list of clusters, which are not aligned further.
The structures that are aligned are the result of structure prediction
algorithms based on thermodynamics. If only the base-pair information of a
prediction is used, a base is either paired or unpaired. Thus, stable and
unstable base-pairs are indistinguishable. Reasonably, the deletion of a weak
base-pair should not be penalized as heavily as the deletion of a strong
base-pair. The profile representation of structures allows an elegant way to
weight base-pairs by incorporating base-pair probabilities. I calculate
base-pair probabilities using McCaskill's partition function algorithm and
weight the P-nodes in the initial profile forests according to the base-pair
probabilities. It has been observed by Gardner & Giegerich that pruning of
base-pairs with a low probability can improve the results of a structural
alignment of predicted structures. A threshold value p determines the minimum
probability of a base-pair that is required to generate a corresponding P-node
in the extended forest representation. Algorithm 5.2 shows the modified
multiple structure alignment algorithm for consensus structure prediction from
RNA sequences.
Input: RNA sequences $S_1, S_2, \dots, S_n$,
       clustering cutoff c, minimum base-pair probability p.
Output: A list of profile forests.
1  Calculate McCaskill's partition function for $S_1, S_2, \dots, S_n$ and
   build the weighted single structure profiles $P_1, P_2, \dots, P_n$
   according to the mfe structure and threshold p.
2  Construct all $n(n-1)/2$ pairwise relative similarity scores of
   $P_1, \dots, P_n$.
3  Compute a maximum weighted matching M for the pairwise similarities.
4  Choose $P_i$ and $P_j$ of maximal similarity according to M
5  if $\sigma_{SP\,REL}(P_i, P_j) > c$ then
6      Compute their alignment $P_{ij}$ and replace both by $P_{ij}$.
7  else
8      Put $P_i$ and $P_j$ in the result list.
9  end
10 Compute the relative similarity score of $P_{ij}$ with all others.
11 Iterate Lines 3 to 8, until no profiles are left.
Algorithm 5.2: Progressive profile alignment of forests.
I suggest setting the clustering cutoff c to zero, as zero separates the
structures that are rather similar from those that are rather dissimilar under
a similarity scoring scheme that includes positive and negative contributions.
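The pruning in Step 1 of Algorithm 5.2 can be sketched as follows. The sketch
assumes that the base-pairs of the mfe structure and their probabilities are
already available as a dictionary keyed by 0-based positions (i, j); the
function name `prune_pairs` and the example probabilities are hypothetical,
and no call into the Vienna RNA library is shown.

```python
def prune_pairs(length, pair_probs, p_min):
    """Keep only base-pairs whose probability reaches p_min. Returns the
    pruned structure in dot-bracket notation and the weights that would be
    attached to the remaining P-nodes."""
    structure = ["."] * length
    weights = {}
    for (i, j), prob in pair_probs.items():
        if prob >= p_min:
            structure[i] = "("
            structure[j] = ")"
            weights[(i, j)] = prob
    return "".join(structure), weights


# Hypothetical probabilities for three nested base-pairs of a 10-mer.
pairs = {(0, 9): 0.95, (1, 8): 0.4, (2, 7): 0.8}
print(prune_pairs(10, pairs, p_min=0.5))
```

Bases of a pruned pair simply become unpaired, so the forest keeps both bases
but loses the P-node above them.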
Complexity Algorithm 5.2 calculates the partition function for all n
sequences of average length s. Each prediction is done in $O(s^3)$. Thus,
Step 1 requires $O(n \cdot s^3)$ time. The complexity of the remaining
calculations is the same as for Algorithm 5.1. That is, the time complexity of
Algorithm 5.2 is $O(n \cdot s^3 + s^2 d^2 n^2 + n^4)$, where d is the average
degree of the predicted profiles. The $O(n \cdot s + n^2 + s^2 d^2)$ space
complexity remains unaffected since each calculation of the partition function
requires $O(s^2)$ space.
Chapter 6
RNAforester: A Tool for
Comparing RNA Secondary
Structures
RNAforester is a command line based tool for comparing RNA secondary
structures. It supports the computation of pairwise and multiple alignment
of structures based on the models and algorithms presented in Chapter 4
and Chapter 5. The user interface follows the philosophy of the Vienna
RNA Package [84] and will be part of the forthcoming Vienna RNA Package
Version 1.6. The tools and technologies that are behind RNAforester are
outlined in Section 6.1. The command line usage and options are explained
in Section 6.2. The online interface of RNAforester is shown in Section 6.3.
6.1 Implementation Notes
RNAforester is implemented in the programming language C++ [187]. The
source code distribution is freely available at
http://bibiserv.uni-bielefeld.de/rnaforester. The source code distribution is
packaged using the GNU Build Tools autoconf and automake [61]. The source code
of RNAforester is documented using the documentation system Doxygen [211].
Doxygen can generate on-line documentation in HTML and an off-line reference
manual in various formats. It can visualize the relations between the various
elements by means of include dependency graphs, inheritance diagrams, and
collaboration diagrams, which are all generated automatically. The
documentation is extracted directly from the sources, which makes it
comfortable to keep the documentation consistent with ongoing development.
The generation of 2d-plots is facilitated by Milanovic & Wagner's g2 graphics
library [136]. The clustering of profiles in the progressive calculation of
multiple alignments employs Ed Rothberg's implementation of Gabow's N-cubed
weighted matching algorithm [137]. The clusters that are built during
the calculation of multiple alignments are written to Graphviz-compatible
files for further analysis [64].
The data structures and algorithms of RNAforester are designed using
C++'s template mechanism. Templates are very useful for the implementation
of generic constructs like vectors, stacks, lists and queues, which can be
used with any arbitrary type. Accordingly, a forest alignment can have any
type of labels. In data types and algorithms the labels become a type
parameter. C++ templates provide a way to re-use source code, as opposed to
inheritance and composition, which provide a way to re-use object code. C++
provides two kinds of templates: class templates and function templates. In
the Standard Template Library (STL), generic algorithms have been implemented
as function templates, and the containers have been implemented as class
templates. These implementations achieve excellent practical runtime, since
the expensive type replacements happen at compile time. Since I followed
the STL's template philosophy, my algorithms can easily be integrated in
tools beyond the scope of RNA secondary structure comparison.
6.2 User Manual
By default, RNAforester calculates pairwise similarity between RNA secondary
structures under the scoring scheme proposed in Section 4.5.1. RNAforester
reads RNA secondary structures from stdin in Fasta format where matching
brackets symbolize base-pairs and unpaired bases are represented by dots.
An example is given below:
> test
accaguuacccauucgggaaccggu
.((..(((...)))..((..)))).
All characters after a blank are ignored and all "-" characters are removed.
The program will continue to read new structures until it encounters a "@"
character or the end of file. Lines starting with ">" can contain a structure
name.
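The input conventions just described can be sketched as a small Python reader.
It is an illustration, not RNAforester's own parser: the function name
`read_structures` is hypothetical, and the sketch assumes that each record
consists of an optional name line, one sequence line and one structure line,
and that "@" appears at the start of a line.

```python
def read_structures(lines):
    """Parse (name, sequence, structure) records from Fasta-like lines."""
    records, name, pending = [], None, []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        if raw.startswith("@"):      # explicit end-of-input marker
            break
        if raw.startswith(">"):      # optional structure name
            name = raw[1:].strip() or None
            continue
        # Ignore everything after the first blank and drop '-' characters.
        pending.append(raw.split()[0].replace("-", ""))
        if len(pending) == 2:        # sequence line followed by structure line
            records.append((name, pending[0], pending[1]))
            name, pending = None, []
    return records


example = [
    "> test",
    "accaguuacccauucgggaaccggu",
    ".((..(((...)))..((..)))).",
    "@",
    "> never read",
]
print(read_structures(example))
```

Reading from stdin then amounts to calling `read_structures(sys.stdin)` after
importing `sys`.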
The similarity scores, alignments, and consensus sequences and structure
are written to stdout. The default format for alignments is ClustalW format.
6.2.1 Options
RNAforester has a number of options that control the alignment mode and
the output. In the following description, int and dbl stand for integers and
floating-point numbers, respectively.
--help: Shows the synopsis of RNAforester.
--version: Shows version information of RNAforester.
-f=filename: This option lets RNAforester read input from filename.
2d-plots are written to files prefixed with filename.
-d: This option lets RNAforester calculate distance instead of similarity.
In contrast to similarity, scoring contributions are minimized.
This parameter cannot be used in conjunction with multiple alignment.
(This restriction is due to technical difficulties with the maximum
weighted matching algorithm used in the progressive calculation
of multiple alignments.)
-r: Calculate relative similarity scores for pairwise alignments, see
Equation (5.6) in Section 5.3.
-l, -s, -so=int: Options -l and -s let RNAforester calculate local similarity
and small-in-large similarity, respectively (see Section 4.6). If parameter
-so is used additionally, suboptimal solutions are calculated such that the
score is within so percent of the optimum.
-m, -mc=dbl, -mt=dbl: Option -m activates the multiple alignment
mode of RNAforester. The clustering cutoff can be adjusted by parameter
-mc; the default value is zero. To facilitate joining multiple clusters
in each step, parameter -mt can be adjusted (see Section 5.5). The default
value is 0.7 (relative similarity). All pairs above this threshold
are joined in each step of the multiple alignment calculation (see
Section 5.4). The clusters are written to a file cluster.dot in Graphviz's
dot format. If a filename was specified by parameter -f, the file is named
filename_cluster.dot.
-p, -pmin=dbl: Structures are predicted from the partition function
algorithm in the Vienna RNA library. The P-nodes (representing
base-pair bonds) in the corresponding forests are weighted according
to base-pair probabilities from the partition function (see Section 5.5).
Parameter -pmin sets the minimum frequency that is required to create
a P-node in the extended forest representation. The default value is
0.5. By this parameter, a pruning of high entropy base-pairs is possible.
-cmin=dbl: This parameter sets the minimum frequency that is required
for a base-pair to appear in the final consensus structure.
-pm=int, -pd=int, -bm=int, -br=int, -bd=int: Set the scoring
values for a base-pair bond match, a base-pair bond deletion, a base
match, a base replacement, and a base deletion according to the scoring
model described in Section 4.5.1. The default values are -pm=10, -pd=-5,
-bm=1, -br=0, and -bd=-10.
--RIBOSUM: Uses the scoring model described in Section 4.5.2 with
the RIBOSUM85-60 matrix. The RIBOSUM score is only supported
for pairwise alignments.
-2d, --2d_hidebasenum, --2d_basenuminterval=int, --2d_grey,
--2d_scale=dbl, --2d_png, --2d_jpg: Option -2d activates the generation
of 2d-plot postscript files. In the pairwise alignment mode,
the drawings are written to files x_n.ps and y_n.ps where n is an
index. If local similarity (-l, -s) in conjunction with suboptimal solutions
(-so) is set, n enumerates the suboptimal solutions. The regions
of local similarity are highlighted in the 2d-plots of the original
structures that are written to the files x_str.ps and y_str.ps. Parameter
--2d_hidebasenum disables the numbering of bases according to the
interval --2d_basenuminterval. Colors can be turned into grey-scale by
parameter --2d_grey. The size of the drawing can be adjusted by
parameter --2d_scale. The drawing is scaled by the given factor. The
parameters --2d_png and --2d_jpg let RNAforester write 2d-plots in
PNG and JPG format.
--fasta: Alignments are printed to the console in Fasta format.
--score: Only the optimal score of an alignment is printed. This option
is useful when RNAforester is called by another program that only
needs a similarity or distance value.
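When RNAforester is called by another program, the argument list can be
assembled from the options above. The following Python helper is a
hypothetical sketch: it only builds the list and does not run anything, it
covers a few of the listed options, and it assumes that the executable is
named RNAforester and is found on the search path.

```python
def build_command(input_file=None, multiple=False, cutoff=None,
                  relative=False, score_only=False, binary="RNAforester"):
    """Assemble an argument list from a subset of the documented options."""
    args = [binary]
    if multiple:
        args.append("-m")                 # multiple alignment mode
        if cutoff is not None:
            args.append(f"-mc={cutoff}")  # clustering cutoff
    if relative:
        args.append("-r")                 # relative similarity (pairwise)
    if score_only:
        args.append("--score")            # print only the optimal score
    if input_file is not None:
        args.append(f"-f={input_file}")   # read input from a file
    return args


print(build_command("trnas.fa", multiple=True, cutoff=0.0))
print(build_command(relative=True, score_only=True))
```

The resulting list could be passed to `subprocess.run`, with the structures
supplied on stdin when no input file is given.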
6.3 RNAforester Web Interface
The online version of RNAforester is available at the Bielefeld Bioinformatics
Server (http://bibiserv.uni-bielefeld.de/rnaforester) [176]. A
screenshot is shown in Figure 6.1. For more information, refer to the online
manual.
Figure 6.1: Screenshot of the web interface of RNAforester.
Chapter 7
Applications
In this chapter, I demonstrate the practical impact of the algorithms that
were presented in this thesis. In Section 7.1, I present joint work with T.
Töller and R. Giegerich that was initially described by Töller in [203]. We
present a pipeline for the detection of new regulatory motifs that includes,
as an integral part, the computation of local structure alignments. Multiple
alignment of RNA secondary structures is helpful to reveal a common
structural property of RNAs. We present two applications for multiple
structure alignment, motif discovery and consensus structure prediction, in
Section 7.2. The latter application is initially based on sequence information
where no verified structures are given, as proposed in Section 5.5.
7.1 Local Structure Alignment as a New Strategy for RNA Motif Detection
This section presents joint work with T. Töller and R. Giegerich. The
investigation of structural RNAs or RNA motifs is a major task in modern
molecular biology. Untranslated terminal regions (UTRs) of mRNAs sometimes
contain regulatory motifs that are important for posttranscriptional gene
regulation. Such motifs can affect mRNA localization [91], mRNA
degradation [69], and translational regulation [65]. One of the best
investigated regulatory motifs in UTRs is the Iron Responsive Element (IRE).
It is a specific stem-loop structure that can be found in the 5'UTRs and
3'UTRs of various mRNAs [101]. There is, for example, one IRE in the 5'UTR
of the vertebrate ferritin mRNAs, where it regulates the translational
efficiency depending on the amount of iron in the cell. If there is no iron
in the cell, regulatory proteins bind to the IRE, which results in a
translational block of the ferritin mRNA. In contrast, protein binding to
five IREs in the 3'UTR of the human transferrin receptor mRNA leads to a
stabilization of this mRNA at a low iron level in the cell. Thus, the same
structural RNA motif functions in different posttranscriptional regulatory
pathways depending on its location in the 5'UTR or 3'UTR.
7.1.1 Strategies for the Detection of RNA Motifs
Regulatory RNA motifs like the IRE often consist of both sequence and
structure features. Therefore, special requirements exist for the prediction
of such regulatory RNA motifs. There are essentially three strategies that
can be used for RNA motif detection (refer also to Section 2.4).
The simultaneous strategy is the joint optimization of sequence alignment
and RNA folding and was first postulated by Sankoff [172]. Because of its
time complexity, however, it cannot be used for real data.
The sequence-based strategy, in the initial step, calculates a sequence
alignment to identify regulatory motifs by their conservation on the
sequence level. In a second step, an RNA folding program like RNAfold can be
used to verify that the conserved sequences can build a common structure
motif. Since regulatory motifs in RNAs are often more conserved in structure
than in sequence, this strategy will fail to identify such motifs.
The pure structure alignment strategy (for short, pure strategy) was
introduced and applied by Töller in [203]. It is based on RNA folding and
subsequent detection of conserved motifs. These are purely structure motifs
and require no sequence alignment; this explains the name of the strategy.
The calculation of local structure alignments with the algorithms that were
introduced throughout this chapter, implemented in the tool RNAforester
(see Chapter 6), is an essential step of the strategy. The pure strategy
follows the protocol shown in Figure 7.1 and will be explained throughout
the next sections. We will report on a viability study using IRE motifs, and
on the prediction of a new regulatory motif and its wet lab validation.
7.1.2 The Pure Structure Alignment Strategy
[Flowchart with the boxes: RNA Folding (RNAfold), Pure Structure Alignment
(RNAforester), Pattern Definition, Significance Evaluation (ADP), Database
Search (RNAMotif), Experimental Validation]
Figure 7.1: Essential Steps of the Pure Structure Alignment Strategy.
The pure strategy (see Figure 7.1) comprises six steps, of which RNA folding,
structure alignment, significance evaluation and pattern search are based on
suitable Bioinformatics methods. All steps may include some variation of
parameters. Pattern design and significance analysis are somewhat
mathematical activities, while validation means experimental work with its
typical fallacies. We describe these steps and the considerations that guide
them in the context of two applications of our strategy.
[Stem-loop drawing: hairpin loop with the bases C, A, G, U, G, H on a helix
of N-N base pairs that contains an unpaired C bulge]
Figure 7.2: Iron Responsive Element: eukaryotic consensus structure (H:
A, C, U). The cytosine bulge can be extended to an internal loop.
Proof of Concept
To validate the pure strategy, we focus on the investigation of structural
motifs in untranslated terminal regions (UTRs) of mRNAs. As mentioned before,
one of the best investigated regulatory motifs in UTRs is the Iron Responsive
Element. Figure 7.2 shows a consensus structure of eukaryotic IREs. It is a
specific stem-loop structure that consists of a helix region containing a
cytosine bulge (this bulge is sometimes extended to an internal loop) and a
loop of six bases with a consensus sequence. None of this knowledge is used
in our proof-of-concept study.
UTRs that are known to contain IREs were taken from the UTR database
[154]. We chose the ferritin 5'UTR from human and the succinate
dehydrogenase 5'UTR from Drosophila, which are in a size range of 200
nucleotides and both contain one IRE. The pure strategy was also applied to
the detection of an IRE in a larger UTR, namely the transferrin receptor
3'UTR, which consists of nearly 2500 nucleotides and contains five IREs.
First, the minimum free energy (mfe) structures of the UTRs were predicted.
Because we are interested in regulatory motifs, which can be seen as small
substructures in the complete UTR secondary structures, we calculated local
structure alignments with RNAforester.
In Figure 7.3 the local alignment of the 5'UTRs of the human ferritin
heavy chain mRNA and the Drosophila succinate dehydrogenase mRNA is
displayed. The IRE was detected as the most similar substructure in both
UTR secondary structures.
Figure 7.3: Local structural alignment of the 5'UTRs of the human ferritin
heavy chain mRNA and the Drosophila succinate dehydrogenase mRNA, extracted
as the best common motif from two structures comprising about 200 bases. The
stem regions of both IREs differ greatly in sequence.
Because it is not guaranteed that the energetically best structure is the
biologically correct one, suboptimal structures should always be investigated
too. In this example, we were successful by aligning only the two mfe
structures, which contained the IRE. In the investigation of larger
structures like the transferrin receptor 3'UTR, structure prediction becomes
a general problem. Structure predictions based on thermodynamic parameters
are only reliable for smaller structures, and even then, energetically
suboptimal structures have to be considered. Still, folding long sequences
and investigating the resulting structures does make sense for finding
smaller motifs in these larger structures. If a structural RNA motif has an
important biological function (e.g. the IRE), it should be very stable, and
we can expect to find it in the structure prediction of a long sequence even
if the folding of the complete sequence makes no sense biologically. To show
this, we have calculated the local structure alignment of the human ferritin
heavy chain 5'UTR (208 nucleotides) and the human transferrin receptor 3'UTR
(2464 nucleotides). We detect the IRE again (see Figure 7.4), although it
occurs at completely different positions in the two UTRs. Thus, using the
pure strategy for RNA motif detection, we are not restricted to small
structures.
Figure 7.4: Local structural alignment of the 5'UTR of the human ferritin
heavy chain mRNA (208 nucleotides) and the 3'UTR of the human transferrin
receptor mRNA (2464 nucleotides). The IRE was detected as the most similar
motif in both structures.
It is important to note that we are able to discover regulatory motifs
solely by their structural preservation, and independent of their sequence
conservation and position in the UTR. The further steps of the pure strat-
egy according to Figure 7.1 will be presented in the next section, where we
describe the prediction of a potential new regulatory RNA motif.
Prediction of a New Regulatory RNA Motif in the RAB1A 3'UTR
After demonstrating the viability of the pure strategy using the familiar IRE
motif, we ventured out to discover something new.
RAB1A is a ubiquitous protein with a role in endoplasmic reticulum
(ER) to Golgi transport [128]. In previous work, a high sequence conservation
in several vertebrate RAB1A 3'UTRs was shown [228]. Therefore, a function
of a structural motif in the posttranscriptional regulation of the RAB1A
mRNA is possible. The pure strategy was used for the prediction of potential
regulatory elements in the RAB1A 3'UTR. We started with the investigation
of the human and electric ray RAB1A 3'UTRs, which have much less sequence
conservation than the UTRs described in [228]. Figure 7.5 displays the mfe
structure of a part of the human RAB1A 3'UTR, and Figure 7.6 the mfe
structure of the electric ray RAB1 3'UTR.
Figure 7.5: mfe structure for a part of the human RAB1A 3'UTR.
For these structures, a local alignment was calculated using RNAforester.
Figure 7.7 shows this alignment. The detected stem-loop is highly conserved
not only in structure but also in sequence. Although RNAforester can make
use of such sequence similarity, the scoring contribution for base matches
was set to zero¹, thus relying purely on structure conservation. Analysis
of base pair probabilities confirmed the stability of this stem-loop in many
energetically suboptimal structures (data not shown).
For the computational validation of the predicted stem-loop, we performed
a database search. First, a search pattern was defined that should be general
enough to find as many occurrences as possible. At the same time, it should
be specific enough to find as few false positives as possible. Therefore, we
used significance evaluation based on the ADP method [58, 135] for the
definition of a search pattern. Figure 7.8 shows the pattern for the
predicted stem-loop.
¹ A small contribution of sequence similarity in comparison with structural
similarity is also feasible. Loop regions would then be aligned on the
sequence level without dominating the sequence-structure alignment (see
Section 4.5).
Figure 7.6: mfe structure for the electric ray RAB1A 3'UTR.
Figure 7.7: Local structure alignment of the RAB1A 3'UTRs from human and
electric ray. The resulting stem-loop is also highly conserved in sequence,
with few compensatory base exchanges in the helix.
Figure 7.8: RAB1 stem-loop pattern for the database searches. The adenine
bulge and the loop with the closing base pair were fixed in sequence, with
some variability in the second (U or C) and third (A or G) position.
The length of the helix was fixed, but the sequence was kept variable.
Bulges promote RNA-protein interactions; thus the adenine bulge is an
important element of the pattern. The loop sequence was predefined with
partial variations at the second and third position. Pattern design and
significance evaluation form an iterated process. For the resulting pattern,
we computed an expectation value of 0.8 hits in a random sequence with the
size and base composition of the 3'UTR collections of the UTRdb version 15.0
[154]. These collections contain about 47 million bases. For the database
search we used the program RNAMotif. Combined sequence and structure
motifs can easily be described within a descriptor file, and that file can
then be used for database searches. Our defined pattern for the RAB1
stem-loop was described with RNAMotif as follows:
parms
wc += gu;
descr
h5(tag=stem1, len=4)
ss(len=1, seq="a")
h5(tag=stem2,len=5,seq="g$")
ss(len=5, seq="cyrca")
h3(tag=stem2, seq="^c")
h3(tag=stem1)
The 5' side of the first helix with fixed length 4 is followed by the adenine
bulge and the 5' side of the second helix with fixed length 5 and a guanine
at the end. The loop region of length 5 has variable sequence at position 2
(uracil or cytosine) and position 3 (guanine or adenine). The 3' side of the
second helix must start with a cytosine. We used this pattern for searching
the 3'UTR collections of the UTRdb. The result of this search is summarized
in Table 7.1.
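To make the constraints of the descriptor concrete, the following Python sketch checks whether a 24-nucleotide window satisfies them. It is an illustration only and not a replacement for RNAMotif: the function names are hypothetical, and the sketch assumes strict pairing (Watson-Crick and G-U pairs, no mismatches) in both helices.

```python
import re

# Watson-Crick pairs plus G-U wobble pairs ("wc += gu" in the descriptor).
ALLOWED_PAIRS = {("a", "u"), ("u", "a"), ("g", "c"),
                 ("c", "g"), ("g", "u"), ("u", "g")}

STEM1, BULGE, STEM2, LOOP = 4, 1, 5, 5
PATTERN_LENGTH = 2 * STEM1 + BULGE + 2 * STEM2 + LOOP  # 24 nucleotides

LOOP_RE = re.compile(r"c[cu][ag]ca")  # "cyrca": y = c or u, r = a or g


def pairs_up(five_prime, three_prime):
    """True if the two strands form a perfect antiparallel helix."""
    return all((x, y) in ALLOWED_PAIRS
               for x, y in zip(five_prime, reversed(three_prime)))


def matches_rab1_pattern(window):
    """Check one window against the stem-loop pattern described above."""
    w = window.lower().replace("t", "u")
    if len(w) != PATTERN_LENGTH:
        return False
    stem1_5, bulge = w[0:4], w[4]
    stem2_5, loop = w[5:10], w[10:15]
    stem2_3, stem1_3 = w[15:20], w[20:24]
    return (bulge == "a"                      # ss(len=1, seq="a")
            and stem2_5.endswith("g")         # seq="g$" on the 5' side
            and LOOP_RE.fullmatch(loop) is not None
            and stem2_3.startswith("c")       # seq="^c" on the 3' side
            and pairs_up(stem2_5, stem2_3)
            and pairs_up(stem1_5, stem1_3))


def find_matches(sequence):
    """Return the 0-based start positions of all matching windows."""
    return [i for i in range(len(sequence) - PATTERN_LENGTH + 1)
            if matches_rab1_pattern(sequence[i:i + PATTERN_LENGTH])]
```

For example, the artificial sequence ggggaccccgcuacacggggcccc satisfies every constraint, whereas replacing its bulge adenine by any other base makes the check fail.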
UTRdb collection          Hits
Human 3'UTR               RAB1A
                          Sir2
Rodent 3'UTR              RAB1A (mouse)
                          Sir2, clone (similar to Sir2) (mouse)
Other Mammals 3'UTR       RAB1A (cat), RAB1A (opossum),
                          RAB1A (bull), RAB1A (quolls),
                          RAB1A (kangaroo)
Other Vertebrates 3'UTR   RAB1A (alligator), RAB1A (chicken),
                          RAB1 (electric ray)
Invertebrates 3'UTR       (no hits)
Virus 3'UTR               (no hits)
Table 7.1: Results of the database search for the RAB1A stem-loop pattern.
The stem-loop occurred only in RAB1A and Sir2 3'UTRs of vertebrates. There
were no hits in 5'UTRs.
Significantly more hits (13) than would have been expected in a random
database (0.8) were found, and these hits were exclusively to RAB1 and Sir2
3'UTRs. No hits were found in the 3'UTRs of invertebrates or viruses
or in any of the 5'UTR collections. This makes a biological function of the
stem-loop very likely.
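The comparison of 13 observed hits with 0.8 expected hits can be illustrated with a small calculation. The sketch below is not the ADP-based significance evaluation used above. It assumes independent, identically distributed bases for the expected number of pattern matches, and a Poisson model (an assumption made here for illustration) for the probability of seeing at least 13 hits.

```python
import math


def match_probability(fa=0.25, fc=0.25, fg=0.25, fu=0.25):
    """Probability that a random 24-mer matches the stem-loop pattern.

    Assumes independent bases with the given frequencies.
    """
    pair = 2 * (fa * fu + fg * fc + fg * fu)  # any WC or G-U pair
    closing = fg * fc                         # fixed g-c pair closing the loop
    bulge = fa                                # adenine bulge
    loop = fc * (fc + fu) * (fa + fg) * fc * fa  # loop "cyrca"
    # Stem 1 has 4 free pairs; stem 2 has 4 free pairs plus the closing pair.
    return pair ** 8 * closing * bulge * loop


def expected_hits(n_bases, **freqs):
    """Expected number of matching windows in a random sequence."""
    windows = max(n_bases - 24 + 1, 0)
    return windows * match_probability(**freqs)


def poisson_tail(k, mean, terms=200):
    """P(X >= k) for a Poisson-distributed X, summed term by term."""
    term = math.exp(-mean) * mean ** k / math.factorial(k)
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= mean / (i + 1)  # next Poisson term from the previous one
    return total


if __name__ == "__main__":
    print(expected_hits(47_000_000))  # uniform base composition
    print(poisson_tail(13, 0.8))      # chance of >= 13 hits at mean 0.8
```

With uniform base frequencies, this simple model gives about 1.1 expected hits in 47 million bases, the same order of magnitude as the value of 0.8 reported above for the actual base composition. Under the Poisson assumption, 13 or more hits at a mean of 0.8 would be extremely unlikely to arise by chance.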
Gel Mobility-Shift Experiments
To get further hints for the biological function of the stem-loop, we did
several laboratory experiments. Posttranscriptional gene regulation is often
the result of specific protein interactions with regulatory motifs in UTRs.
Therefore, protein binding to the predicted stem-loop is very likely if the
stem-loop has a biological function. We performed gel mobility-shift assays
to show such an RNA-protein interaction (see Figure 7.9).
Digoxigenin-labeled RNA oligos whose sequence matched the mouse (and
also human) stem-loop (Lane 1) were incubated with protein extract from
mouse kidney (Lanes 2-5). In Lanes 3-5, complex formation was reduced
using rising amounts of unlabeled RNA oligos. The control in Lane 6 (only
protein extract) shows that the band at the top is the result of a cross
reaction of the Digoxigenin antibody with the protein extract. Below this
band a small complex band can be seen, but most of the labeled oligos did
not run into the gel, and we assume that this is the result of a large
protein complex interacting with the RNA stem-loop. We also tried to isolate
the binding proteins using RNA oligos bound to magnetic beads. Here we found
many protein bands (data not shown), which can be another hint for a large
protein complex interacting with the stem-loop (even though some unspecific
RNA-protein interactions might occur). Although more experiments have to be
done to elucidate the exact biological function of the predicted stem-loop,
the high conservation of the stem-loop in different vertebrates, its
restriction mainly to RAB1 3'UTRs, and the first experimental hints for
specific protein interactions with the stem-loop let us assume that we found
a new RNA motif for posttranscriptional gene regulation.
Figure 7.9: Gel mobility-shift assay (lanes 1 to 6): lane 1: RNA-Oligo
(DIG-labeled), lane 2: RNA-Oligo (DIG) + protein extract, lane 3: RNA-Oligo
(labeled:unlabeled = 1:50) + protein, lane 4: RNA-Oligo (labeled:unlabeled
= 1:150) + protein, lane 5: RNA-Oligo (labeled:unlabeled = 1:300) + protein,
lane 6: only protein as negative control for DIG-detection; lanes 2-5
contained an excess of yeast RNA as an unspecific competitor.
The New Potential Regulatory Motif in the RAB1A 3'UTR. The
pure strategy was used for the prediction of a structural motif in the RAB1A
3'UTRs of human and electric ray. The predicted stem-loop is very stable
and highly conserved in sequence. Although we created a search pattern for
this stem-loop that is highly variable on the sequence level in the stem
region, a database search in the UTRdb showed hits only in RAB1A 3'UTRs of
10 vertebrates and in the Sirtuin 2 3'UTR of human and mouse. This
restriction of the stem-loop to only 2 different mRNAs makes a biological
function very likely. The gel mobility-shift assays revealed protein
interactions with the stem-loop, and in ongoing experiments we try to
identify the binding proteins to get more information about a possible
posttranscriptional gene regulation of the RAB1A mRNA. Because the RAB1A
protein is localized near the cis-Golgi membrane [174], we assume that the
RAB1A mRNA might be localized beforehand. Identification of the binding
proteins should give us hints for a role of the stem-loop in a possible
localization of the RAB1A mRNA.
Figure 7.10: Multiple alignment of the four 5'UTRs of human and mouse
ferritin heavy chain mRNA (5HSA015337, 5MMU002159) and SLC11A3
iron-transporter mRNA (5HSA023193, 5MMU011005). The alignment clearly
superposes the IRE elements, automatically marked red by our visualization.
7.2 Multiple Alignment
7.2.1 Motif Discovery
Multiple structure alignment can be used for searching regulatory structural
motifs common to several RNAs. One of the best investigated regulatory
motifs is the iron responsive element (IRE), which is a specific stem-loop
structure and can be found in the untranslated terminal regions (UTRs) of
many mRNAs. It regulates, for example, the translational efficiency of these
mRNAs according to the amount of iron in the cell [10]. The 5'UTRs of
human and mouse ferritin heavy chain mRNA and SLC11A3 iron-transporter
mRNA were taken from the UTR database [154]. These UTRs are known
to contain iron responsive elements. Their secondary structures were
predicted with mfold (Version 3.1) [244], and a multiple structure alignment
of the UTRs was calculated using RNAforester. In Figure 7.10, the resulting
alignment is displayed. The red colored stem-loop shows the conserved iron
responsive element that occurs in all structures. All other structural
elements, shown in black or gray, can only be found in some of the
structures. Thus, the described approach is useful for the detection of
common structural motifs in a set of RNA secondary structures. This example
works well because the element of interest resides in similar positions in
the globally aligned structures. Should these positions vary, a local
similarity comparison can be employed [77]. Unfortunately, this is
restricted to pairwise comparisons.
7.2.2 Consensus Structure Prediction
In this section, I exemplify the structure prediction strategy proposed in
Section 5.5, which is based on a multiple structure alignment of
thermodynamically predicted structures. Throughout this section, this
strategy is referred to as the structure alignment strategy. Its counterpart
is a strategy that first calculates a multiple sequence alignment and
then derives a consensus structure from covariance analysis and
thermodynamic considerations. This strategy is referred to as the sequence
alignment strategy.
Structure prediction strategies that build upon an initial multiple
sequence alignment are limited in their success if the sequence identity is
too high or too low. In the first case, the covariance of conserved base
pairs is low and the prediction is guided mainly by thermodynamics. In the
second case, the quality of the sequence alignment is often too low in a
biological sense and, hence, covariance cannot be inferred from the multiple
alignment. In particular, the objective function for a multiple sequence
alignment aims for maximization of identity and penalizes covariance.
According to McCutcheon & Eddy, for multiple sequence alignment based
strategies, the sweet spot is at approximately 75-85% sequence identity
[134]. Washietl & Hofacker gave a slightly lower bound, stating that "we can
conclude that there is obviously no need for structure alignments above 65%
pairwise identity" [223]. Thus, a good candidate family for exemplifying my
strategy should have a sequence identity below 70% to demonstrate that the
structure alignment strategy is suitable to predict a common fold. The
structure alignment strategy depends on structures predicted from single
sequences, and the prediction accuracy gets worse the longer the sequences
are. From personal experience, the sequence length should be less than 300.
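The two selection criteria above (average sequence identity below 70%, sequence length below 300) can be checked mechanically for a given seed alignment. The following Python sketch is one possible way to do this. The function names are hypothetical, and the identity definition is an assumption: identical residues divided by the number of columns in which at least one of the two sequences has a residue. Other definitions of pairwise identity exist and give slightly different values.

```python
from itertools import combinations

GAP_CHARS = "-._"


def pairwise_identity(row_a, row_b):
    """Identity of two aligned rows, ignoring columns that are gaps in both."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = matches = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a in GAP_CHARS and b in GAP_CHARS:
            continue  # column carries no information for this pair
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def average_pairwise_identity(rows):
    """Mean identity over all pairs of rows of a multiple alignment."""
    pairs = list(combinations(rows, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def ungapped_length(row):
    """Number of residues in an aligned row."""
    return sum(1 for c in row if c not in GAP_CHARS)


def is_candidate_family(rows, max_identity=0.70, max_length=300):
    """Apply the two criteria stated in the text to an alignment."""
    return (average_pairwise_identity(rows) < max_identity
            and all(ungapped_length(r) < max_length for r in rows))
```

For the toy alignment ACGU, ACGA, AC-U, the pairwise identities are 0.75, 0.75 and 0.5, so the average is 2/3 and the alignment passes both criteria.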
The RNA families for my experiments are taken from the Rfam database
(Version 6.1, August 2004) [67, 68]. Rfam is a large collection of multiple
sequence alignments and covariance models covering many common non-coding
RNA families. The covariance models in Rfam result from hand-crafted
multiple sequence alignments that were collected from serious publications.
These alignments are the seed alignments in the Rfam database. From several
interesting candidates, I chose two families of riboswitches, the Lysine
Riboswitch and the TPP Riboswitch, and a family of spliceosomal RNA, the
U1 spliceosomal RNA.
For the following experiments, I used RNAforester for the structure
alignment strategy. Note that the prediction of structures is done
automatically by RNAforester as proposed in Section 5.5, where the base
pairs of the prediction are weighted according to the base-pair
probabilities. I use the pure structure scoring scheme proposed in Section
4.5. The clustering threshold c is zero. According to the observations of
Gardner & Giegerich, pruning high entropy base pairs can improve the results
of structural comparison for predicted structures [55]. Therefore, I set the
minimum probability p that is required for base pairs to occur in the
predicted structures to 0.8. The cluster join threshold t is set to 0.7.
Except for the minimum probability p, these are the standard settings of
RNAforester. The command line for RNAforester for these settings is:
RNAforester -p -2d -pmin=0.8 -f=sequences.fas
where sequences.fas is the file containing the RNA sequences in Fasta
format. For the sequence alignment strategy, I calculate a multiple sequence
alignment using the online version of ClustalW from the European
Bioinformatics Institute [23]. I use the default parameters. The structure
prediction from the multiple alignment is done by RNAalifold, again using
default parameters [82]. The score of an RNAalifold prediction consists of
an energy term (first term) and a covariance term (second term). Recently,
Washietl & Hofacker provided a method to test a multiple sequence alignment
for the existence of an unusually stable prediction. Their method relates
the RNAalifold prediction for a given multiple sequence alignment to the
predictions for shuffled alignments. The significance is assessed in terms
of Z-scores². In their experiments, Z-scores below 3 have a false positive
rate below 1%. For the calculation of Z-scores, I used the Perl program
alifoldz.pl as provided in the supplemental material of [223]. alifoldz.pl
computes two scores, one for the forward and one for the backward strand of
the sequences. I did no further fine tuning of parameters for any of the
tools used for the following experiments.
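The idea of relating a score to the scores of shuffled alignments can be sketched in a few lines of Python. This is a simplified illustration and not the procedure implemented in alifoldz.pl: the column shuffling used here is an assumption, and score_fn is a hypothetical placeholder for any function that maps an alignment to a number, for example a wrapper around an RNAalifold call.

```python
import random
import statistics


def z_score(value, background):
    """(value - mean) / standard deviation of the background sample."""
    mean = statistics.mean(background)
    sd = statistics.stdev(background)  # sample standard deviation
    if sd == 0:
        raise ValueError("background has zero variance")
    return (value - mean) / sd


def shuffle_columns(alignment, rng):
    """Return a copy of the alignment with its columns randomly permuted."""
    columns = list(zip(*alignment))
    rng.shuffle(columns)
    return ["".join(row) for row in zip(*columns)]


def alignment_z_score(alignment, score_fn, n_shuffles=100, seed=0):
    """Z-score of the alignment's score against shuffled alignments."""
    rng = random.Random(seed)  # fixed seed for reproducible shuffles
    background = [score_fn(shuffle_columns(alignment, rng))
                  for _ in range(n_shuffles)]
    return z_score(score_fn(alignment), background)
```

Permuting whole columns keeps the composition of every sequence and the content of every column unchanged while destroying the order in which columns can pair. If score_fn returns a folding energy, an unusually stable alignment scores lower than its shuffled versions and therefore receives a negative Z-score.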
² A Z-score is a measure of the distance from the mean of a distribution,
normalized by the standard deviation of the distribution. Mathematically:
Z-score = (value - mean) / standard deviation. Z-scores are useful for
quantifying how different from normal a recorded value is. Z-scores are
particularly useful when combining or comparing different features or
measures. A Z-score of 0 represents the mean. Assuming a normal
distribution, about 68%, 95% and 99.7% of all values are expected by chance
to fall within 1, 2 and 3 standard deviations of the mean, respectively. In
short, Z-scores that are higher in absolute value are likely to be more
statistically significant in their deviation from the mean.
Lysine Riboswitch
Riboswitches are metabolite binding domains within certain messenger RNAs
that serve as precision sensors for their corresponding targets. Allosteric
rearrangement of mRNA structure is mediated by ligand binding, and this
results in modulation of gene expression. This family includes riboswitches
that sense lysine in a number of genes involved in lysine metabolism [126].
The 48 sequences from the Rfam seed alignment for the Lysine Riboswitch
(Accession number: RF00168) have an average length of 181.3 and an average
identity of 48%. The published consensus structure is shown in Figure 7.11.
RNAforester outputs six clusters that contain more than one structure. The
consensus structure drawings for these clusters are shown in Figures
7.12-7.17. The structure in Figure 7.12 is in good correspondence with the
published one. The clusters in Figures 7.13-7.17 share at most smaller
regions with the published structure. Apparently, the relative sum-of-pairs
score SP_REL for the clusters does not correlate with the reliability of the
predictions. However, looking at the sequence level, the consensus structure
in Figure 7.12 has a considerable amount of sequence variation, while the
others are highly sequence conserved. I identify correct predictions based
on the following hypothesis: The more structurally conserved and the less
sequence conserved a multiple alignment is, the more reliable are the
predicted structures. In contrast to the sequence alignment strategy, which
uses covariation to predict structures, in the structure alignment method
thermodynamic predictions are validated by covariance. So far, I only
consider sequence identity to identify the best cluster.
Figure 7.11: Consensus structure of the Lysine riboswitch as published in [126].
Figure 7.12: Lysine Riboswitch. Consensus structure of 18 sequences as predicted
by RNAforester. The sum-of-pairs score SP for this cluster is 436.177.
Figure 7.13: Lysine Riboswitch. Consensus structure of 7 sequences as predicted
by RNAforester. The sum-of-pairs score SP for this cluster is 485.696.
Figure 7.14: Lysine Riboswitch. Consensus structure of 7 sequences as predicted
by RNAforester. The sum-of-pairs score SP for this cluster is 312.722.
Figure 7.15: Lysine Riboswitch. Consensus structure of 5 sequences as predicted
by RNAforester. The sum-of-pairs score SP for this cluster is 349.197.
Figure 7.16: Lysine Riboswitch. Consensus structure of 3 sequences as predicted
by RNAforester. The sum-of-pairs score SP for this cluster is 562.263.
Figure 7.17: Lysine Riboswitch. Consensus structure of 2 sequences as predicted
by RNAforester. The sum-of-pairs score SP for this cluster is 414.111.
An RNAforester structure alignment produces a sequence alignment as a
by-product³. In the following, I compare the results of the structure
alignment strategy to results of the sequence alignment strategy. Figure
7.18 shows the RNAalifold prediction for the hand-crafted seed alignment
from the Rfam database. This prediction is in good correspondence with the
published one. Figure 7.19 shows the prediction for the ClustalW alignment
of the seed sequences. Clearly, the sequence alignment cannot arrange the
bases such that RNAalifold can derive a common structure. Since RNAforester
does a clustering of the structures, I also compare the RNAalifold
prediction for the sequence alignment derived from RNAforester's best
alignment (Figure 7.12) and for the ClustalW alignment of the sequences that
belong to this cluster. The results are shown in Figure 7.20 and Figure
7.21. In Figure 7.22, I show the RNAalifold prediction for the Rfam seed
alignment restricted to the sequences belonging to RNAforester's best
cluster.
I do the same experiments for the TPP Riboswitch and the U1 spliceosomal
RNA and then discuss the results.
³ In the extended forest representation, the sequence alignment is the
alignment of leaf nodes.
A
A
U
U
G
A
G
G
U
A
G A
G
G
C
_
G C G U U A
A
U
C
_
A U G A
G
U
A
_
_
G C U U U U C
G
G
_
_
_
_
_
_ _ _
_
A
G
G
G
U
G
A
G
U
A
C
G
A
U
G
_
_ _ _
_
_
_
_
_
_
A
A
G A A A A G U
_
G
A
A
A G G
_
G
A
G
_
C A U C G C
C
G
A A
G
U
G
A
U
U
A
A
A
A
G
G
_
_
_
C G
C
A
_
_
A
C
U
U
U
U
A
U
U
_
U
G
U
U
G
G
G U
U
U
G
_U
A
U
U
_
_
_
G
A
A
U
A _
G
C
U
G
U
A
A
G
A C
U
G
U
C
A
C
A
A
U
A
_
_ _ U
U
A
_
_
_
_
_
_
_
_
_
_ _ _ _
_
_
_
_
_
_
A
A
U
U
G
U
G
G
A
G C G C
U
A
C
C
G
U
G
U
G
A
C G
C G
C G
C
G
U
U
A
U
C
A
U
G
G
C
G
U
U
G
U
A
U
U
A
U
U
A
U
A
U
A
U
A
C
G
G
U
A
U
A
U
U
G
G
U
U
A G
U
U
A
U
A
A
U
Figure 7.18: Lysine Riboswitch. RNAalifold prediction for the seed alignment
taken from the Rfam database. RNAalifold score: 37.70 = 22.44 +15.26, Z:
3.1(2.3).
_
A
A
U
U
G
A
G
G
U
A
G
A
G G C
G
C
G
A
U
G
_
A
U
C
A
U
G
_
A
G
U A A U C
U
U
U
C
A
G
A
A
G
C
U
G
A
_
_
_
_
G
G
A
C
C
U
G
U
G
A
U
G
A
A
U A A U G A
A
_
A
G
G
G
G
A
G
C
A
U
C
G C C
G
A A G
U
G
A
A
_
_
_
_
A
A
A
A
U
G
C
U
C
G
C
A
A A U U
U
G
A
U
U
U
G
U
U
G
G
G
U
A
U
G
_
U
A
U
U
G
A
A
U
A U
G
A
G
_
C
U
G
G
A
C
U
G U C
A
C
A
A
U
A
A
_
_
_
_
_
_
_
_
_
_
_
_ _ _
C
U
C
U
G
U
G
G
A
G
C
G
C
U
A
C C U
A
A
U
U
G
_
_
_
_
_
_
_
_
_
_
_
_ _ _
_
_
_
_
_
_
C
G
U
A
C
G
G
C
G
U A
A G
U
Figure 7.19: Lysine Riboswitch. RNAalifold prediction for the ClustalW align-
ment of seed sequences taken from the Rfam database. RNAalifold score: 2.68 =
0.50 +2.19, alifoldz score: n.a.
7.2 Multiple Alignment 163
_
A
A
U
U
G
A
G
G
U
A
G
_
_
_
A G G
C
G C G G U
_
A
A
U
C A A
G
A
G
U
A
G U C G U U C G
_
_
G
A
_ _ _
G
_
_
G
_
A
U
G
A
G
C
A C C
A
U
_
G
A
A G A A U G G U
G
A
A
A
G G _
G
A
U
_
U
A U C G C
C
G
A A
G
U
G
A
A
U
A
A
_
A
A
A
U
G
_ U
C
A
A
A
U
U
C
U
G
U
U
U
G
_
C
U
_
_
_
G
G
G
G
U
U
G
_
U_
A
U
_
C G
A
A
U
A G
G
U
A
C
A
A
C
A
C
U
G
U
C
A
C _
G
A
A
A _
U
_
_
_
_
_
A _
_
A
_
_
U
C _
G
U
G
G
A
G
A
G
C
U
A
_
C
C
G
U
G
G G A
_
_
C
G
U
G
A
U
U
G
G
C
A
G
G
U
U
G
U
A
G
U
U
A
C
A
U
A
C
G
A
U
C
G
C
G
U
A
Figure 7.20: Lysine Riboswitch. RNAalifold prediction for the RNAforester se-
quence alignment from the consensus structure in Figure 7.12. RNAalifold score:
25.65 = 12.04 +13.62, alifoldz score: 1.9(2.7).
Figure 7.21: Lysine Riboswitch. RNAalifold prediction for the ClustalW
alignment for the sequences belonging to the consensus structure in Figure 7.12.
RNAalifold score: 27.72 = 19.08 + 8.65, alifoldz score: 2.3(1.1).
Figure 7.22: Lysine Riboswitch. RNAalifold prediction for the Rfam seed
alignment restricted to sequences belonging to the consensus structure in
Figure 7.12. RNAalifold score: 44.05 = 25.75 + 18.29, alifoldz score: 6.8(6.6).
Figure 7.23: Consensus structure of TPP riboswitch as published in [164].
TPP Riboswitch (THI Element)
Vitamin B1 in its active form, thiamin pyrophosphate (TPP), is an essential
coenzyme that is synthesized in bacteria by coupling pyrimidine and thiazole
moieties. The previously detected thiamin-regulatory element, the thi box, was
extended, resulting in a new, highly conserved RNA secondary structure, the
THI element, which is widely distributed in eubacteria and also occurs in
some archaea [164].
The 141 sequences from the Rfam seed alignment for the TPP riboswitch
(Accession number: RF00059) have an average length of 104.9 and an average
identity of 52%. Figure 7.23 shows the consensus structure as published in
Rfam. Figures 7.24-7.29 show the structure predictions, analogous to the
experiments for the Lysine riboswitch.
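The two summary statistics quoted for a seed alignment (average length and average identity) can be illustrated with a minimal sketch. It assumes that identity is counted pairwise over columns in which neither sequence has a gap and then averaged over all sequence pairs; Rfam's own definition may differ in detail. The function names and the toy rows are hypothetical and serve only as an illustration.

```python
from itertools import combinations

GAP_CHARS = "-._"  # assumption: these characters mark gaps in the alignment rows


def ungapped_length(row):
    """Number of non-gap characters in one aligned row."""
    return sum(ch not in GAP_CHARS for ch in row)


def pairwise_identity(a, b):
    """Share of identical characters among columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns with a gap in either row
        compared += 1
        matches += x == y
    return matches / compared if compared else 0.0


def alignment_summary(rows):
    """Return (average ungapped length, average pairwise identity)."""
    avg_len = sum(ungapped_length(r) for r in rows) / len(rows)
    ids = [pairwise_identity(a, b) for a, b in combinations(rows, 2)]
    avg_id = sum(ids) / len(ids) if ids else 1.0
    return avg_len, avg_id


toy = ["GAUC-A", "GAUCGA", "GCUC-A"]  # toy rows, not Rfam data
avg_len, avg_id = alignment_summary(toy)
print(f"average length {avg_len:.1f}, average identity {avg_id:.0%}")
```

Other conventions, such as dividing by the length of the shorter sequence, give somewhat different percentages, so values computed this way are only comparable among alignments scored with the same convention.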
Figure 7.24: TPP Riboswitch. Consensus structure of 31 sequences as predicted
by RNAforester.
Figure 7.25: TPP Riboswitch. RNAalifold prediction for the seed alignment taken
from the Rfam database. RNAalifold score: 12.11 = 9.26 + 2.85, alifoldz score:
0(1.4).
Figure 7.26: TPP Riboswitch. RNAalifold prediction for the ClustalW alignment
of seed sequences taken from the Rfam database. RNAalifold score: 5.64 =
4.33 + 1.31, alifoldz score: 0.9(0.9).
Figure 7.27: TPP Riboswitch. RNAalifold prediction for the RNAforester sequence
alignment for the consensus structure in Figure 7.24. RNAalifold score:
4.98 = 3.35 + 1.64, alifoldz score: 1.4(1.1).
Figure 7.28: TPP Riboswitch. RNAalifold prediction for the ClustalW alignment
for the sequences belonging to the consensus structure in Figure 7.24. RNAalifold
score: 6.81 = 5.29 + 1.53, alifoldz score: 1.7(1.2).
Figure 7.29: TPP Riboswitch. RNAalifold prediction for the Rfam seed alignment
restricted to sequences belonging to the consensus structure in Figure 7.24.
RNAalifold score: 14.56 = 12.50 + 2.06, alifoldz score: 18.7(12.3).
Figure 7.30: Consensus structure of U1 spliceosomal RNA as published in [107].
U1 spliceosomal RNA
U1 is a small nuclear RNA (snRNA) component of the spliceosome (involved
in pre-mRNA splicing). Its 5' end forms complementary base pairs with
the 5' splice junction, thus defining the 5' donor site of an intron. There are
significant differences in sequence and secondary structure between metazoan
and yeast U1 snRNAs, the latter being much longer (568 nucleotides as
compared to 164 nucleotides in human). Nevertheless, secondary structure
predictions suggest that all U1 snRNAs share a common core [107].
The 54 sequences from the Rfam seed alignment for the U1 spliceosomal
RNA (Accession number: RF00003) have an average length of 154.9
and an average identity of 59%. This family does not contain the larger
yeast sequences. Figure 7.30 shows the consensus structure as published in
Rfam. Figures 7.31-7.36 show the structure predictions, analogous to the
experiments for the Lysine riboswitch.
Figure 7.31: U1 RNA. Consensus structure of 14 sequences as predicted by
RNAforester.
Figure 7.32: U1 RNA. RNAalifold prediction for the seed alignment taken from
the Rfam database. RNAalifold score: 23.27 = 14.69 + 8.58, alifoldz score:
1.0(0.4).
Figure 7.33: U1 RNA. RNAalifold prediction for the ClustalW alignment of seed
sequences taken from the Rfam database. RNAalifold score: 5.02 = 1.98 +
3.05, alifoldz score: 0.9(n.v.).
Figure 7.34: U1 RNA. RNAalifold prediction for the RNAforester sequence
alignment for the consensus structure in Figure 7.31. RNAalifold score: 36.50 =
31.85 + 4.65, alifoldz score: 5.3(1.3).
Figure 7.35: U1 RNA. RNAalifold prediction for the ClustalW alignment for the
sequences belonging to the consensus structure in Figure 7.31. RNAalifold score:
51.73 = 45.54 + 6.19, alifoldz score: 5.3(2.6).
Figure 7.36: U1 RNA. RNAalifold prediction for the Rfam seed alignment
restricted to sequences belonging to the consensus structure in Figure 7.31.
RNAalifold score: 34.18 = 27.36 + 6.82, alifoldz score: 7.8(4.2).
Discussion
Evidently, sequence alignment is not a successful strategy for predicting a
consensus structure for distantly related RNA families (here, the complete
sets of Rfam seed sequences). For the structure alignment strategy, the
RNAforester cluster with the highest sequence diversity was always in good
correspondence with the published consensus structure. The clusters that are
not shown for the TPP riboswitch and the U1 spliceosomal RNA were either
diverse in their sequence and similar to the published structure (see
footnote 4), or similar in their sequence with a structural topology that is
different from the published one. It seems unlikely that different sequences
fold into a similar structure just by chance. Interestingly, for the
hand-crafted seed alignments taken from the Rfam database, RNAalifold was not
able to repredict all stems of the consensus structure for the TPP riboswitch
and the U1 spliceosomal RNA.
To assess the quality of the sequence alignment that can be derived from
RNAforester's best cluster, I ran RNAalifold on the sequence alignment that
was derived from the structural alignment. Additionally, I considered the
RNAalifold predictions for the ClustalW alignment and the restricted seed
alignment of the sequences belonging to this cluster. The predictions from
the ClustalW alignments achieved a quality similar to that of the predictions
from the RNAforester-derived sequence alignments. However, the RNAalifold
predictions detected different parts of the consensus structure. In particular,
for the Lysine riboswitch, a stem that was detected with the RNAforester
sequence alignment was not detected with the ClustalW alignment, and vice
versa (see Figures 7.20 and 7.21). It remains to be examined whether the
improved quality of the ClustalW alignments is simply due to a reduced
sequence identity or to a good pre-selection by RNAforester. In contrast to the
unrestricted seed alignments, the restricted seed alignments let RNAalifold
predict consensus structures that are in almost perfect correspondence with
the published ones.
Footnote 4: Tuning the RNAforester parameters could join them into a larger cluster.
My initial strategy was to use the z-scores as a measure of the quality of
the alignment. Contrary to my expectation, the z-scores did not strongly
identify the (unrestricted) seed alignments as alignments of functional
non-coding RNA sequences (a z-score below 4 would be a good indicator). The
restricted seed alignments always achieved negative scores that gave strong
evidence for a functional RNA.
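As an illustration of the general idea behind such a z-score, the following minimal sketch compares the score of an alignment with the scores of randomized versions of it. It makes several assumptions that are not taken from this thesis: lower scores (for example more negative energies) are treated as better, randomization is done by shuffling whole alignment columns, and `score` is a hypothetical caller-supplied function standing in for a consensus folding program. It does not reproduce the procedure implemented in alifoldz.

```python
import random
import statistics


def z_score(native, background):
    """Standard score of `native` relative to a list of background scores."""
    mean = statistics.mean(background)
    spread = statistics.stdev(background)  # sample standard deviation
    if spread == 0:
        raise ValueError("background scores have no variance")
    return (native - mean) / spread


def shuffle_columns(rows, rng):
    """Permute whole alignment columns; each column's content stays intact."""
    columns = list(zip(*rows))
    rng.shuffle(columns)
    return ["".join(chars) for chars in zip(*columns)]


def alignment_z_score(rows, score, samples=100, seed=0):
    """z-score of score(rows) against `samples` column-shuffled alignments.

    `score` is a hypothetical callable mapping a list of aligned rows to a
    number; it is an assumption of this sketch, not an existing API.
    """
    rng = random.Random(seed)
    background = [score(shuffle_columns(rows, rng)) for _ in range(samples)]
    return z_score(score(rows), background)
```

Under these assumptions, a strongly negative value means that the original alignment scores far better than its shuffled counterparts, which is the kind of evidence the text refers to.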
Alignments of predicted minimal free energy structures can rightfully be
criticized, because structure prediction may produce optimal structures
quite different from the (suboptimal) native structure. The use of sequence
similarity, if sufficient, is advocated as a means to avoid this dilemma.
However, my experiments contribute two new considerations to this issue:
• They demonstrate an effect that, at first sight, is paradoxical:
  strong sequence similarity can mislead the determination of the
  consensus structure. This happens because very similar sequences tend to
  fold into a similar structure, be it wrong or right.
• They demonstrate that a multiple structure alignment, when applying
  the cutoff value in the clustering step, may produce meaningful
  alignments even in the presence of incorrect predictions.
As a consequence, a new approach to consensus construction becomes feasible,
where first a good candidate consensus (or several) is constructed and
subsequently, sequences that do not fall into a consensus cluster are refolded,
given the candidate consensus as a target structure.
Chapter 8
Conclusions
In this thesis, I have analyzed the tree alignment model for the comparison
of RNA secondary structures. I gave a systematic generalization of the
alignment model from strings to trees and forests. I provided carefully
engineered dynamic programming implementations using dense, two-dimensional
tables, a choice that considerably reduces the space requirement. I introduced
local similarity problems on forests and provided efficient algorithms that
solve them. Since the problem of aligning trees occurs in many different
disciplines, I decoupled my algorithmic contributions from the problem of
aligning RNA secondary structures. For instance, using my algorithms I could
contribute to addressing problems in the field of robotics [48, 165].
However, the main focus of this thesis is to provide algorithms to analyze
RNA secondary structures. To improve the biological semantics of aligning
RNA secondary structures as forests, I introduced an extended forest
representation and a refined forest alignment model. The local similarity
variants that were introduced on an abstract level of forests turned into
local similarity notions for RNA secondary structures. The joint work with
Thomas Töller showed that local structural motifs in RNA molecules can be
successfully detected using my algorithms [203]. To make the results of
structure comparison visually available, I invented a 2d-plot for RNA secondary
structure alignments that highlights the differences and similarities of
structures. This visualization is more intuitive than comparing abstract
representations of RNA secondary structures, e.g. dot-plots or mountain plots,
and makes it efficient to present results from structure comparison.
I generalized the forest alignment model to the case of multiple forests
and, thus, made it applicable to the comparison of multiple RNA secondary
structures. My approach is a faithful generalization of established techniques
used in sequence comparison. All the experience that has accumulated for
multiple sequence alignments therefore carries over to RNA secondary structures.
I generalized the idea of sequence profiles to forest profiles, resulting in a
profile of RNA secondary structures which groups different RNA secondary
structures into a single data structure. To visualize a common consensus
structure, I proposed a 2d-plot visualization that, in addition to structural
similarity, can display the sequence diversity of the aligned structures. Based
on these techniques, I proposed a consensus structure prediction strategy for
families of RNA molecules that have low sequence homology. I demonstrated
that this is a promising approach by successfully predicting the consensus
structures for RNA families with low sequence conservation taken from the
Rfam database.
I implemented all algorithms presented in this thesis in the RNA structure
comparison tool RNAforester. RNAforester is designed in the spirit of
the programs in the Vienna RNA package and will be distributed in the
forthcoming Vienna RNA Package Version 1.6. The online version and the
stand-alone application are publicly available at
http://bibiserv.techfak.uni-bielefeld.de/rnaforester.
Future Work. Several research activities follow directly from the
contributions in this thesis:
• The success of the structure prediction strategy that was presented
  in this thesis depends largely on the quality of thermodynamic
  predictions. It is well known that the biologically meaningful structure
  often hides in the space of suboptimal solutions. I argue that the results
  of my structure prediction strategy can be improved significantly by
  considering suboptimal solutions. However, the exponential number
  of suboptimal solutions prohibits a straightforward strategy. Recently,
  Giegerich et al. provided the structure prediction program RNAshapes,
  based on thermodynamics, that compartmentalizes the suboptimal
  solution space into different shapes [59]. A combination of RNAshapes
  and RNAforester is the logical next step.
• That locally similar structures can be detected with RNAforester
  without prior knowledge was demonstrated in this thesis. The application
  of my algorithms on a genome-wide scale is a challenging task. Locally
  stable structures could be predicted in genome-wide surveys using
  RNALfold [85], and the resulting data could be analyzed for locally
  conserved structures using RNAforester, after it has been preprocessed
  for length and energy constraints. Thorough statistics have to be done
  to rank the locally conserved structures and distinguish biologically
  relevant conservations from those that are found just by chance.
• A well known problem of the progressive strategy is that errors made
  early in an alignment cannot be rectified when further sequences are
  added. Notredame et al. present a strategy that can minimize this
  effect in the multiple sequence alignment tool T-Coffee [149]. Instead
  of using substitution scores for the calculation of pairwise alignments,
  they propose a position-dependent scoring. A primary library gathers
  information from heterogeneous sources for pairwise alignments, such
  as sequence alignments (global and local), structural alignments and
  manual alignments. These sources are combined in an extended library
  such that each pair of characters in the sequences has a
  position-specific weight. The pairwise alignments are then optimized
  according to this extended library. Misplaced gaps in the earlier steps
  of the progressive calculation become less likely, which significantly
  improves the quality of the alignment in comparison to ClustalW and other
  tools. An analogous strategy for trees could further improve the quality
  of multiple tree alignments.
• Various tree distances have been discussed in the introductory chapter
  of this thesis. However, a thorough analysis of their quality for RNA
  secondary structures is missing. It would be interesting to observe
  whether, and under which circumstances, the distances can be replaced
  by each other and provide similar results. The complexities of the tree
  distances depend on different parameters of the tree structure, e.g. the
  number of nodes, the depth, the number of leaves, and the degree. All
  these parameters are known and, thus, the computational effort can be
  determined in advance. In the end, a flexible strategy could always
  choose the cheapest model.
• Today, the detection of unknown non-coding RNA from genomic data is
  one of the biggest challenges in molecular biology. First successes were
  achieved with tools that infer a structure from a (multiple) sequence
  alignment by thermodynamic and phylogenetic information, comparing
  the result of the predictions with randomized data [163, 223]. However,
  there is an inherent problem: If the sequences are highly conserved,
  the alignment is good but the covariance of base-paired regions is low.
  Thus, the thermodynamic considerations dominate the structure
  prediction. Contrary to what was stated by Maizel and coworkers, energy
  seems not to be a good discriminator to separate structural from
  non-structural RNAs [19, 162]. If the sequence conservation is too low,
  regions of covariance are not aligned accurately and the alignment can
  mislead the predictions. As in my structure prediction strategy, I am
  thinking about a strategy that goes the other way around: I could start
  with thermodynamic considerations and then use phylogenetic information
  to estimate the reliability of predictions.
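The remark that highly conserved sequences carry little covariance can be made concrete with a small sketch. Mutual information between two alignment columns is used here as a generic stand-in for a covariation measure; it is not the covariance term computed by RNAalifold, and the function name and toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    Each column is given as a string with one character per sequence.
    Gap characters are treated like any other symbol (an assumption
    made for simplicity).
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must cover the same sequences")
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


# A perfectly conserved G-C pair: valid base pair, but no covariation signal.
conserved = mutual_information("GGGG", "CCCC")
# Four different pairs with compensatory changes: maximal signal.
covarying = mutual_information("GCAU", "CGUA")
print(conserved, covarying)
```

In the conserved case the measure is zero although every sequence can form the pair, which is why a structure prediction on such an alignment has to rely on thermodynamics alone.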
A multiple and local structure alignment program will become a basic tool,
just like its sequence counterparts. With RNAforester, I provide a program
that can be embedded in a larger framework of structure analysis,
contributing to the solution of problems beyond the ones I addressed.
Bibliography
[1] K Akoi. A top-down algorithm to compute the distance between trees.
Trans. IECE, 66:4956, 1983. (in Japanese).
[2] J. Alber, J. Gramm, J. Guo, and R. Niedermeier. Computing the
similarity of two sequences with nested arc annotations. Theoretical
Computer Science, 312:337-358, 2004.
[3] L. Alonso and R. Schott. On the tree inclusion problem. Acta Infor-
matica, 37:653670, 2001.
[4] V. Bafna, S. Muthukrishnan, and R. Ravi. Computing similarity be-
tween RNA strings. In Proc. of CPM95, LNCS 937, pages 116, 1995.
[5] D. Baltimore. RNA-dependent DNA polymerase in virions of RNA
tumour viruses. Nature, 226:12091211, 1970.
[6] D. Barash. Second eigenvalue of the Laplacian matrix for predicting
RNA conformational switch by mutation. Bioinformatics, 20(12):1861-1869,
2004.
[7] D.T. Barnard, G. Clarke, and N. Duncan. Tree-to-tree Correction for
Document Trees. Technical report, Queen's University, Department
of Computing and Information Science, Kingston, Ontario K7L 3N6,
Canada, January 1995.
[8] R.T. Batey and J.A. Doudna. The parallel universe of RNA folding.
Nature struct. Biol., 5:337340, 1998.
[9] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
[10] P. Billi. Ordered Tree Edit Distance with Merge and Split Operations.
Technical report, IT University of Copenhagen, September 2003. ISSN
1600-6100, ISBN 87-7949-048-4.
[11] P. Billi. Tree Edit Distance, Alignment Distance and Inclusion. Techni-
cal report, IT University of Copenhagen, 2003. ISSN 1600-6100, ISBN
87-7949-032-8.
[12] P. Bonizzoni and G. Della Vedova. The Complexity of Multiple Se-
quence Alignment with SP-Score that is a Metric. Theoretical Com-
puter Science, 259(1):6379, 2001.
[13] P. Brion and E. Westhof. Hierarchy and dynamics of RNA folding. Annu.
Rev. Biophys. Biomol. Struct., 26:113137, 1997.
[14] R.E. Bruccoleri and G. Heinrich. An improved algorithm for nucleic
acid secondary structure display. Comp. Appl. Biosci., 4:167173, 1988.
[15] H. Carillo and D.J. Lipman. The multiple sequence alignment problem
in biology. SIAM Journal of Applied Mathematics, 48:10731082, 1988.
[16] J.C. Carrington and V. Ambros. Role of microRNAs in plant and animal
development. Science, 301:336338, 2003.
[17] T.R. Cech. RNA as an enzyme. Scientific American, 5:6475, 1986.
[18] G.J.S. Chang, G. Patel, L. Relihan, and J.T.L. Wang. A graphical en-
vironment for change detection in structured documents. In R. Baeza-
Yates et al., editor, Proceedings of the 21st Annual International Com-
puter Software and Applications Conference (COMPSAC97), pages
536541, Washington, DC, August 1997.
[19] J.H. Chen, B. Shapiro, K.M. Currey, and J.V. Maizel. A computational
procedure for assessing the significance of RNA secondary structure.
Bioinformatics, 6:718, 1990.
[20] W. Chen. More efficient algorithm for ordered tree inclusion. Journal
of Algorithms, 26:370385, 1998.
[21] W. Chen. New algorithm for ordered tree-to-tree correction problem.
Journal of Algorithms, 40:135158, 2001.
[22] Y.C. Cheng and S.Y. Lu. Waveform correlation by tree matching. In
Proceedings of the 9th Annual International Computer Software and
Application Conference, pages 299305, 1985.
[23] R. Chenna, H. Sugawara, T. Koike, R. Lopez, T.J. Gibson, D.G. Higgins,
and J.D. Thompson. Multiple sequence alignment with the clustal
series of programs. Proc. Nat. Acad. Sci. U.S.A., 31(13):34973500,
2003.
[24] F. Chetouani, P. Monestie, P. Thebault, C. Gaspin, and B. Michot.
ESSA: an integrated and interactive computer tool for analysing RNA
secondary structure. Nucleic Acids Res., 25(17):35143522, 1997.
[25] C. Chevalet and B. Michot. An algorithm for comparing RNA sec-
ondary structures and search for similar substructures. Comp. Appl.
Biosci., 8(3):215225, 1992.
[26] D.K.Y. Chiu and T. Kolodziejczak. Inferring consensus structure from
nucleic acid sequences. Comp. Appl. Biosci., 7:347352, 1991.
[27] S. Chowdhury, C. Ragaz, E. Kreuger, and F. Narberhaus.
Temperature-controlled Structural Alterations of an RNA Thermome-
ter. J. Biol. Chem., 278:4791547921, 2003.
[28] R. Cole, R. Hariharan, and P. Indyk. Tree pattern matching and subset
matching in deterministic O(n log^3 n) time. In ACM-SIAM Symposium
on Discrete Algorithms (SODA99), pages 245254, 1999.
[29] G. Collins, S.Y. Le, and K. Zhang. A new algorithm for computing
similarity between RNA structures. Information Sciences, 139:5977,
2001.
[30] F. Crick. Central Dogma of Molecular Biology. Nature, 227:561563,
1970.
[31] M.O. Dayhoff, R.M. Schwartz, and B.C. Orcutt. A model of evolutionary
change in proteins. In M.O. Dayhoff, editor, Atlas of Protein
Sequence and Structure, volume 5, pages 345352. National Biomedical
Research Foundation, 1978.
[32] P. de Rijk, Y.V. de Peer, S. Chapelle, and R. de Wachter. Database
on structure of small ribosomal subunit RNA. Nucleic Acids Res.,
22:34953501, 1994.
[33] P. de Rijk and R. de Wachter. RnaViz, a program for the visualisation
of RNA secondary structure. Nucleic Acids Res., 25(22):46794684,
1997.
[34] P. de Rijk, J. Wuyts, and R. de Wachter. RnaViz2: an improved
representation of RNA secondary structure. Bioinformatics, 19(2):299-300,
2003.
[35] R. Dirks and N. Pierce. A partition function algorithm for nucleic acid
secondary structure, including pseudoknots. Journal of Computational
Chemistry, 24:16641677, 2003.
[36] P. Doty, H. Boedtker, J.R. Fresco, R. Haselkorn, and M. Litt. Sec-
ondary structure in ribonucleic acids. In Proc. Nat. Acad. Sci. U.S.A.,
volume 45, pages 482499, 1959.
[37] D. Dowell and S.R. Eddy. Evaluation of several lightweight stochastic
context-free grammars for RNA secondary structure prediction. BMC
Bioinformatics, 5(71), 2004.
[38] M. Dubiner, Z. Galil, and E. Magen. Faster Tree Pattern Matching.
In Proc. of FOCS90, pages 145150, 1990.
[39] S. Dulucq and L. Tichit. RNA secondary structure comparison: exact
analysis of the Zhang-Shasha tree edit-algorithm. Theoretical Computer
Science, 306:471484, 2003.
[40] S. Dulucq and H. Touzet. Analysis of Tree Edit Distance Algorithms.
In Proc. of CPM03, LNCS 2676, pages 8395. Springer-Verlag, 2003.
[41] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, editors. Biological
Sequence Analysis. Cambridge University Press, UK, 1998.
[42] S. Eddy and R. Durbin. RNA sequence analysis using covariance mod-
els. Nucleic Acids Res., 22(11):20792088, 1994.
[43] S.R. Eddy. Non-Coding RNA Genes and the Modern RNA World.
Nature Reviews Genetics, 2(12):919929, 2001.
[44] A. Edvardsson, P. Gardner, A.M. Poole, M.D. Hendy, D. Penny, and
V. Moulton. A Search for H/ACA SnoRNAs in Yeast Using MFE
Secondary Structure Prediction. Bioinformatics, 19(7):865873, 2003.
[45] P. Evans. Algorithms and Complexity for Arc Annotated Sequence
Analysis. PhD thesis, University of Victoria, 1999.
[46] D. Feng and F. Doolittle. Progressive sequence alignment as a
prerequisite to correct phylogenetic trees. Journal of Molecular
Evolution, 25:351360, 1987.
[47] D. Fera, N. Kim, N. Shieldrim, J. Zorn, U. Laseron, H.H. Gan, and
T. Schlick. RAG: RNA-As-Graphs web resource. BMC Bioinformatics,
5:88, 2004.
[48] M. Ferch, J. Zhang, and M. Höchsmann. Learning cooperative assembly
with the graph representation of a state-action space. In IEEE/RSJ
International Conference on Intelligent Robots and Systems, 2002.
[49] C. Flamm, W. Fontana, I.L. Hofacker, and P. Schuster. RNA folding
at elementary step resolution. RNA, 6:325338, 2000.
[50] J.R. Fresco, B.M. Alberts, and P. Doty. Some molecular details of the
secondary structure of ribonucleic acid. Nature, 188:98101, 1960.
[51] H. Gabow. Implementation of Algorithms for Maximum Matching on
Nonbipartite Graphs. PhD thesis, Stanford University, 1973.
[52] H.H. Gan, D. Fera, J. Zorn, N. Shieldrim, M. Tang, U. Laserson,
N. Kim, and T. Schlick. RAG: RNA-As-Graphs database - concepts,
analysis, and features. Bioinformatics, 20(8):12851291, 2004.
[53] H.H. Gan, S. Pasquali, and T. Schlick. Exploring the repertoire of
RNA secondary motifs using graph theory with implications for RNA
design. Nucleic Acids Res., 31:29262943, 2004.
[54] P. Gardner. Simulating the RNA-world and Computational Ribo-
nomics. PhD thesis, Massey University, Palmerston North, New
Zealand., 2003.
[55] P.P. Gardner and R. Giegerich. A comprehensive comparison of com-
parative RNA structure prediction approaches. BMC, 5(1):140, 2004.
[56] R. Giegerich. A Systematic Approach to Dynamic Programming in
Bioinformatics. Bioinformatics, 16:665677, 2000.
[57] R. Giegerich and D. Evers. RNA Movies: visualizing RNA secondary
structure spaces. Bioinformatics, 15(1):3237, 1999.
[58] R. Giegerich and C. Meyer. Algebraic dynamic programming. In Hélène
Kirchner and Christophe Ringeissen, editors, Algebraic Methodology
And Software Technology, 9th International Conference, AMAST 2002,
pages 349364, Saint-Gilles-les-Bains, Reunion Island, France, 2002.
Springer LNCS 2422.
[59] R. Giegerich, B. Voss, and M. Rehmsmeier. Abstract Shapes of RNA.
Nucleic Acids Res., 20:48434851, 2004.
[60] W. Gilbert. Origin of Life, The RNA World. Nature, 319:618, 1986.
[61] GNU Build Tools: autoconf and automake. www.gnu.org.
[62] O. Gotoh. Significant improvement in accuracy of multiple protein
sequence alignments by iterative refinements as assessed by reference
to structural alignments. J. Mol. Biol., 264:823838, 1996.
[63] O. Gotoh. Multiple sequence alignment: algorithms and applications.
Adv Biophys., 36:159206, 1999.
[64] Graphviz - Graph Visualization Software. www.graphviz.org.
[65] N.K. Gray and M. Wickens. Control of Translation Initiation in Ani-
mals. Annu. Rev. Cell Dev. Biol., 14:399458, 1998.
[66] M. Gribskov, A. McLachlan, and D. Eisenberg. Profile analysis: detection
of distantly related proteins. In Proc. Nat. Acad. Sci. U.S.A.,
volume 88, pages 5558, 1987.
[67] S. Griffiths-Jones, A. Bateman, M. Marshall, A. Khanna, and S.R.
Eddy. Rfam: an RNA family database. Nucleic Acids Res., 31(1):439-441,
2003.
[68] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S.R. Eddy,
and A. Bateman. Rfam: annotating non-coding RNAs in complete
genomes. Nucleic Acids Res., 33:D121D124, 2005.
[69] J. Guhaniyogi and G. Brewer. Regulation of mRNA stability in mam-
malian cells. Gene, 265:1123, 2001.
[70] A.P. Gultyaev, F.H.D. van Batenburg, and C.W.A. Pleij. The computer
simulation of RNA folding pathways using a genetic algorithm. J. Mol.
Biol., 250:3751, 1995.
[71] D. Gusfield, K. Balasubramanian, and D. Naor. Parametric optimization
of sequence alignment. Algorithmica, 12:312326, 1994.
[72] K. Han and Y. Byun. PSEUDOVIEWER2: visualization of RNA pseu-
doknots of any type. Nucleic Acids Res., 31(13):34323440, 2003.
[73] K. Han and H.-J. Kim. Prediction of common folding structures of
homologous RNAs. Nucleic Acids Res., 21:12511257, 1993.
[74] K. Han, Y. Lee, and W. Kim. PseudoViewer: automatic visualization
of RNA pseudoknots. Bioinformatics, 18(1):321328, 2002.
[75] S. Henikoff and J.G. Henikoff. Amino acid substitution matrices from
protein blocks. In Proc. Nat. Acad. Sci. U.S.A., volume 89, pages
109159, 1992.
[76] D.S. Hirschberg. Algorithms for the Longest Common Subsequence
Problem. Journal of the ACM, 24(2):664-675, 1977.
[77] M. H ochsmann, T. Toller, R. Giegerich, and S. Kurtz. Local Similarity
in RNA Secondary Structures. In Proc. of CSB03, pages 159168,
2003.
[78] M. H ochsmann, B. Voss, and R. Giegerich. Pure Multiple RNA
Secondary Structure Alignments: A Progressive Prole Approach.
IEEE/ACM Transactions on Computational Biology and Bioinformat-
ics, 1(1):5362, 2004.
[79] I.L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids
Res., 31(13):34293431, 2003.
[80] I.L. Hofacker, H.F. Bernhart, and P.F. Stadler. Alignment of RNA Base
Pairing Probability Matrices. Bioinformatics, 20:22222227, 2004.
[81] I.L. Hofacker, M. Fekete, C. Flamm, M.A. Huynen, A. Rauscher, P.E.
Stolorz, and P.F. Stadler. Automatic Detection of Conserved RNA
Structure Elements in Complete RNA Virus Genomes. Nucleic Acids
Res., 26:3825-3836, 1998.
[82] I.L. Hofacker, M. Fekete, and P. Stadler. Secondary structure pre-
diction for aligned RNA sequences. J. Mol. Biol., 319(5):1059-1066,
2002.
[83] I.L. Hofacker, M. Fekete, and P.F. Stadler. Secondary Structure Predic-
tion for Aligned RNA Sequences. J. Mol. Biol., 319:1059-1066, 2002.
[84] I.L. Hofacker, W. Fontana, P.F. Stadler, S. Bonhoeffer, M. Tacker,
and P. Schuster. Fast Folding and Comparison of RNA Secondary
Structures. Monatshefte f. Chemie, 125:167-188, 1994.
[85] I.L. Hofacker, B. Priwitzer, and P.F. Stadler. Prediction of Locally
Stable RNA Secondary Structures for Genome-wide Surveys. Bioin-
formatics, 20:191-198, 2004.
[86] C.M. Hoffmann and M.J. O'Donnell. Pattern Matching in Trees. Jour-
nal of the ACM, 29:68-95, 1982.
[87] P. Hogeweg and B. Hesper. Energy directed folding of RNA sequences.
Nucleic Acids Res., 12:67-74, 1984.
[88] P. Hogeweg and B. Hesper. The alignment of sets of sequences and the
construction of phyletic trees: an integrated method. J. Mol. Biol.,
20(2):175-186, 1984.
[89] S.R. Holbrook and S.H. Kim. RNA crystallography. Biopolymers,
44:3-21, 1997.
[90] R.C. Holt and J.R. Cordy. The Turing Programming Language. Com-
munications of the ACM, 31(12):1410-1423, 1988.
[91] R.P. Jansen. mRNA Localization: Message On The Move. Nature Rev.
Mol. Cell Biol., 2:247-256, 2001.
[92] J. Jansson and A. Lingas. A fast algorithm for optimal alignment
between similar ordered trees. Fundamenta Informaticae, Special issue
on computing patterns in strings, 56(1,2):105-120, 2003.
[93] T. Jiang, G. Lin, B. Ma, and K. Zhang. The longest common subse-
quence problem for arc-annotated sequences. J. Discrete Algorithms,
2(2):257-270, 2004.
[94] T. Jiang, G.-H. Lin, B. Ma, and K. Zhang. A General Edit Distance
between RNA Structures. J. Comp. Biol., 9(2):371-388, 2002.
[95] T. Jiang, J.T.L. Wang, and K. Zhang. Alignment of Trees - An Al-
ternative to Tree Edit. Theoretical Computer Science, 143(1):137-148,
1995.
[96] P. Kilpeläinen. Tree Matching Problems with Applications to Structured
Text Databases. PhD thesis, University of Helsinki, Department of
Computer Science, November 1992.
[97] P. Kilpeläinen and H. Mannila. Grammatical Tree Matching. In Proc.
of CPM92, LNCS 644, pages 159-171, 1992.
[98] P. Kilpeläinen and H. Mannila. Ordered and Unordered Tree Inclusion.
SIAM Journal on Computing, 24(2):340-356, 1995.
[99] S.H. Kim, G.J. Suddath, G.J. Quigley, A. McPherson, J.L. Sussmann,
A.H.J. Wang, N.C. Seeman, and A. Rich. Three-dimensional tertiary
structure of yeast phenylalanine transfer RNA. Science, 185:435-439,
1974.
[100] J. Kjems and J. Egebjerg. Modern methods for probing RNA structure.
Cur. Opin. Biotech, 9:59-65, 1996.
[101] R.D. Klausner, T.A. Rouault, and J.B. Harford. Regulating the fate
of mRNA: The Control of Cellular Iron Metabolism. Cell, 72:19-28,
1993.
[102] P. Klein. Computing the Edit-Distance between Unrooted Ordered
Trees. In Proc. of ESA98, LNCS 1461, pages 91-102. Springer-Verlag,
1998.
[103] R.J. Klein and S.R. Eddy. RSEARCH: Finding homologs of single
structured RNA sequences. BMC Bioinformatics, 4(44), 2003.
[104] B. Knudsen and J. Hein. RNA secondary structure prediction using
stochastic context-free grammars. Nucleic Acids Res., 31(13):3423-3428,
2003.
[105] D.A.M. Konings and P. Hogeweg. Pattern analysis of RNA secondary
structures: Similarity and consensus of minimal-energy folding. J. Mol.
Biol., 207:597-614, 1989.
[106] S.R. Kosaraju. Efficient Tree Pattern Matching. In Proc. of FOCS89,
pages 178-183, 1989.
[107] L. Kretzner, A. Krol, and M. Rosbash. Saccharomyces cerevisiae U1
small nuclear RNA secondary structure contains both universal and
yeast-specific domains. In Proc. Nat. Acad. Sci. U.S.A., volume 87,
pages 851-855, 1990.
[108] S. Kurtz. Foundations of Sequence Analysis, 2000. Lecture notes for a
course in the Wintersemester 2000/2001, Technische Fakultät, Univer-
sität Bielefeld.
[109] G. Lapalme, R.J. Cedergren, and D. Sankoff. An algorithm for the dis-
play of nucleic acid secondary structure. Nucleic Acids Res., 10:8351-8356,
1982.
[110] S.Y. Le, R. Nussinov, and J.V. Maizel. Tree graphs of RNA secondary
structures and their comparison. Computational Biomedical Research,
22:461-473, 1989.
[111] S.Y. Le, J. Owens, R. Nussinov, J.H. Chen, B. Shapiro, and J.V. Maizel.
RNA secondary structures: Comparisons and determinations of fre-
quently recurring substructures by consensus. Comp. Appl. Biosci.,
5:205-210, 1989.
[112] S.Y. Le and M. Zuker. Predicting common foldings of homologous
RNA. J Biomol Struct Dyn., 8(5):1027-1044, 1991.
[113] V.I. Levenshtein. Binary codes capable of correcting deletions, inser-
tions and reversals. Dokl. Akad. Nauk USSR, 163:845-848, 1965. (Rus-
sian), English Translation (1966), Cybernetics and Control Theory, 10,
707-710.
[114] G.-H. Lin, Z.-Z. Chen, T. Jiang, and J. Wen. The longest common
subsequence problem for sequences with nested arc annotations. In
Proc. of ICALP01, pages 444-455, 2001.
[115] D.J. Lipman, S.F. Altschul, and J.D. Kececioglu. A tool for multiple
sequence alignment. Proc. Nat. Acad. Sci. U.S.A., 86:4412-4415, 1989.
[116] T.M. Lowe. GtRDB: the genomic tRNA database. World Wide Web,
2002.
[117] T.M. Lowe and S.R. Eddy. tRNAscan-SE: a program for improved
detection of transfer RNA genes in genomic sequence. Nucleic Acids
Res., 25(5):955-964, 1997.
[118] S.Y. Lu. A tree-to-tree distance and its application to cluster analy-
sis. IEEE Transactions on Pattern Analysis and Machine Intelligence,
1:219-224, 1979.
[119] F. Luccio, A. Mesa, P. Olivares, and L. Pagli. Exact rooted subtree
matching in sublinear time. In Proc. of the 9-th. ANaCC/ACM/IEEE
International Congress on Computer Science Research (CIICC02),
pages 27-35, 2002.
[120] R. Lück, S. Graf, and G. Steger. ConStruct: A tool for thermodynamic
controlled prediction of conserved secondary structure. Nucleic Acids
Res., 21:4208-4217, 1999.
[121] R. Lück, G. Steger, and D. Riesner. Thermodynamic prediction of
conserved secondary structure: Application to RRE-element of HIV,
tRNA-like element of BMV, and mRNA of prion protein. J. Mol. Biol.,
258:813-826, 1996.
[122] R.B. Lyngsø and C.N.S. Pedersen. Pseudoknot prediction in energy
based models. J. Comp. Biol., 7:409-427, 2000.
[123] B. Ma, L. Wang, and K. Zhang. Computing similarity between RNA
structures. Theoretical Computer Science, 276:111-132, 2002.
[124] D. Maier. The complexity of some problems on subsequence and su-
persequence. Journal of the ACM, 25(2):322-336, 1978.
[125] E. Makinen. On the subtree isomorphism problem for ordered trees.
Information Processing Letters, 32:271-273, 1989.
[126] M. Mandal, B. Boese, J.E. Barrick, W.C. Winkler, and R.R. Breaker.
Riboswitches control fundamental biochemical pathways in Bacillus
subtilis and other bacteria. Cell, 113(5):577-586, 2003.
[127] A. Marian. Detecting Changes in XML Documents. In Proceedings
of the 18th International Conference on Data Engineering (ICDE02),
page 41, 2002.
[128] O. Martinez and B. Goud. Rab proteins. Biochim. Biophys. Acta,
1404:101-112, 1998.
[129] D. Mathews, J. Sabina, M. Zuker, and H. Turner. Expanded sequence
dependence of thermodynamic parameters provides robust prediction
of RNA secondary structures. J. Mol. Biol., 288:911-940, 1999.
[130] D.H. Mathews. Predicting a set of minimal free energy RNA secondary
structures common to two sequences. Bioinformatics, Advance Online
Access, 2005.
[131] D.H. Mathews and D.H. Turner. Dynalign: an algorithm for finding
the secondary structure common to two RNA sequences. J. Mol. Biol.,
317(2):191-203, 2002.
[132] O. Matzura and A. Wennborg. RNAdraw: an integrated program for
RNA secondary structure calculation and analysis under 32-bit Mi-
crosoft Windows. Comp. Appl. Biosci., 12(3):247-249, 1996.
[133] J.S. McCaskill. The equilibrium partition function and base pair bind-
ing probabilities for RNA secondary structures. Biopolymers, 29:1105-1119,
1990.
[134] J.P. McCutcheon and S.R. Eddy. Computational identification of non-
coding RNAs in Saccharomyces cerevisiae by comparative genomics.
Nucleic Acids Res., 31(4):4119-4128, 2003.
[135] C. Meyer and R. Giegerich. Matching and Significance Evaluation of
combined sequence-structure motifs in RNA. Z. Phys. Chem., 216:193-216,
2002.
[136] L. Milanovic and H. Wagner. g2 - graphic library. www.gnu.org.
[137] L. Milanovic and H. Wagner. MATHPROG:
Solver for the Maximum Weight Matching Problem.
http://elib.zib.de/pub/Packages/mathprog/matching/weighted/.
[138] A.A. Mironov, L.P. Dyakonova, and A.E. Kister. A kinetic approach
to the prediction of RNA secondary structures. J Biomol Struct Dyn.,
2:953, 1985.
[139] B. Moayer and K.S. Fu. A tree system approach for fingerprint pat-
tern recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 8:376-387, 1986.
[140] B. Morgenstern. DIALIGN 2: improvement of the segment-to-segment
approach to multiple sequence alignment. Bioinformatics, 15:211-218,
1999.
[141] B. Morgenstern, A. Dress, and T. Werner. Multiple DNA and protein
sequence alignment based on segment-to-segment comparison. In Proc.
Nat. Acad. Sci. U.S.A., pages 12098-12103, 1996.
[142] V. Moulton, M. Zuker, M. Steel, R. Pointon, and D. Penny. Metrics
on RNA structures. J. Comp. Biol., 7(1):277-292, 2000.
[143] A. Nakata, K. Taura, K. Yamamoto, and A. Yonezawa. Visualization
of RNA secondary structures using highly parallel computers. Comp.
Appl. Biosci., 12(3):205-211, 1996.
[144] A. Nierman and H.V. Jagadish. Evaluating structural similarity in
XML documents. In Proceedings of the 5th International Workshop on
the Web and Databases (WebDB02), 2002.
[145] A. Nocker, T. Hausherr, S. Balsiger, N.P. Krstulovic, H. Hennecke,
and F. Narberhaus. A mRNA-based thermosensor controls expression
of rhizobial heat shock genes. Nucleic Acids Res., 29:4800-4807, 2001.
[146] A. Nocker, N.P. Krstulovic, X. Perret, and F. Narberhaus. ROSE ele-
ments occur in disparate rhizobia and are functionally interchangeable
between species. Arch. Microbiol., 176:44-51, 2001.
[147] H.F. Noller and C.R. Woese. Secondary structure of 16S ribosomal
RNA. Science, 212:403-411, 1981.
[148] C. Notredame. Recent progress in multiple sequence alignment: a
survey. Pharmacogenomics, 3(2):131-144, 2002.
[149] C. Notredame, D.G. Higgins, and J. Heringa. T-Coffee: A novel
method for fast and accurate multiple sequence alignment. J. Mol.
Biol., 302:205-217, 2000.
[150] C. Notredame and D.G. Higgins. SAGA: Sequence Alignment by Ge-
netic Algorithm. Nucleic Acids Res., 24:1515-1524, 1996.
[151] R. Nussinov, G. Pieczenik, J.R. Griggs, and D.J. Kleitman. Algorithms
for loop matchings. SIAM Journal of Applied Mathematics, 35:68-82,
1979.
[152] B. Onoa and C. Bustamente. RNA folding and unfolding. Curr. Opin.
Struct. Biol., 14(3):374-379, 2004.
[153] J. Perochon-Dorisse, F. Chetouani, S. Aurel, N. Iscolo, and B. Mi-
chot. RNA d2: a computer program for editing and display of RNA
secondary structures. Comp. Appl. Biosci., 11:101-109, 1995.
[154] G. Pesole, S. Liuni, G. Grillo, F. Licciulli, F. Mignone, C. Gissi, and
C. Saccone. UTRdb and UTRsite: specialized database of sequences
and functional elements of 5' and 3' untranslated regions of eukaryotic
mRNAs. Nucleic Acids Res., 30:335-340, 2002.
[155] S. Peyton Jones, editor. Haskell 98 Language and Libraries: The
Revised Report. Cambridge University Press, April 2003.
[156] A.M. Pyle and J.B. Green. RNA folding. Cur. Opin. Biotech, 5:303-310,
1995.
[157] R. Ramesh and R. Ramakrishnan. Nonlinear pattern matching in trees.
Journal of the ACM, 39(2):295-316, 1992.
[158] J. Reeder and R. Giegerich. Design, implementation and evaluation
of a practical pseudoknot folding algorithm based on thermodynamics.
BMC Bioinformatics, 5:104, 2004.
[159] D.C. Reis, P.B. Golgher, A.S. Silva, and A.F. Laender. Automatic
Web News Extraction Using Tree Edit Distance. In Proceedings of
the 13th international conference on World Wide Web, pages 502-511,
New York, August 2004.
[160] T. Richter. A New Algorithm for the Ordered Tree Inclusion Problem.
In Proc. of CPM97, LNCS 1264, pages 150-166, 1997.
[161] E. Rivas and S. Eddy. The language of RNA: a formal grammar that
includes pseudoknots. Bioinformatics, 16(4):334-340, 2000.
[162] E. Rivas and S.R. Eddy. Secondary structure alone is generally not
statistically significant for the detection of noncoding RNAs. Bioinfor-
matics, 16(7):583-605, 2000.
[163] E. Rivas and S.R. Eddy. Noncoding RNA gene detection using com-
parative sequence analysis. BMC Bioinformatics, 2(8), 2001.
[164] D.A. Rodionov, A.G. Vitreschak, A.A. Mironov, and M.S. Gelfand. Com-
parative genomics of thiamin biosynthesis in procaryotes. New genes
and regulatory mechanisms. J. Comp. Biol., 277:48949-48959, 2002.
[165] B. Rössler, J. Zhang, and M. Hochsmann. Visual Guided Grasping
and Generalization using Self-Valuing Learning. In IEEE/RSJ Inter-
national Conference on Intelligent Robots and Systems, 2002.
[166] J. Ruan, G. Stormo, and W. Zhang. An iterated loop matching ap-
proach to the prediction of RNA secondary structures with pseudo-
knots. Bioinformatics, 20:58-66, 2004.
[167] Y. Sakakibara, M. Brown, R. Hughey, I. Mian, R. Underwood, and
D. Haussler. Recent methods for RNA modeling using stochastic
context-free grammars. In Proc. of CPM94, LNCS 807, pages 289-306,
1994.
[168] Y. Sakakibara, M. Brown, R.C. Underwood, I.S. Mian, and D. Haus-
sler. Stochastic Context-Free Grammars for Modeling RNA. In Pro-
ceedings of the 27th Annual Hawaii International Conference on System
Sciences (HICSS-27), pages 284-294, 1994.
[169] H. Samet. Distance transform for images represented by quadtrees.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
4(3):298-303, 1982.
[170] D. Sankoff, J.B. Kruskal, S. Mainville, and R.J. Cedergren. Time
Warps, String Edits, and Macromolecules: The Theory and Practice of
Sequence Comparison, chapter An Analysis of the general Tree-Editing
Problem, pages 237-252. Addison Wesley, Reading, MA, 1983.
[171] D. Sankoff. Matching sequences under deletion/insertion constraints.
In Proc. Nat. Acad. Sci. U.S.A., volume 69, pages 4-6, 1974.
[172] D. Sankoff. Simultaneous solution of the RNA folding, alignment and
protosequence problem. SIAM Journal of Applied Mathematics, pages
810-825, 1985.
[173] D. Sankoff, J.B. Kruskal, S. Mainville, and R.J. Cedergren. Time
Warps, String Edits, and Macromolecules: The Theory and Practice
of Sequence Comparison, chapter Fast algorithms to determine RNA
secondary structures containing multiple loops, pages 93-120. Addison
Wesley, Reading, MA, 1983.
[174] J. Saraste, U. Lahtinen, and B. Goud. Localization of the small GTP-
binding protein rab1p to early compartments of the secretory pathway.
J. Cell Sci., 108:1541-1552, 1995.
[175] R. Sayle and J. Milner-White. RasMol: Biomolecular graphics for all.
Trends in Biochemical Sciences (TIBS), 20(9):374, 1995.
[176] A. Sczyrba, J. Krüger, H. Mersch, S. Kurtz, and R. Giegerich. RNA-
related tools on the Bielefeld Bioinformatics Server. Nucleic Acids Res.,
31(13):3767-3770, 2003.
[177] S.M. Selkow. The tree-to-tree editing problem. Information Processing
Letters, 6(6):184-186, 1977.
[178] B. Shapiro and K. Zhang. Comparing multiple RNA secondary struc-
tures using tree comparisons. Comp. Appl. Biosci., 6(4):309-318, 1990.
[179] B.A. Shapiro, J.V. Maizel Jr., L.E. Lipkin, K.M. Currey, and C. Whit-
ney. Generating non-overlapping displays of nucleic acid secondary
structure. Nucleic Acids Res., 12:75-88, 1984.
[180] Bruce Shapiro. An algorithm for comparing multiple RNA secondary
structures. Comp. Appl. Biosci., 4(3):387-393, 1988.
[181] D. Shasha and K. Zhang. Fast Algorithms for the Unit Cost Editing
Distance Between Trees. Journal of Algorithms, 11:581-621, 1990.
[182] H. Shi and P.B. Moore. The crystal structure of phenylalanine tRNA
at 1.93 Å resolution: a classic structure revisited. RNA, 6:1091-1105,
2000.
[183] S. Siebert and R. Backofen. MARNA: A Server for Multiple Alignment
of RNAs. In Proceedings of the German Conference on Bioinformatics,
pages 135-140, 2003.
[184] T.F. Smith and M.S. Waterman. Identification of Common Molecular
Subsequences. J. Mol. Biol., 147:195-197, 1981.
[185] G. Storz. An Expanding Universe of Noncoding RNAs. Science,
296(5571):1260-1263, 2002.
[186] J. Stoye, V. Moulton, and A.W.M. Dress. DCA: An Efficient Im-
plementation of the Divide-and-Conquer Multiple Sequence Alignment
Algorithm. Comp. Appl. Biosci., 13(6):625-626, 1997.
[187] B. Stroustrup. The C++ Programming Language (3rd Edition).
Addison-Wesley, 1997.
[188] M. Sundaralingham and S. Rao. Structure and Conformation of Nucleic
Acids and Protein-Nucleic Acid Interactions, pages 1213. University
Park Press, 1975.
[189] J.E. Tabaska, R.B. Cary, H.N. Gabow, and G. Stormo. An RNA folding
method capable of identifying pseudoknots and base triple. Bioinfor-
matics, 14(8):1213, 1998.
[190] J.E. Tabaska and G.D. Stormo. Automated Alignment of RNA Se-
quences to Pseudoknotted Structures. In Proc. of ISMB97, pages
311-318, 1997.
[191] K.C. Tai. The tree-to-tree correction problem. Journal of the ACM,
26:422-433, 1979.
[192] Y. Takahashi, Y. Satoh, H. Suzuki, and S. Sasaki. Recognition of largest
common structural fragment among a variety of chemical structures.
Analytical Sciences, 3:23-28, 1987.
[193] E. Tanaka. A bottom-up computing method for the metric between
trees defined by Tai. Trans. IECE, 66:660-667, 1983. (in Japanese).
[194] E. Tanaka. An improved computing method for a metric between trees.
Trans. IECE, 67:157-158, 1984. (in Japanese).
[195] E. Tanaka. The metric between trees based on strongly structure pre-
serving mapping. Trans. IECE, 67:722-723, 1984. (in Japanese).
[196] E. Tanaka. A computing algorithm for the tree metric based on the
structure preserving mapping. Trans. IECE, 68:317-324, 1985.
[197] E. Tanaka and K. Tanaka. A metric on trees and its computing method.
Trans. IECE, 75:511-518, 1982. (in Japanese).
[198] E. Tanaka and K. Tanaka. The tree-to-tree editing problem. In-
ternational Journal of Pattern Recognition and Artificial Intelligence,
2(2):221-240, 1988.
[199] H.M. Temin and S. Mizutani. RNA-dependent DNA polymerase in
virions of Rous sarcoma virus. Nature, 226:1211-1213, 1970.
[200] J. Thompson, D. Higgins, and T. Gibson. CLUSTAL W: Improv-
ing the sensitivity of progressive multiple sequence alignment through
sequence weighting, position-specific gap penalties and weight matrix
choice. Nucleic Acids Res., 22:4673-4690, 1994.
[201] I. Tinoco, O.C. Uhlenbeck, and M.D. Levine. Estimation of secondary
structure in ribonucleic acids. Nature, 230:363-367, 1971.
[202] I. Tinoco Jr. and C. Bustamente. How RNA folds. J. Mol. Biol.,
293:271-281, 1999.
[203] T. Toller. Bioinformatik-Strategien zur Analyse regulatorischer RNA-
Motive und ihre experimentelle Validierung. PhD thesis, University of
Bielefeld, Faculty of Technology, 2003. in German.
[204] A. Torsello. Matching Hierarchical Structures for Shape Recognition.
PhD thesis, University of York, 2004.
[205] A. Torsello and E.R. Hancock. Matching and Embedding through Edit-
Union of Trees. In Proceedings of the 7th European Conference on
Computer Vision, LNCS 2352, page 822. Springer-Verlag Heidelberg,
2002.
[206] A. Torsello and E.R. Hancock. Tree Edit Distance from Information
Theory. In Lecture Notes in Computer Science, volume 2726, pages
71-82. Springer-Verlag Heidelberg, 2003.
[207] H. Touzet. Tree edit distance with gaps. Information Processing Let-
ters, 85:123-129, 2003.
[208] D.K. Treiber and J.R. Williamson. Exposing kinetic traps in RNA
folding. Curr. Opin. Struct. Biol., 9:339-345, 1999.
[209] D.H. Turner, N. Sugimoto, and S.M. Freier. RNA structure prediction.
Annual Review of Biophysics and Biophysical Chemistry, 17:167-192,
1988.
[210] Gabriel Valiente. An Efficient Bottom-Up Distance between Trees.
In Proc. 8th Int. Symposium on String Processing and Information
Retrieval, pages 212-219. IEEE Computer Science Press, 2001.
[211] D. van Heesch. Doxygen. www.doxygen.org.
[212] T. Villa, J.A. Pleiss, and C. Guthrie. Spliceosomal snRNAs: Mg(2+)-
dependent chemistry at the catalytic core? Cell, 109:149-152, 2002.
[213] R.A. Wagner and M.J. Fischer. The string-to-string correction prob-
lem. Journal of the ACM, 21:168-173, 1974.
[214] J.T.L. Wang, B. Shapiro, D. Shasha, K. Zhang, and K.M. Currey.
An algorithm for finding the largest common substructures of two
trees. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 20:889-895, 1998.
[215] J.T.L. Wang and K. Zhang. Identifying consensus of trees through align-
ment. Information Sciences, 126:14, 2000.
[216] J.T.L. Wang and K. Zhang. Finding similar consensus between trees:
an algorithm and a distance hierarchy. Pattern Recognition, 34:127-137,
2001.
[217] J.T.L. Wang, K. Zhang, and C.-Y. Chang. Identifying approximately
common substructures in trees based on a restricted edit distance. In-
formation Sciences, 121:367-386, 1999.
[218] J.T.L. Wang, K. Zhang, and G.-W. Chirn. Algorithms for Approximate
Graph Matching. Information Sciences, 82:45-74, 1995.
[219] J.T.L. Wang, K. Zhang, and K. Jeong. A System for
Approximate Tree Matching. IEEE Transactions on Knowledge and
Data Engineering, 6(4):559-571, 1994.
[220] L. Wang and T. Jiang. On the complexity of multiple sequence align-
ment. J. Comp. Biol., 1(4):337-348, 1994.
[221] L. Wang and J. Zhao. Parametric alignment of ordered trees. Bioin-
formatics, 19(17):2237-2245, 2003.
[222] Z. Wang and K. Zhang. Alignment between Two RNA Structures. In
Proc. of FOCS01, pages 690-702, 2001.
[223] S. Washietl and I. Hofacker. Consensus Folding of Aligned Sequences as
a New Measure for the Detection of Functional RNAs by Comparative
Genomics. J. Mol. Biol., 342(1):19-30, 2004.
[224] M.S. Waterman. Computer analysis of nucleic acid sequences. Methods
in Enzymology, 164:765-792, 1988.
[225] M.S. Waterman, R. Arratia, and D.J. Galas. Pattern recognition in
several sequences: consensus and alignment. Bulletin of Mathematical
Biology, 46:515-527, 1984.
[226] J. Watson and F. Crick. A structure for Deoxyribose Nucleic Acid.
Nature, 171:737, 1953.
[227] A. Waugh, P. Gendron, R. Altman, J.W. Brown, D. Case, D. Gau-
theret, S.C. Harvey, N. Leontis, J. Westbrook, E. Westhof, M. Zuker,
and F. Major. RNAML: a standard syntax for exchanging RNA infor-
mation. RNA, 8(6):707-717, 2002.
[228] N. Wedemeyer, T. Schmitt-John, D. Evers, C. Thiel, D. Eberhard, and
H. Jockusch. Conservation of the 3'-untranslated region of the Rab1a
gene in amniote vertebrates: exceptional structure in marsupials and
possible role for posttranscriptional regulation. FEBS Letters, 477:49-54,
2000.
[229] S. Winker, R. Overbeek, C.R. Woese, G.J. Olsen, and N. Pfluger.
Structure detection through automated covariance search. Comp. Appl.
Biosci., 6:365-371, 1990.
[230] W. Winkler, S. Cohen-Chalamish, and R.R. Breaker. An mRNA struc-
ture that controls gene expression by binding FMN. Proc. Nat. Acad.
Sci. U.S.A., 99:15908-15913, 2002.
[231] K. Wojciech and B.A. Shapiro. Stem Trace: an interactive visual tool
for comparative RNA structure analysis. Bioinformatics, 15(1):16-31,
1999.
[232] M.T. Wolfinger, W.A. Svrcek-Seiler, C. Flamm, I.L. Hofacker, and P.F.
Stadler. Efficient computation of RNA folding dynamics. J. Phys. A:
Math. Gen., 37:4731-4741, 2004.
[233] http://rna.ucsc.edu/rnacenter/xrna/xrna. WWW.
[234] K. Yamamoto, N. Sakurai, and H. Yoshikura. Graphics of RNA sec-
ondary structure; towards an object-oriented algorithm. Bioinformat-
ics, 3:99-103, 1987.
[235] W. Yang. Identifying syntactic differences between two programs. Soft-
ware - Practice and Experience, 21:7, 1991.
[236] K. Zhang. Algorithms for constrained edit distance between ordered
labeled trees and related problems. Pattern Recognition, 28:463-474,
1995.
[237] K. Zhang. Efficient parallel algorithms for tree editing problems. In
Proc. of CPM96, LNCS 1075, pages 361-372, 1996.
[238] K. Zhang. Computing similarity between RNA secondary structures. In
Proceedings of the IEEE International Joint Symposia on Intelligence
and Systems, pages 126-132, 1998.
[239] K. Zhang. Efficient Parallel Algorithm for the Editing Distance between
Ordered Trees. In Proc. of CPM98, LNCS 1448, pages 80-90, 1998.
[240] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance
between trees and related problems. SIAM Journal on Computing,
18(6):1245-1262, 1989.
[241] K. Zhang and D. Shasha. Pattern Matching Algorithms, chapter Tree
Pattern Matching, pages 341-371. Oxford University Press, 1997.
[242] K. Zhang, L. Wang, and B. Ma. Computing similarity between RNA
structures. In Proc. of CPM99, LNCS 1645, pages 281-293, 1999.
[243] M. Zuker. On Finding All Suboptimal Foldings of an RNA Molecule.
Science, 244:48-52, 1989.
[244] M. Zuker. Mfold web server for nucleic acid folding and hybridization
prediction. Nucleic Acids Res., 31(13):3406-3415, 2003.
[245] M. Zuker and A.B. Jacobson. Using reliability information to annotate
RNA secondary structures. RNA, 4(6):669-679, 1998.
[246] M. Zuker, J. Jaeger, and D. Turner. A comparison of optimal and sub-
optimal RNA secondary structures predicted by free energy minimiza-
tion with structures determined by phylogenetic comparison. Nucleic
Acids Res., 19(10):2707-2714, 1991.
[247] M. Zuker and P. Stiegler. Optimal Computer Folding of Large RNA
Sequences using Thermodynamics and Auxiliary Information. Nucleic
Acids Res., 9:133-148, 1981.