
DNA5

This document discusses RNA secondary structure matching. It introduces a mathematical model where RNA molecules are modeled as words in a finite alphabet and secondary structures are modeled as planar matchings without crossings on the words. The document proves that for words in an alphabet with 3 or more letters, the fraction of unmatched letters approaches a positive limit as word length increases. It also discusses some properties of optimal matchings on ternary words after applying a pair elimination procedure to remove adjacent matching pairs. The main theorem proves that for ternary words, the fraction of unmatched letters is bounded strictly away from zero in the limit of infinite word length, establishing that secondary structures form on a positive fraction of the molecule.

RNA-type matching

November 13, 2012


1 Introduction
We study the conjecture of [1] concerning the morphological transition of the secondary structures of random RNA-like molecules. A simplified mathematical model of such a structure is a maximal matching without crossings (a planar cactus-like matching) on a random word in a finite alphabet, where matches are allowed only between specified pairs of symbols.

We prove that the fraction of nucleotides linked by secondary bonds in the minimal-energy configuration of a long RNA chain is bounded away from 1. The same is true for a-a bonds in $n$-letter alphabets with $n \ge 3$ (for $n = 2$ this is false).
In the second part we consider words in an infinite alphabet. Namely, a random graph is constructed on the vertex set $V = \{1, \dots, N\}$: for each pair $(i, j)$, $i \ne j$, the edge $(i, j)$ is drawn with probability $p$, independently. Such an edge indicates the possibility of a bond between $i$ and $j$ in a matching on $V$. We look for a factor of this graph that is a maximal matching without crossings on $\{1, \dots, N\}$.

A phase transition happens for such graphs as $N \to \infty$: for $p > p^*$, the fraction of unmatched vertices equals zero asymptotically. We prove that $0.255 < p^* < 0.490$ and give directions for further improvement of the bounds.
2 Model
We begin with a simplified model that is also considered in [1], where the secondary bonds may arise between nucleotides of the same type only (a-a bonds). This model allows for an arbitrary number $n$ of nucleotide types (or colors, in our terms).
Let $A$ be a finite alphabet consisting of colors $1, \dots, n$, where $n = |A|$ (say, $A = \{1, 2, 3\}$, the main case of study in this paper). By $\mathcal{W} = \mathcal{W}_N$ we denote the set of all $N$-words in $A$, with the uniform measure on this set.

For a word $W = w_1 \dots w_N \in \mathcal{W}$, a permutation $M$ of the index set $J = \{1, \dots, N\}$ is a matching if $M^2 = I$ and $w_{M(i)} = w_i$ for all $i \in J$. The pair $\{w_i, w_j\}$ is matched if $i \ne j$ and $j = M(i)$ (and, hence, $i = M(j)$).

The pair of matched points $\{i, M(i)\}$ may be thought of as being connected by an upper half-circle arc. We say that $M$ has no crossings if all these arcs are disjoint, that is, if $i < j < M(i)$ implies $i < M(j) < M(i)$ for $i, j \in J$. Denote by $\mathcal{M}(W)$ the set of all such matchings of $W$. In what follows we only consider matchings from $\mathcal{M}(W)$.
The defect of $M \in \mathcal{M}(W)$, denoted $d(M)$, is the cardinality of the fixed set of $M$ (the unmatched symbols), that is,
$$d(M) = |\{i \in J : M(i) = i\}|.$$
We introduce the notation
$$d^*(W) = \min_{M \in \mathcal{M}(W)} d(M)$$
(the minimal defect of matchings of $W$) and
$$\bar d(N) = \mathbf{E}_{W \in \mathcal{W}_N}\, d^*(W)$$
(the expectation of the minimal defect on $N$-words).
Given the cardinality $|A| = n$ of the alphabet, we define the limit
$$r_n = \lim_{N \to \infty} \frac{\bar d(N)}{N},$$
if it exists. Clearly, $r_n$ depends on the cardinality of $A$ only. Actually, the existence of $r_n$ follows immediately (by Fekete's lemma) from the subadditivity of $d^*(W)$ and, hence, of $\bar d(N)$:
$$d^*(W_1 W_2) \le d^*(W_1) + d^*(W_2), \qquad \bar d(N_1 + N_2) \le \bar d(N_1) + \bar d(N_2),$$
where $N_1 = |W_1|$ and $N_2 = |W_2|$.
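
The minimal defect defined in this section can be computed for short words by a standard interval recursion. The following is only an illustrative sketch: the function name `min_defect` and the representation of a word as a Python string are assumptions made here, not part of the original text.

```python
from functools import lru_cache


def min_defect(word):
    """Minimal defect d*(W) in the a-a model: the smallest number of
    unmatched symbols over all crossing-free matchings that pair only
    equal symbols. Intended for short words (cubic time, recursive)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Minimal defect of the factor word[i:j].
        if i >= j:
            return 0
        # Option 1: the symbol at position i stays unmatched.
        result = 1 + best(i + 1, j)
        # Option 2: position i is matched with an equal symbol at k;
        # the arc (i, k) splits the rest into an inner and an outer part.
        for k in range(i + 1, j):
            if word[k] == word[i]:
                result = min(result, best(i + 1, k) + best(k + 1, j))
        return result

    return best(0, len(word))


if __name__ == "__main__":
    for w in ["1221", "121", "1212", "123"]:
        print(w, min_defect(w))
```

For instance, the word 1221 can be matched completely, while in 123 every symbol stays unmatched. Concatenating two words and comparing the three values gives a quick check of the subadditivity inequality.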
3 Ternary alphabet
Let $|A| = 3$. Numerical tests suggest that $r_3 \approx 0.022$, and our main goal is to prove that $r_3 > 0$. This would imply $r_k > 0$ for all $k \ge 3$ by the following monotonicity argument. Let $M$ be an optimal matching on a word $W$ in, for instance, a 4-letter alphabet. If we remove all $w_i = 4$ from $W$ and all the edges 4-4 from $M$, then the remaining matching $M'$ on the remaining word $W'$ is still cross-less (but may no longer be optimal).

The difference $d^*(W) - d^*(W')$ is, hence, not less than the number of symbols $w_i = 4$ in $W$ such that $M(i) = i$. On average, the number of such symbols is $\frac14 d^*(W)$ because all the letters are equiprobable. Therefore, $\mathbf{E}\, d^*(W') \le \frac34\, \mathbf{E}\, d^*(W)$. Since $\mathbf{E}|W'| = \frac34 |W|$, we get $r_4 \ge r_3$.
On the other hand, $r_2 = 0$, since we can match two adjacent symbols of the same color and then delete this pair from the word. We repeat this action step by step until there are no more adjacent symbols of the same color. The remaining word $W'$ is of the form $\dots 1212 \dots$, and hence $d^*(W') \le 2$ since $d^*(12 \dots 21) = 1$. Therefore $d^*(W) \le 2$ for each $W \in \{1, 2\}^*$.
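
The bound of 2 on the minimal defect of binary words can be checked exhaustively for short lengths. This is a minimal sketch, assuming a brute-force interval recursion for the minimal defect; the helper names `min_defect` and `worst_binary_defect` are hypothetical.

```python
from functools import lru_cache
from itertools import product


def min_defect(word):
    """Minimal number of unmatched symbols over crossing-free matchings
    of equal symbols (short words only)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        if i >= j:
            return 0
        result = 1 + best(i + 1, j)          # position i unmatched
        for k in range(i + 1, j):            # position i matched with k
            if word[k] == word[i]:
                result = min(result, best(i + 1, k) + best(k + 1, j))
        return result

    return best(0, len(word))


def worst_binary_defect(length):
    """Largest minimal defect among all binary words of the given length."""
    return max(min_defect("".join(w)) for w in product("12", repeat=length))


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, worst_binary_defect(n))
```

The check covers only small lengths, so it illustrates the statement rather than proving it.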
For a word $W = w_1 \dots w_N$, we define the pair elimination procedure as follows. Find the leftmost matchable pair $w_i = w_{i+1}$ within $W$ and delete it, that is, reduce $W$ to $w_1 \dots w_{i-1} w_{i+2} \dots w_N$. Repeat this action until there are no more matchable pairs to delete. Denote the final reduced word by $W'$.
Proposition 3.1. $d^*(W') = d^*(W)$.

Proof. It suffices to prove that for each pair $w_i = w_{i+1}$ in $W$ there exists an optimal matching on $W$ that contains this pair. Let us consider an optimal matching $M$ of $W$. First, if, say, $w_i$ is unmatched in $M$, we can match $w_{i+1}$ with $w_i$ instead of its current partner, and the defect remains unchanged.

Now suppose that $w_i$ is matched with $w_a$ and $w_{i+1}$ is matched with $w_b$ in $M$. Then either $a < i < i+1 < b$, or $b < a < i < i+1$, or, symmetrically, $i < i+1 < b < a$, by the cross-less nature of $M$. Since all four symbols have the same color, we can match, instead, $w_i$ with $w_{i+1}$ and $w_a$ with $w_b$, and no new crossings arise.
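
Proposition 3.1 can be sanity-checked by brute force on all short ternary words. The sketch below assumes a literal implementation of the leftmost-pair elimination and an interval recursion for the minimal defect; the names `eliminate_leftmost`, `min_defect` and `find_counterexample` are hypothetical.

```python
from functools import lru_cache
from itertools import product


def min_defect(word):
    """Minimal defect over crossing-free matchings of equal symbols."""

    @lru_cache(maxsize=None)
    def best(i, j):
        if i >= j:
            return 0
        result = 1 + best(i + 1, j)
        for k in range(i + 1, j):
            if word[k] == word[i]:
                result = min(result, best(i + 1, k) + best(k + 1, j))
        return result

    return best(0, len(word))


def eliminate_leftmost(word):
    """Repeatedly delete the leftmost pair of equal adjacent symbols."""
    w = list(word)
    i = 0
    while i + 1 < len(w):
        if w[i] == w[i + 1]:
            del w[i:i + 2]
            # A new adjacent pair can only appear at positions (i-1, i).
            i = max(i - 1, 0)
        else:
            i += 1
    return "".join(w)


def find_counterexample(max_len, alphabet="123"):
    """Return a word whose defect changes under elimination, or None."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if min_defect(w) != min_defect(eliminate_leftmost(w)):
                return w
    return None


if __name__ == "__main__":
    print(find_counterexample(6))
```

A return value of `None` means that no word up to the chosen length violates the proposition.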
For an initial word $W$, let $W'$ be the result of pair elimination.

Proposition 3.2. The resulting word $W'$ has mean length that equals $N/3$ asymptotically, that is,
$$\lim_{N \to \infty} \frac{\mathbf{E}|W'|}{N/3} = 1.$$
The symbols of $W'$ are distributed according to the Markov process with the transition probabilities $p(a, b) = 1/2$ for all pairs $a \ne b$. The leftmost symbol is 1, 2, or 3 with equal probability 1/3. Hence, all the repetition-free words of a given length are equiprobable.
Proof. The word $W'$ can be generated from $W$ as the result of the following procedure, which is equivalent to a certain pair elimination. Set $U_0 = \emptyset$ and, at each step $i = 1, \dots, N$, compare the last symbol of $U_{i-1}$ with $w_i$. If they are different or if $U_{i-1} = \emptyset$, we set $U_i = U_{i-1} w_i$. Otherwise we delete the last symbol of $U_{i-1}$ to obtain $U_i$. Obviously, $U_N = W'$.

If $U_{i-1} \ne \emptyset$, then it is independent of $w_i$ and
$$\mathbf{E}(|U_i| - |U_{i-1}|) = \tfrac23 \cdot 1 + \tfrac13 \cdot (-1) = 1/3$$
(a random walk with a positive drift on $\mathbb{Z}_+$). Hence, $\mathbf{E}|U_N| \sim N/3$ as $N \to \infty$ (obviously, $\mathbf{E}|U_N| \ge N/3$ for each $N$, which is sufficient for our purposes). The Markov property follows as well, by induction on $i$.
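
The stack-like procedure from the proof is easy to simulate. The sketch below is only an illustration: the function names, the encoding of colors as 0, 1, 2, and the sample sizes are assumptions chosen here.

```python
import random


def reduce_with_stack(word):
    """Build U_N from the proof: append a symbol if it differs from the
    last one of the current word, otherwise delete that last symbol."""
    stack = []
    for symbol in word:
        if stack and stack[-1] == symbol:
            stack.pop()
        else:
            stack.append(symbol)
    return stack


def mean_reduced_length(n, trials, seed=0):
    """Monte Carlo estimate of E|U_N| for uniform ternary words."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        word = [rng.randrange(3) for _ in range(n)]
        total += len(reduce_with_stack(word))
    return total / trials


if __name__ == "__main__":
    n = 3000
    print(mean_reduced_length(n, trials=200) / (n / 3))
```

The printed ratio should be close to 1, in line with the asymptotic mean length $N/3$.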
4 The main theorem
Let us prove our main assertion.
Theorem 4.1. $r_3 > 0$.
Proof. After the pair elimination, it remains to prove the assertion for uniformly distributed ternary words without repetitions. Let $M$ be an optimal matching of such a word $V$. Let us choose an unmatched point $w_i$ and eliminate it together with all the matchable pairs $(w_{i-1}, w_{i+1}), \dots, (w_{i-m}, w_{i+m})$, until $(w_{i-m-1}, w_{i+m+1})$ is either unmatchable or exceeds the limits of the word $V$.

The remaining word $V'$, again, has no repetitions. Let us show that $d^*(V') = d^*(V) - 1$. Indeed, the elimination of $w_i$ reduces the defect by at least 1, since the matching scheme on the remaining word is still cross-less. If there existed a better cross-less matching on the remaining word, this matching would also be a better one for the initial word, which is a contradiction. The subsequent elimination of pairs does not change the defect.

We may also say that, at a single step, we remove one of the unmatched symbols together with a maximal palindrome of odd length for which this symbol is the center.
Now we will find a lower bound for the number of such eliminations needed until the word is deleted completely. Instead of palindrome deletion, we will study the equivalent operation of serial insertions of palindromes into the current word. How many palindromes do we need to construct a given ternary repetition-free word of length $N$ if we start from the empty word?

Suppose we are given a ternary repetition-free word $V$ of length $N$ (we use $N$ instead of $N/3$ for economy of notation) that is constructed from $K$ repetition-free palindromes $P \in \mathcal{P}$ of odd length. In what follows we suggest a code for such a construction. Comparing the total number of words $V$ (clearly, this is $\frac32 2^N$, and each word appears with equal probability) with the total number of such codes, we will prove the theorem.
Note that each symbol $v_i$ of $V$ is a contribution of one of the $K$ palindromes, $P = P(i) \in \mathcal{P}$. Let us construct an order $P_1, \dots, P_K$ on $\mathcal{P}$ as follows. We denote $i_1 = 1$, $P_1 = P(i_1)$ and mark all the symbols $v_i \in V$ that are contributed by $P_1$. Then we find the minimal $j \in \{1, \dots, N\}$ such that $v_j$ is unmarked, denote $i_2 = j$, $P_2 = P(i_2)$, and mark all the symbols $v_i \in V$ that are contributed by $P_2$. We repeat this procedure until $V$ is exhausted.
As a result, we get an increasing sequence $i_1, \dots, i_K$ and a sequence of odd-length palindromes $P_1, \dots, P_K$ such that $|P_1| + \dots + |P_K| = N$. Note that these two sequences define the word $V$ in a unique way. Indeed, the construction of $V$ happens as follows. We set $V_1 = P_1$, then insert $P_2$ into $V_1$ at the position $i_2$ to obtain $V_2$, and iterate this procedure for $P_3, \dots, P_K$. As a result, we get $V = V_K$.

Let us prove this by induction on $K$. The only thing to prove is that $P_K$ is contained in $V$ as a whole, without inserted parts of other palindromes. Suppose the contrary. Then there exists a palindrome $P(i)$ such that some symbols of $P_K$ lie within $P(i)$ and vice versa. This is, however, impossible, since then these palindromes would not be deletable at any step.
Now let us find an upper bound for the number of words $V$ that can be constructed in this manner. To begin with, the number of sequences $m_1, \dots, m_K$ of palindrome lengths (which are all odd and sum up to $N$) is bounded from above by
$$\binom{(N-K)/2}{K}$$
(if $N \equiv K \pmod 2$). Analogously, the number of sequences $i_1, \dots, i_K$ is bounded from above by
$$\binom{N}{K}.$$
It remains to estimate the number of $K$-tuples of palindromes of lengths $m_1, \dots, m_K$. These palindromes are free of repetitions, by construction, and a repetition-free palindrome of odd length $m$ is determined by its first $(m+1)/2$ symbols. Moreover, apart from the first palindrome $P_1$, all the remaining ones have a restriction on the first symbol: no repetition should arise as a palindrome $P_k$ is inserted at the position $i_k$ of the current word $V_{k-1}$. Hence, the required upper bound is
$$\frac32\, 2^{\frac{N+K}{2}}.$$
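For the binomial coefficients we will use the entropy bound
$$\sum_{k=0}^{[\alpha N]} \binom{N}{k} \le 2^{H(\alpha) N}, \qquad 0 \le \alpha \le \tfrac12,$$
where
$$H(\alpha) = -\alpha \log \alpha - (1 - \alpha) \log(1 - \alpha)$$
and the logarithms are binary.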
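
The entropy bound can be verified numerically for concrete values. The following sketch assumes binary logarithms and $0 \le \alpha \le 1/2$; the function names are hypothetical and exact integer binomials are used to avoid overflow.

```python
from math import comb, log2


def binary_entropy(a):
    """H(a) = -a log2 a - (1 - a) log2 (1 - a), with H(0) = H(1) = 0."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * log2(a) - (1 - a) * log2(1 - a)


def entropy_bound_holds(n, alpha):
    """Check that sum_{k <= alpha*n} C(n, k) <= 2^(H(alpha) * n)."""
    partial_sum = sum(comb(n, k) for k in range(int(alpha * n) + 1))
    return log2(partial_sum) <= binary_entropy(alpha) * n


if __name__ == "__main__":
    for n in (10, 50, 200):
        for alpha in (0.05, 0.1, 0.25, 0.5):
            print(n, alpha, entropy_bound_holds(n, alpha))
```

The comparison is done on the logarithmic scale, so large values of `n` do not cause floating-point overflow.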
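Now, if we set $K = \alpha N$, the number of repetition-free $N$-words that can be constructed from $K$ odd-length palindromes is bounded from above by the product
$$\Phi(\alpha) = 2^{\frac{1+\alpha}{2} N} \cdot 2^{H(\alpha) N} \cdot 2^{\frac{1-\alpha}{2} H\left(\frac{2\alpha}{1-\alpha}\right) N}.$$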
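Recall that the number of repetition-free ternary words equals $\frac32 2^N$ and conclude that, for any $\alpha > 0$ that satisfies the inequality
$$\frac{1+\alpha}{2} + \frac{1-\alpha}{2}\left(-\frac{2\alpha}{1-\alpha}\log\frac{2\alpha}{1-\alpha} - \left(1 - \frac{2\alpha}{1-\alpha}\right)\log\left(1 - \frac{2\alpha}{1-\alpha}\right)\right) + \bigl(-\alpha\log\alpha - (1-\alpha)\log(1-\alpha)\bigr) < 1, \qquad (4.1)$$
the fraction of words that can be eliminated in $K = [\alpha N]$ steps tends to zero and, hence, the mean defect of $V$ exceeds $\alpha N$ for $N$ large enough. It remains to notice that
$$\lim_{\alpha \to 0} \alpha \log \alpha = 0, \qquad \lim_{\alpha \to 0} (1 - \alpha)\log(1 - \alpha) = 0,$$
and, hence, (4.1) holds for all $\alpha > 0$ small enough.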
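
To see how small the parameter in (4.1) has to be, the left-hand side can be evaluated numerically. This is a rough sketch under the assumption that the logarithms are binary and that the left-hand side crosses 1 only once on the search interval; the names `exponent_rate` and `critical_alpha` are hypothetical.

```python
from math import log2


def binary_entropy(x):
    """H(x) = -x log2 x - (1 - x) log2 (1 - x)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)


def exponent_rate(alpha):
    """Left-hand side of (4.1): the exponent of the code count, per symbol."""
    x = 2 * alpha / (1 - alpha)
    return (1 + alpha) / 2 + (1 - alpha) / 2 * binary_entropy(x) + binary_entropy(alpha)


def critical_alpha(lo=1e-6, hi=0.1, iterations=60):
    """Bisection for the largest alpha in [lo, hi] with exponent_rate < 1."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if exponent_rate(mid) < 1:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    alpha = critical_alpha()
    print(alpha, alpha / 3)
```

The second printed number is the resulting bound after rescaling by the factor 1/3 that comes from the pair elimination.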
We may also interpret the complete elimination process as follows. At each step, a palindrome is extracted from the current word. All even-length palindromes are free of charge, and each odd-length palindrome costs 1 rouble. The goal is to find the cheapest way to delete the word. Still more elementary, we may operate with palindromes of length 1 and 2 only.
The proof of Theorem 4.1 gives a lower bound $\alpha/3 \approx 0.015$ for $r_3$, whereas the numerical tests suggest $r_3 \approx 0.022$. The gap is, apparently, due to the fact that a word $V$ can be constructed from $K$ palindromes in many ways. For instance, 1212 has 2 different palindrome schemes for $K = 2$: [1][212] and [121][2].
5 RNA-type case
Here $|A| = 4$ and only the matches $(1, 2)$ and $(3, 4)$ are allowed. We prove that the mean fraction of unmatched symbols in an optimal matching is bounded away from 0 as $N \to \infty$. The proof is mainly the same as for $r_3$, since the pair elimination procedure, again, does not affect the defect of the word, by the same argument. Note that this is no longer true in the case of a general matching graph, as the following example demonstrates.
Let $A = \{1, 2\}$, $G = \{(1, 1), (1, 2)\}$ (the matching graph), and let $W = 1122$. Then the pair elimination process matches $w_1$ with $w_2$ and leaves $w_3$ and $w_4$ unmatched. The optimal matching is, however, $(w_1, w_4)$ and $(w_2, w_3)$. As a result, we cannot derive a general criterion for completeness of a matching scheme that is generated by a general matching graph $G$.
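
The counterexample, and the contrast with the RNA-type rule, can be reproduced with a few lines of code. The sketch below assumes a matching graph given as a list of unordered pairs of symbols; the helpers `make_rule`, `min_defect` and `eliminate_pairs` are hypothetical names.

```python
from functools import lru_cache
from itertools import product


def make_rule(graph):
    """Turn a list of allowed (unordered) pairs into a predicate."""
    pairs = {frozenset(edge) for edge in graph}
    return lambda a, b: frozenset((a, b)) in pairs


def min_defect(word, can_match):
    """Minimal defect over crossing-free matchings allowed by can_match."""

    @lru_cache(maxsize=None)
    def best(i, j):
        if i >= j:
            return 0
        result = 1 + best(i + 1, j)
        for k in range(i + 1, j):
            if can_match(word[i], word[k]):
                result = min(result, best(i + 1, k) + best(k + 1, j))
        return result

    return best(0, len(word))


def eliminate_pairs(word, can_match):
    """Repeatedly delete the leftmost matchable pair of adjacent symbols."""
    w = list(word)
    i = 0
    while i + 1 < len(w):
        if can_match(w[i], w[i + 1]):
            del w[i:i + 2]
            i = max(i - 1, 0)
        else:
            i += 1
    return "".join(w)


if __name__ == "__main__":
    general = make_rule([("1", "1"), ("1", "2")])
    reduced = eliminate_pairs("1122", general)
    print(reduced, min_defect(reduced, general), min_defect("1122", general))

    rna = make_rule([("1", "2"), ("3", "4")])
    words = ("".join(w) for n in range(6) for w in product("1234", repeat=n))
    print(all(min_defect(w, rna) == min_defect(eliminate_pairs(w, rna), rna)
              for w in words))
```

For the general graph the elimination loses two symbols that an optimal matching would cover, while for the RNA-type rule the exhaustive check over short words finds no such loss.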
6 Random matching
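As a generalization, the most natural case is that of a random graph on $\{1, \dots, N\}$ (this is equivalent to a random graph on a countable number of colors). The parameter $p \in [0, 1]$ is the probability of a given pair $i, j$ to be connected by an edge. We denote by $s_p$ the corresponding limit of the scaled expected number of matched vertices, that is, one minus the limit of the scaled expected defect. It is not hard to demonstrate that $s_p = 1$ for all $p \ge 1/2$, see the pair elimination procedure above: a newly arrived vertex is matched with the last queued one with probability $p$, so the queue has the non-positive drift $1 - 2p$.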
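
The pair elimination on a random graph can be simulated without storing the graph. The sketch below rests on one observation, stated here as an assumption of the illustration: each comparison involves a newly arrived vertex, so every examined pair is fresh and its edge is an independent coin flip with probability $p$. The function name and the parameters are hypothetical.

```python
import random


def greedy_unmatched_fraction(n, p, seed=0):
    """Fraction of vertices left in the queue by pair elimination on a
    random graph with edge probability p (only the queue height is kept)."""
    rng = random.Random(seed)
    height = 0
    for _ in range(n):
        if height > 0 and rng.random() < p:
            height -= 1      # the new vertex bonds with the last queued one
        else:
            height += 1      # no bond: the new vertex joins the queue
    return height / n


if __name__ == "__main__":
    for p in (0.25, 0.5, 0.7):
        print(p, greedy_unmatched_fraction(100_000, p))
```

For $p$ above 1/2 the fraction is close to zero, while for smaller $p$ it is close to $1 - 2p$; the latter is only an upper bound on the defect, since the greedy procedure need not be optimal.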
We can improve this upper bound of 1/2 on the critical probability if we consider another queueing process, where new symbols arrive in $m$-batches, $m > 1$, and for each batch a maximal complete matching is selected and deleted. In this case the drift of the queue decreases as $m$ grows.
For $m = 3$, the only case where the simple pair elimination gives lower efficiency is the following one. The symbols 1, 2, 3 are in the queue (in this order) and the symbols 4, 5, 6 arrive (in this order). The matchable pairs are $(3, 4)$, $(4, 5)$, $(3, 6)$. The optimal matching is $(3, 6)$, $(4, 5)$, whereas the pair elimination procedure deletes only one pair, $(3, 4)$.
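
The six-vertex example can be checked directly. This is a small sketch, assuming vertices numbered 1 to 6 and a graph given as a list of edges; both function names are hypothetical.

```python
from functools import lru_cache


def max_noncrossing_pairs(n, edges):
    """Largest number of edges in a crossing-free matching on 1..n."""
    edge_set = {frozenset(e) for e in edges}

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best matching on the vertices i, i+1, ..., j.
        if i >= j:
            return 0
        result = best(i + 1, j)                      # vertex i unmatched
        for k in range(i + 1, j + 1):                # vertex i matched with k
            if frozenset((i, k)) in edge_set:
                result = max(result, 1 + best(i + 1, k - 1) + best(k + 1, j))
        return result

    return best(1, n)


def pair_elimination_pairs(n, edges):
    """Number of pairs deleted by the simple pair elimination (a queue
    where each new vertex is compared with the last queued one)."""
    edge_set = {frozenset(e) for e in edges}
    queue, pairs = [], 0
    for v in range(1, n + 1):
        if queue and frozenset((queue[-1], v)) in edge_set:
            queue.pop()
            pairs += 1
        else:
            queue.append(v)
    return pairs


if __name__ == "__main__":
    edges = [(3, 4), (4, 5), (3, 6)]
    print(pair_elimination_pairs(6, edges), max_noncrossing_pairs(6, edges))
```

The simple elimination finds one pair, while the optimal crossing-free matching on the same graph contains two.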
The resulting upper bound on $p^*$ is slightly below 0.49. Numerical random tests demonstrate that, for $m = 200$, the upper bound is below 0.4, but the direct proof for $m > 5$ requires too much computing power.
One can also prove that $s_p < 1$ for all $p < 0.25$. This can be seen from the Dyck path argument [1]. The Dyck paths correspond bijectively to the complete matchings. Recall that the Catalan number
$$C_n \sim \frac{4^n}{\sqrt{\pi}\, n^{3/2}}$$
is exactly the number of Dyck paths of length $2n$.
If the probability of a given Dyck path realization is below $1/C_{N/2}$ (in terms of $2^N$), then there are not enough perfect matchings for $N$-words. This probability is equal to $p^{N/2}$, and $C_{N/2} \approx 2^N$, which gives the required lower bound: the product $2^N p^{N/2}$ tends to zero exactly when $p < 1/4$.
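
The counting argument amounts to a first-moment estimate: the expected number of perfect crossing-free matchings is $C_{N/2}\, p^{N/2}$. The sketch below evaluates its binary logarithm with exact Catalan numbers; the function names are hypothetical and $N$ is assumed to be even.

```python
from math import comb, log2


def catalan(n):
    """Catalan number C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def log2_expected_perfect_matchings(n_vertices, p):
    """log2 of C_{N/2} * p^(N/2), the expected number of perfect
    crossing-free matchings on N vertices (N assumed even)."""
    half = n_vertices // 2
    return log2(catalan(half)) + half * log2(p)


if __name__ == "__main__":
    for p in (0.2, 0.25, 0.3):
        print(p, log2_expected_perfect_matchings(2000, p))
```

A negative value means that the expected number of perfect matchings is below 1, so a perfect matching is unlikely to exist.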
For non-perfect matchings, just notice that such a matching becomes perfect after elimination of the $d(W)$ unmatched symbols. The number of such eliminations is given by $\binom{N}{d(W)}$, and this can be bounded from above by the entropy bound, again.
The lower bound on $p^*$ can also be improved by the following argument. Let us estimate the number of Dyck paths that correspond to a perfectly matchable word. Let us demonstrate that this number exceeds $2^{N/4}$ (in expectation).

Indeed, we can extract a perfect matching interval of length 4 from each perfect matching (take an appropriate summit of the corresponding Dyck path). The matching pattern of this interval is either $(1, 2), (3, 4)$ or $(1, 4), (2, 3)$. In either one of these cases, the other one is also possible with probability $p^2$; hence, two versions of the Dyck path correspond to the same graph on $\{1, \dots, N\}$. Then we repeat the procedure until the word $W$ is exhausted. As a result, we get about $2^{\frac{N}{4} p^2}$ Dyck paths. This argument raises the lower bound for $p^*$ to 0.255.
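
One possible way to turn this argument into a number, stated here as an assumption rather than as the authors' computation, is the following. If every perfectly matchable graph carries about $2^{(N/4) p^2}$ Dyck paths, while the expected number of realized Dyck paths is about $2^N p^{N/2}$, then perfect matchability requires $1 + \frac12 \log_2 p - \frac{p^2}{4} \ge 0$. The sketch solves this equation by bisection; the function names are hypothetical.

```python
from math import log2


def rate(p):
    """Per-vertex exponent 1 + (1/2) log2 p - p^2 / 4 (an assumed reading
    of the counting argument, not a formula from the original text)."""
    return 1 + 0.5 * log2(p) - p * p / 4


def threshold(lo=0.2, hi=0.3, iterations=60):
    """Bisection for the root of rate(p) on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if rate(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print(threshold())
```

Under this reading the root lies slightly above 0.255, which agrees with the bound quoted in the text.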
The lower bound can be improved further if we extract longer perfect matching intervals from the perfect matching of $W$. We conjecture that, in this way, both the lower and the upper bounds can be brought arbitrarily close to $p^*$. Numerical results suggest that the corresponding phase transition actually happens at $p^*$ close to $p = 0.38$.
We also conjecture that other phase transitions happen at the same value $p^*$. These are, for instance:
1. For the probability of a long even-length word to be perfectly matchable to jump from 0 to 1;
2. For the mean length of a maximal perfect matching interval containing a given symbol to reach $+\infty$;
3. The same for an interval starting at a given symbol;
4. For the probability of a given symbol to be not covered by any perfect matching to become zero;
5. For the probability of an even-length word to have zero defect to jump from 0 to 1;
6. For the existence of a stationary complete cross-less matching on $\mathbb{Z}$.
7 Multicolor crossless matchings
Let $G$ be an unoriented finite graph on the vertex set $A$. For a word $W$ in the alphabet $A$, the graph on $\{1, \dots, |W|\}$ is generated by $G$: the positions $i$ and $j$ are connected if and only if $w_i$ and $w_j$ are connected in $G$. The problem is whether this model admits an asymptotically complete cross-less matching.
The following algorithm for constructing such a matching on a right-infinite word (with positions indexed by $\mathbb{Z}_+$) is suggested. At each step we add the next $m$ symbols from $W$ to the current queue and then delete a maximal complete matching from this extended queue. We conjecture that the model admits an asymptotically complete cross-less matching if and only if, for some $m$, this algorithm is stable (ergodic up to the parity of the queue).
A weaker property of $G$ (together with the inflow $W$) is the existence of any stationary complete matching (with crossings allowed). This is possible, presumably, if and only if the following stability condition holds for each independent subset $B \subset A$:
$$\lambda(B) < \lambda(\bar B),$$
where $\lambda(B)$ is the total arrival rate of the symbols of $B$ and
$$\bar B = \{a \in A : (a, b) \in G \text{ for some } b \in B\}$$
is the 1-neighborhood of $B$.
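
For a small alphabet the stability condition can be tested by enumerating all independent subsets. The sketch below is illustrative only: it assumes that arrival rates are given as exact fractions per letter and that the graph is a list of unordered edges (a loop makes its letter non-independent); all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def neighborhood(subset, edges):
    """1-neighborhood of a subset: letters adjacent to some letter in it."""
    result = set()
    for a, b in edges:
        if a in subset:
            result.add(b)
        if b in subset:
            result.add(a)
    return result


def is_independent(subset, edges):
    """True if no edge (including loops) has both ends in the subset."""
    return not any(a in subset and b in subset for a, b in edges)


def violating_subsets(rates, edges):
    """Independent subsets B whose rate is not strictly below the rate
    of their 1-neighborhood."""
    letters = sorted(rates)
    bad = []
    for size in range(1, len(letters) + 1):
        for combo in combinations(letters, size):
            subset = set(combo)
            if not is_independent(subset, edges):
                continue
            own = sum(rates[a] for a in subset)
            around = sum(rates[a] for a in neighborhood(subset, edges))
            if own >= around:
                bad.append(combo)
    return bad


if __name__ == "__main__":
    uniform3 = {a: Fraction(1, 3) for a in (1, 2, 3)}
    print(violating_subsets(uniform3, [(1, 2), (2, 3), (1, 3)]))

    uniform4 = {a: Fraction(1, 4) for a in (1, 2, 3, 4)}
    print(violating_subsets(uniform4, [(1, 2), (3, 4)]))
```

With uniform rates the triangle graph satisfies the strict condition for every independent subset, whereas the RNA-type graph with matches (1, 2) and (3, 4) sits exactly on the boundary, where the two rates are equal.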
References
[1] O. V. Valba, M. V. Tamm, and S. K. Nechaev. New alphabet-dependent
morphological transition in random RNA alignment. Phys. Rev. Lett.,
109:018102, Jul 2012.