DNA5
DNA5
,
the fraction of unmatched vertices equals zero asymptotically. We prove
that $0.255 < p^*$ [...] $d^*(W) = \min_{M \in \mathcal{M}(W)} d(M)$
(the minimal defect of matchings of $W$) and
$d(N) = \mathbf{E}_{W \in \mathcal{W}_N}\, d^*(W)$
(the expectation of the minimal defect on $N$-words).
Given the cardinality $|A| = n$ of the alphabet, we define the limit
$$ r_n = \lim_{N \to \infty} \frac{d(N)}{N}, $$
if it exists. Clearly, $r_n$ depends on the cardinality of $A$ only. Actually, the
existence of $r_n$ follows immediately from the subadditivity of $d^*(W)$ and,
hence, of $d(N)$:
$$ d^*(W_1 W_2) \le d^*(W_1) + d^*(W_2), \qquad d(W_1 W_2) \le d(W_1) + d(W_2). $$
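The step from subadditivity to the existence of the limit is the standard Fekete argument. A short sketch follows, writing $d^*(W)$ for the minimal defect of a word and $d(N)$ for its expectation over $N$-words, and assuming that the letters of a random word are independent and uniform, so that a random $(N_1 + N_2)$-word is the concatenation of independent random $N_1$- and $N_2$-words $W_1$, $W_2$:

```latex
\begin{align*}
d(N_1 + N_2) &= \mathbf{E}\, d^{*}(W_1 W_2)
  \le \mathbf{E}\, d^{*}(W_1) + \mathbf{E}\, d^{*}(W_2)
  = d(N_1) + d(N_2), \\
r_n &= \lim_{N \to \infty} \frac{d(N)}{N}
  = \inf_{N \ge 1} \frac{d(N)}{N}
  \qquad \text{(Fekete's lemma, since } d(N) \ge 0\text{)}.
\end{align*}
```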
3 Ternary alphabet
Let $|A| = 3$. The numerical tests suggest that $r_3 \approx 0.022$, and our main
goal is to prove that $r_3 > 0$. This would imply $r_k > 0$ for all $k \ge 3$ by
the following monotonicity argument. Let $M$ be an optimal matching on a
word $W$ in, for instance, a 4-letter alphabet. If we remove all $w_i = 4$ from
$W$ and all the edges 4-4 from $M$, then the remaining matching $M'$ on the
remaining word $W'$ [...] $d^*(W) \ge d^*(W')$ [...] $|W'| = \frac{3}{4}|W|$,
we get $r_4 \ge \frac{3}{4} r_3$.
On the other hand, $r_2 = 0$, since we can match two adjacent symbols of
the same color and then delete this pair from the word. Then we repeat this
action step by step, until there are no more adjacent symbols of the same
color. The remaining word $W'$ [...] $d^*(W') \le 2$ since [...].
For a word $W = w_1 \dots w_N$, we define the pair elimination procedure as
follows. Find the leftmost matchable pair $w_i = w_{i+1}$ within $W$ and delete it,
that is, reduce $W$ to $w_1 \dots w_{i-1} w_{i+2} \dots w_N$. Repeat this action till there are
no more matchable pairs to delete. Denote the final reduced word by $W'$.
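The procedure can be sketched with a stack. Reading the word from left to right, an incoming symbol that equals the top of the stack forms the leftmost adjacent equal pair of the current word and is deleted together with it. The function name `pair_eliminate` and the representation of words as strings are assumptions made for illustration.

```python
def pair_eliminate(word):
    """Return the reduced word obtained by repeatedly deleting
    the leftmost adjacent pair of equal symbols."""
    stack = []
    for symbol in word:
        if stack and stack[-1] == symbol:
            # The stack holds a reduced prefix, so (top, symbol) is the
            # leftmost matchable pair of the current word: delete it.
            stack.pop()
        else:
            stack.append(symbol)
    return "".join(stack)


if __name__ == "__main__":
    for w in ["1221", "1212", "12331", "1223321"]:
        print(w, "->", repr(pair_eliminate(w)))
```

The single pass runs in time linear in the length of the word.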
Proposition 3.1. $d(W') = d(W)$.
Proof. It suffices to prove that for each pair $w_i = w_{i+1}$ in $W$ there exists a
maximal matching on $W$ that contains this pair. Let us consider an optimal
matching $M$ of $W$. First, if, say, $w_i$ is unmatched in $M$, we can match $w_{i+1}$
with $w_i$ instead of its current partner and the defect remains unchanged.
Now suppose that $w_i$ is matched with $w_a$ and $w_{i+1}$ is matched with $w_b$
in $M$. Then either $a < i < i+1 < b$ or $b < a < i < i+1$ (or, symmetrically,
$i < i+1 < b < a$) by the cross-less nature of $M$. Hence we can match, instead,
$w_i$ with $w_{i+1}$ and $w_a$ with $w_b$ (note that $w_a = w_i = w_{i+1} = w_b$),
and no new crossings arise.
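The key step of the proof, that deleting any adjacent pair of equal symbols leaves the minimal defect unchanged, can be checked exhaustively on short words. The sketch below is not part of the original argument. It computes the minimal defect with a standard interval recursion over non-crossing matchings of equal letters; the names `min_defect` and `check_adjacent_pairs` are hypothetical.

```python
from functools import lru_cache
from itertools import product


def min_defect(word):
    """Minimal number of unmatched symbols over all non-crossing
    matchings that join equal letters of `word`."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Minimal defect of the factor word[i:j].
        if i >= j:
            return 0
        result = 1 + best(i + 1, j)  # leave word[i] unmatched
        for k in range(i + 1, j):
            if word[k] == word[i]:   # match word[i] with word[k]
                result = min(result, best(i + 1, k) + best(k + 1, j))
        return result

    return best(0, len(word))


def check_adjacent_pairs(max_len=7, alphabet="123"):
    """Return a word for which deleting some adjacent equal pair changes
    the minimal defect, or None if no such word exists up to max_len."""
    for n in range(2, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            for i in range(n - 1):
                if w[i] == w[i + 1] and min_defect(w[:i] + w[i + 2:]) != min_defect(w):
                    return w
    return None


if __name__ == "__main__":
    print(min_defect("1212"), check_adjacent_pairs())
```

The recursion takes cubic time in the word length, which is ample for an exhaustive check on short ternary words.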
For an initial word $W$, let $W'$ [...] $\lim_{N \to \infty} \mathbf{E}|W'| \big/ (N/3) = 1$. The symbols of $W'$ are distributed
according to the Markov process with the transition probabilities $p(a, b) = 1/2$
for all pairs $a \ne b$. The leftmost symbol is 1, 2, or 3 with equal probability
1/3. Hence, all the repetition-free words of a given length are equiprobable.
Proof. The word $W'$ [...]. If $U_{i-1} \ne \emptyset$, then it is independent of $w_i$ and
$$ \mathbf{E}(|U_i| - |U_{i-1}|) = 1/3 $$
(random walk with a positive drift on $\mathbb{Z}_+$). Hence, $\mathbf{E}|U_N| \sim N/3$ as $N \to \infty$
(obviously, $\mathbf{E}|U_N| \ge N/3$ for each $N$, which is sufficient for our purposes).
The Markov property follows as well, by induction on $i$.
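The drift computation can be illustrated by evaluating the expected reduced length exactly. The sketch assumes that $U_i$ denotes the reduced word after the first $i$ letters of a word with independent uniform letters. Under that assumption the length $|U_i|$ is a Markov chain: from 0 it moves to 1, and from any positive height it moves down with probability 1/3 (the new letter equals the last letter of $U_{i-1}$) and up with probability 2/3. The function name is hypothetical.

```python
from fractions import Fraction


def expected_reduced_length(n_steps, alphabet_size=3):
    """Exact expectation of |U_N| for the length chain described above."""
    down = Fraction(1, alphabet_size)   # new letter cancels the last one
    dist = {0: Fraction(1)}             # distribution of |U_0|
    for _ in range(n_steps):
        nxt = {}
        for height, prob in dist.items():
            if height == 0:
                # Empty reduced word: the new letter is always appended.
                nxt[1] = nxt.get(1, 0) + prob
            else:
                nxt[height - 1] = nxt.get(height - 1, 0) + prob * down
                nxt[height + 1] = nxt.get(height + 1, 0) + prob * (1 - down)
        dist = nxt
    return sum(height * prob for height, prob in dist.items())


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        value = expected_reduced_length(n)
        print(n, float(value), float(value / n))
```

Exact rational arithmetic avoids rounding questions when comparing the result with $N/3$.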
4 The main theorem
Let us prove our main assertion.
Theorem 4.1. $r_3 > 0$.
Proof. After the pair elimination, it remains to prove the assertion for uniformly
distributed ternary words without repetitions. Let $M$ be an optimal
matching of such a word $V$. Let us choose an unmatched point $w_i$ and eliminate
it together with all the matchable pairs $(w_{i-1}, w_{i+1}), \dots, (w_{i-m}, w_{i+m})$
until $(w_{i-m-1}, w_{i+m+1})$ is either unmatchable or exceeds the limits of word $V$.
The remaining word $V'$ [...]
$$ \sum_{k=0}^{\alpha N} \binom{N}{k} \le 2^{H(\alpha) N}, $$
where
$$ H(\alpha) = -\alpha \log \alpha - (1 - \alpha) \log(1 - \alpha). $$
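The entropy bound on a partial sum of binomial coefficients can be checked numerically. The sketch assumes that the bound reads $\sum_{k \le \alpha N} \binom{N}{k} \le 2^{H(\alpha) N}$ with base-2 logarithms and $0 \le \alpha \le 1/2$, writing `alpha` for the fraction parameter. The function names are hypothetical.

```python
from math import comb, log2


def binary_entropy(x):
    """H(x) = -x log2 x - (1 - x) log2 (1 - x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)


def entropy_bound_holds(n, alpha):
    """Check sum_{k <= alpha * n} C(n, k) <= 2^(H(alpha) * n) on a log scale."""
    k_max = int(alpha * n)
    partial_sum = sum(comb(n, k) for k in range(k_max + 1))
    return log2(partial_sum) <= binary_entropy(alpha) * n


if __name__ == "__main__":
    for n in (10, 50, 200):
        for alpha in (0.05, 0.1, 0.25, 0.5):
            print(n, alpha, entropy_bound_holds(n, alpha))
```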
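Now, if we set $K = \alpha N$, the number of repetition-free $N$-words that can
be constructed from $K$ odd-length palindromes is bounded from above by
the product
$$ \Pi(\alpha) = 2^{\frac{1+\alpha}{2} N} \cdot 2^{H(\alpha) N} \cdot 2^{\frac{1-\alpha}{2} H\left(\frac{2\alpha}{1-\alpha}\right) N}. $$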
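Recall that the number of repetition-free ternary words equals $\frac{3}{2}\, 2^N$ and
conclude that, for any $\alpha > 0$ that satisfies the inequality
$$ \frac{1+\alpha}{2} + \frac{1-\alpha}{2}\left( -\frac{2\alpha}{1-\alpha}\log\frac{2\alpha}{1-\alpha} - \left(1 - \frac{2\alpha}{1-\alpha}\right)\log\left(1 - \frac{2\alpha}{1-\alpha}\right) \right) + \left( -\alpha\log\alpha - (1-\alpha)\log(1-\alpha) \right) < 1, \tag{4.1} $$
the fraction of words that can be eliminated in $K = [\alpha N]$ steps tends to
zero and, hence, the mean defect of $V$ exceeds $\alpha N$ for $N$ large enough. It
remains to notice that
$$ \lim_{\alpha \to 0} \alpha \log \alpha = 0, \qquad \lim_{\alpha \to 0} (1-\alpha)\log(1-\alpha) = 0, $$
and, hence, (4.1) holds for all $\alpha > 0$ small enough.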
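The range of admissible parameters in (4.1) can be located numerically. The sketch assumes that the left-hand side of (4.1) is $\frac{1+\alpha}{2} + \frac{1-\alpha}{2} H\!\left(\frac{2\alpha}{1-\alpha}\right) + H(\alpha)$ with base-2 logarithms, where `alpha` stands for the fraction parameter, and it finds by bisection the point where this expression reaches 1. The function names and the search interval are choices made for illustration.

```python
from math import log2


def entropy(x):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)


def lhs(alpha):
    """Assumed left-hand side of inequality (4.1)."""
    beta = 2 * alpha / (1 - alpha)
    return (1 + alpha) / 2 + (1 - alpha) / 2 * entropy(beta) + entropy(alpha)


def threshold(lo=1e-9, hi=0.1, iterations=60):
    """Bisection for the point where lhs crosses 1.
    Assumes lhs(lo) < 1 < lhs(hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if lhs(mid) < 1:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    a = threshold()
    print(a, a / 3)
```

Under these assumptions the threshold comes out slightly below 0.045, so that one third of it is close to 0.015.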
We may also interpret the complete elimination process as follows. At
each step, a palindrome is extracted from the current word. All even-length
palindromes are free of charge, and each odd-length palindrome costs 1 rouble.
The goal is to find the cheapest way to delete the word. Still more
elementary, we may operate with palindromes of length 1 and 2 only.
The proof of Theorem 4.1 gives a lower bound $\alpha/3 = 0.015$ for $r_3$ whereas
the numerical tests suggest $r_3 \approx 0.022$. The gap is, apparently, due to the
fact that a word $V$ can be constructed from $K$ palindromes in many ways.
For instance, 1212 has 2 different palindrome schemes for $K = 2$: [1][212]
and [121][2].
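The example can be reproduced by enumeration. The sketch assumes that a palindrome scheme here means a factorization of the word into $K$ consecutive non-empty palindromic blocks; this matches the example, but may be narrower than the elimination process used in the proof. The function name is hypothetical.

```python
def palindrome_schemes(word, k):
    """All ways to cut `word` into exactly k consecutive non-empty
    palindromic blocks, in left-to-right order of the first cut."""
    if k == 0:
        return [[]] if not word else []
    schemes = []
    for cut in range(1, len(word) + 1):
        head = word[:cut]
        if head == head[::-1]:  # head is a palindrome
            for rest in palindrome_schemes(word[cut:], k - 1):
                schemes.append([head] + rest)
    return schemes


if __name__ == "__main__":
    for scheme in palindrome_schemes("1212", 2):
        print("".join("[" + block + "]" for block in scheme))
```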
5 RNA-type case
Here $|A| = 4$ and only the matches (1, 2) and (3, 4) are allowed. We prove
that the mean fraction of unmatched symbols in an optimal matching is
separated from 0 as $N \to \infty$. The proof is mainly the same as for $r_3$,
since, by the same argument, the pair elimination procedure again does not
affect the defect of the word. Note that this is no longer true in the case of a
general matching graph, as the following example demonstrates.
Let $A = \{1, 2\}$, $G = \{(1, 1), (1, 2)\}$ (matching graph), and let $W = 1122$.
Then the pair elimination process matches $w_1$ with $w_2$ and leaves $w_3$ and $w_4$
unmatched. The optimal matching is, however, $(w_1, w_4)$ and $(w_2, w_3)$. As a
result, we cannot derive a general criterion for completeness of a matching
scheme that is generated by a general matching graph $G$.
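The counterexample can be verified directly. The sketch generalizes pair elimination and the optimal non-crossing matching to an arbitrary symmetric matching graph given as a predicate. The names `greedy_defect`, `optimal_defect`, and `matchable` are hypothetical, and the edges of $G$ are treated as unordered pairs.

```python
from functools import lru_cache


def greedy_defect(word, matchable):
    """Defect left by pair elimination: repeatedly delete the
    leftmost adjacent matchable pair."""
    stack = []
    for symbol in word:
        if stack and matchable(stack[-1], symbol):
            stack.pop()
        else:
            stack.append(symbol)
    return len(stack)


def optimal_defect(word, matchable):
    """Minimal defect over all non-crossing matchings allowed by `matchable`."""

    @lru_cache(maxsize=None)
    def best(i, j):
        if i >= j:
            return 0
        result = 1 + best(i + 1, j)  # leave word[i] unmatched
        for k in range(i + 1, j):
            if matchable(word[i], word[k]):
                result = min(result, best(i + 1, k) + best(k + 1, j))
        return result

    return best(0, len(word))


G = {("1", "1"), ("1", "2")}  # matching graph from the example


def matchable(a, b):
    return (a, b) in G or (b, a) in G


if __name__ == "__main__":
    print(greedy_defect("1122", matchable), optimal_defect("1122", matchable))
```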
6 Random matching
As a generalization, the most natural case is that of a random graph on
$\{1, \dots, N\}$ (this is equivalent to a random graph on a countable number
of colors). The parameter $p \in [0, 1]$ is the probability of a given pair $i, j$ to
be connected by an edge. We denote by $s_p$ the corresponding limit of scaled
expected defects. It is not hard to demonstrate that $s_p = 1$ for all $p \ge 1/2$,
see the pair elimination procedure above.
We can improve this upper bound if we consider another queueing process,
where new symbols arrive in $m$-batches, $m > 1$, and for each batch a
maximal complete matching is selected and deleted. In this case the drift of
the queue decreases as $m$ grows.
For $m = 3$, the only case where the simple pair elimination gives lower
efficiency is the following one. The symbols 1, 2, 3 are in the queue (in this
order) and the symbols 4, 5, 6 arrive (in this order). Matchable pairs are
(3, 4), (4, 5), (3, 6). The optimal matching is (3, 6), (4, 5), whereas the pair
elimination procedure deletes only one pair (3, 4).
The resulting upper bound on $p^*$ [...] is exactly
the number of Dyck paths of length $2n$.
If the probability of a given Dyck path realization is below $1/C_{N/2}$ (in
terms of $2^N$), then there are not enough perfect matchings for $N$-words.
This probability is equal to $p^{N/2}$, and $C_{N/2} \sim 2^N$, which gives the required
lower bound.
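The counting argument can be written as a first-moment estimate. The sketch assumes that $N$ is even, that $C_n$ denotes the $n$-th Catalan number (the number of perfect non-crossing matchings on $2n$ points), and that the $N/2$ arcs of a fixed matching are present independently, each with probability $p$:

```latex
\begin{align*}
\mathbf{E}\,\#\{\text{perfect non-crossing matchings}\}
  &= C_{N/2}\, p^{N/2},
  \qquad C_{n} = \frac{1}{n+1}\binom{2n}{n} \le 4^{n}, \\
C_{N/2}\, p^{N/2} &\le (4p)^{N/2} \longrightarrow 0
  \quad \text{as } N \to \infty \text{ whenever } p < \tfrac{1}{4}.
\end{align*}
```

By Markov's inequality, for $p < 1/4$ a perfect non-crossing matching then exists only with vanishing probability.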
For non-perfect matchings just notice that such a matching becomes perfect
after elimination of $d(W)$ unmatched symbols. The number of such
eliminations is given by $\binom{N}{d(W)}$, and this one can be bounded from above by
the entropy bound, again.
The lower bound on $p^*$ [...] to 0.255.
The lower bound can be improved further if we extract longer perfect
matching intervals from the perfect matching of W. We conjecture that, in
this way, both the lower and the upper bounds can be brought arbitrarily
close to $p^*$ [...] close to $p = 0.38$.
We also conjecture that other phase transitions happen at the same value
$p^*$