String Matching Algorithms

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 46

Design and Analysis of Algorithms

Spring 2020

National University of Computer and Emerging Sciences, Islamabad


String Matching Algorithms
References
• Intro. to Algorithms by Cormen et al

• Algorithms on Strings, Trees, and Sequences. Comp.


Science & Computational Biology by Dan Gusfield.
Section 1.1 & 1.2

• Exact String Matching Algorithms.


http://www-igm.univ-mlv.fr/~lecroq/string/index.html
String Matching: Introduction
• Finding all occurrences of a pattern in
some text
• This problem arises frequently in text editing
programs.
• Efficient algorithms improve responsiveness
• In molecular biology, biological molecules
can often be approximated as sequences
of nucleotides or amino acids.
• Search for particular patterns in DNA
sequences
• Two Types:
• Exact String Matching
• Approximate String Matching
Exact String Matching Problem

• Assume text is any array T[1..n] of length n


• pattern is an array P[1..m] of length m≤n

• Elements of T & P are characters drawn from a


finite alphabet ∑. e.g., ∑ = {0,1} or ∑= {a,b,…,z}.
• P & T are strings of characters.
Exact String Matching Problem

• P occurs with shift s in T or P occurs


beginning at position s+1 in text T if
0 ≤ s ≤ n-m and T[s+1..s+m] = P[1..m]
• i.e., T[s+j] = P[j], for 1 ≤ j ≤ m

• If P occurs with shift s, s is a valid shift;


otherwise we call s an invalid shift.
• String matching problem: finding all valid
shifts with which a given pattern P occurs in
a given text T
Exact String Matching
Problem
Approximate String Matching

• Given a text to search (T) and a pattern to look for


(P).
• Find all of the occurrences of P that exist in T,
allowing a defined number of errors to be present
in the matches

mispeld

misspelled
Importance

• Not a problem for small input


• Word Processors, utilities, in library catalog
searching programs
• Large Input
• internet crawlers, digital libraries, e-journals.
• Several hundred specialized databases holding raw
DNA, RNA and amino acid strings (e.g., US
GenBank)
• When applied on DNA databases, search may take
hours or days.
Notation & terminology
• ∑* = set of all finite-length strings from
alphabet ∑
• ε = empty string (zero length string)
• |x| = length of string x
• xy = concatenation of strings x and y, length
is |x|+|y|
• w is a prefix of x, w╘ x, if x=wy, where y Є ∑*
• w is a suffix of x, w ╛x, if x=yw, where y Є ∑*
Notation & terminology (cont’d)
• ε is both a suffix and a prefix of every string
• ab ╘ abcca
• cca ╛abcca
• We denote k-character prefix P[1..k] of the pattern
P[1..m] by Pk.
• So Po = ε and Pm = P = P[1..m]
• Similarly, k-character prefix of text T as Tk
• String matching prob.: find all shifts in the range
0 ≤ s ≤ n-m such that P ╛Ts+m
Overlapping Suffices
• x, y and z are strings and x ╛ z and y ╛ z if:
• |x|<=|y| then x ╛ y
• |x|>=|y| then y ╛ x
• |x|=|y| then x = y
String Comparison
• The test “x=y” is assumed to take time
Θ(t+1) where t is the length of the longest
string z such that z╘ x and z╘ y.

• t=0, when z = ε
Naïve string matching Algo.
• Finds all valid shifts using a loop that tests
P[1..m] = T[s+1..s+m] for each of n-m+1
values of s
• Takes Θ((n-m+1)m) time.
• If m=n/2, it becomes Θ(n2)
Naïve string matching Algo.
(pseudocode)
Naïve string matching Algo. (Cont’d)
Why is naïve algo. inefficient?
• It tests for each of the n-m+1 possible
values of s
• The info. gained about text for one value of
s is ignored in considering other values of
s
• e.g., if P=aaab and we find that s=0 is valid,
then s=1,2 or 3 are invalid, since T[4]=b
The Rabin-Karp Algorithm

• Performs well in practice


• generalizes to other algorithms for related problems
such as 2D pattern matching
• Two steps:
• Preprocessing Θ(m)
• Matching Θ((n-m+1)m)
• Based on certain assumptions, average case
running time is better
• Makes use of number theoretic notions,
equivalence of two numbers modulo a third number
The Rabin-Karp Algorithm
(Cont’d)
• Assume that ∑ = {0,1,2,…,9}, each character is
a decimal digit.
• In general, each char. is a digit in radix-d
notation, where d=|∑|
• A string of k digits = a length-k decimal number
• Given pattern P[1..m], let p denote its decimal
value
• Let ts denote decimal value of length m
substring T[s+1..s+m] for s=0,1,…,n-m
• ts=p if and only if T[s+1..s+m] = P[1..m]
The Rabin-Karp Algorithm
(Cont’d)
1. Compute p in Θ(m) time
2. Compute all ts values in Θ((n-m)+1) time
3. Then, we can determine all valid shifts in
time Θ(m) + Θ((n-m)+1) = Θ(n)

• Lets not worry that p and ts’s might be


very large
• It takes lg(a) bits to encode number a
• very large=do not fit in a computer word
The Rabin-Karp Algorithm
(Cont’d)
• We can compute p in Θ(m) time, using Horner’s
rule.
p = P[m] + 10(P[m-1] + 10(P[m-2]+…+10(P[2] + 10P[1])…))

• e.g., P[1..m]=“3457”, here m=4


• p =7+10(5+10(4+10x3))
• p =7+10(5+10(4x30)
• p =7+10(5+340)
• p =7+3450
• p =3457
The Rabin-Karp Algorithm
(Cont’d)
• Similarly compute to in Θ(m) time from T[1..m]
• Use to to compute each of t1 ,t2 ,… , tn-m in
constant time, which totals to Θ(n-m) time.

• ts+1 = 10(ts – 10m-1T[s+1]) + T[s+m+1]


• e.g., m=5 and ts=31415 and T[s+5+1]=2, s=0
• ts+1 = 10(31415 – 10000.3) + 2
• = 14152
The Rabin-Karp Algorithm
(Cont’d)
• If p and ts are very large to work with,
mathematical operations on P do not take
“constant time”
• Simple cure, perform all operations modulo a
suitable modulus q
• q is chosen, such that dq fits in computer word,
where d=|∑| and ∑={0,1,…,d-1}
• We might get spurious hits, since tsΞp (mod q)
does not imply that ts = p
The Rabin-Karp Algorithm
(Cont’d)
• It takes Θ(m) preprocessing time and Θ((n-m+1)m)
matching time in the worst case.

• In many applications, we expect few valid shifts


(some constant c), then the running time is O((n-
m+1) + cm) + time for spurious hits

• and if q is large enough (q≥m), we can reduce


spurious hits, which gives us O(n+m) running
time.
String Matching with finite
automata

• These algorithms build a finite automaton that


scans the text string T for all occurrences of
pattern P.
• Each text character is examined only once
• Time to build the automaton can be large if ∑ is
large.
Finite Automata
• A finite automaton M is a 5-tuple (Q, qo, A,
∑, δ), where
• Q is finite set of states,
• qo Q is the start state,
• A  Q is a distinguished set of accepting
states,
• ∑ is finite input alphabet,
• δ is a function from Q x ∑ into Q, called the
transition function of M
Finite Automata (Cont’d)
• Suppose M is in state qo.
• It reads char. a, it moves from state q to
state δ(q,a)
• Whenever current state q A, the machine
M has accepted the string read so far.
• An input that is not accepted is said to be
rejected
Final-State Function

• The automaton M has a final-state function  from


∑* to Q, such that:
 (w) is the state, M ends up in, after scanning the
string w.
• M accepts string w if and only if (w)  A.
• It is defined recursively as follows:
 (w) = qo if w= ε
 (wa) = δ((w), a) for w  ∑*, a ∑
String Matching Automata

• Every pattern P has finite automaton


• It must be built in the preprocessing step
• In order to do so, we first define a function called
suffix-function , corresponding to P
• It is a mapping from ∑* to {0,1,…,m} such that:
 (x) = length of the longest prefix of P that is a
suffix of x
 (x) = max{k : Pk ╛x}
String Matching Automata
• Suffix function is well defined, since Po = ε
is a suffix of every string.

• If P=ab, then:
 (ε) = 0
 (x) = 0
 (ccaca) = 1, (ccab) = 2
• String Matching Automaton corresponding
to a given pattern is defined as:
• State Set Q = { 0, 1, 2, ….. M}
• Transition function
• δ(q, a) = (Pqa)
Transition function
δ(q, a) = (Pqa)
Knuth-Morris-Pratt Algorithm

• Runs in linear time.


• Avoids the computation of the transition function
δ
• uses another function π[1..m] pre-computed from
the pattern P[1..m] in Θ(m) time
• π[q] contains info. equal to δ(q,a) for any q=0,1,
…,m and all a  ∑

38
Prefix Function π

• The prefix function π for a pattern holds


knowledge about how the pattern matches against
shifts of itself.
• This info. can be used to avoid testing useless
shifts that the naïve algorithm does
• π contains only m entries, where as δ is a table of
m|∑| = md entries

39
Prefix function

If we precompute prefix function of


P (against itself), then whenever
a mismatch occurs, the prefix function
can determine which shift(s) are invalid
and directly ruled out. So move directly
to the shift which is potentially valid.
However, there is no need to compare
these characters again since they are
equal.
Prefix Function π

• If P[1..q] = T[s+1..s+q], what is the least shift s’>s


such that:

P[1..k] = T[s’+1..s’+k] where s’+k=s+q ?

• Such a shift s’ is not necessarily invalid, due to the


knowledge of T[s+1..s+q]
• In the best case, s’ = s+q, ruling out s+1, s+2 ,
…,s+q-1.

41
Prefix Function π

• This info. can be precomputed by comparing the


pattern with itself.

• Given P[1..m], π : {1,2,…,m}  {0,1,…m-1} such


that :
• π[q] = max{ k : k<q and Pk ╛ Pq}

42
Running Time of String Matching Algorithms

You might also like