String Matching Algorithms

Design and Analysis of Algorithms
Spring 2020
National University of Computer and Emerging Sciences, Islamabad

String Matching Algorithms
References
• Introduction to Algorithms by Cormen et al.
• Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology by Dan Gusfield, Sections 1.1 & 1.2
• Exact String Matching Algorithms:
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
String Matching: Introduction
• Finding all occurrences of a pattern in
some text
• This problem arises frequently in text editing
programs.
• Efficient algorithms improve responsiveness
• In molecular biology, biological molecules
can often be approximated as sequences
of nucleotides or amino acids.
• Search for particular patterns in DNA
sequences
• Two Types:
• Exact String Matching
• Approximate String Matching
Exact String Matching Problem
• Assume the text is an array T[1..n] of length n
• The pattern is an array P[1..m] of length m ≤ n
• Elements of T and P are characters drawn from a finite alphabet ∑,
e.g., ∑ = {0,1} or ∑ = {a,b,…,z}
• P and T are strings of characters
Exact String Matching Problem
• P occurs with shift s in T (equivalently, P occurs
beginning at position s+1 in T) if
0 ≤ s ≤ n-m and T[s+1..s+m] = P[1..m]
• i.e., T[s+j] = P[j] for 1 ≤ j ≤ m
• If P occurs with shift s, then s is a valid shift;
otherwise s is an invalid shift
• String matching problem: find all valid
shifts with which a given pattern P occurs in
a given text T
Exact String Matching
Problem
Approximate String Matching
• Given a text to search (T) and a pattern to look for (P),
find all occurrences of P in T, allowing a defined
number of errors in the matches
• e.g., mispeld can be matched against misspelled
Importance
• Not a problem for small input
• Word processors, utilities, library catalog
searching programs
• Large input
• Internet crawlers, digital libraries, e-journals
• Several hundred specialized databases holding raw
DNA, RNA and amino acid strings (e.g., US
GenBank)
• When applied to DNA databases, a search may take
hours or days
Notation & terminology
• ∑* = set of all finite-length strings over
alphabet ∑
• ε = empty string (zero-length string)
• |x| = length of string x
• xy = concatenation of strings x and y; its length
is |x|+|y|
• w is a prefix of x, written w ⊏ x, if x = wy for some y ∈ ∑*
• w is a suffix of x, written w ⊐ x, if x = yw for some y ∈ ∑*
Notation & terminology (cont’d)
• ε is both a suffix and a prefix of every string
• ab is a prefix of abcca (ab ⊏ abcca)
• cca is a suffix of abcca (cca ⊐ abcca)
• We denote the k-character prefix P[1..k] of the pattern
P[1..m] by P_k
• So P_0 = ε and P_m = P = P[1..m]
• Similarly, the k-character prefix of the text T is T_k
• String matching problem: find all shifts s in the range
0 ≤ s ≤ n-m such that P is a suffix of T_{s+m} (P ⊐ T_{s+m})
Overlapping Suffixes
• Suppose x, y and z are strings such that x and y are both
suffixes of z (x ⊐ z and y ⊐ z):
• If |x| ≤ |y|, then x ⊐ y
• If |x| ≥ |y|, then y ⊐ x
• If |x| = |y|, then x = y
String Comparison
• The test “x = y” is assumed to take time
Θ(t+1), where t is the length of the longest
string z that is a prefix of both x and y (z ⊏ x and z ⊏ y)
• t = 0 when z = ε
Naïve string matching Algo.
• Finds all valid shifts using a loop that tests
P[1..m] = T[s+1..s+m] for each of the n-m+1
possible values of s
• Takes Θ((n-m+1)m) time in the worst case
• If m = n/2, this becomes Θ(n²)
Naïve string matching Algo.
(pseudocode)
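The loop described above can be sketched in Python as follows. This is a minimal illustration, not the slide's original pseudocode: the function name `naive_string_matcher` is an assumption, and because Python strings are 0-indexed, a returned shift s means the pattern occupies `text[s:s+m]`, which corresponds to T[s+1..s+m] in the slide notation.

```python
def naive_string_matcher(text: str, pattern: str) -> list:
    """Return every valid shift s (0-based) of pattern in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n-m+1 candidate shifts in turn.
    for s in range(n - m + 1):
        j = 0
        # Compare character by character; stop at the first mismatch.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched, so s is a valid shift
            shifts.append(s)
    return shifts


print(naive_string_matcher("acaabc", "aab"))
```

If the pattern is longer than the text, the range is empty and no shifts are reported.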
Naïve string matching Algo. (Cont’d)
Why is naïve algo. inefficient?
• It tests each of the n-m+1 possible
values of s
• The information gained about the text for one value of
s is ignored when considering other values of
s
• e.g., if P = aaab and we find that s = 0 is valid,
then none of the shifts s = 1, 2 or 3 is valid, since T[4] = b
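One way to see the cost of ignoring earlier comparisons is to count character comparisons on an unfavourable input. The sketch below is illustrative only: `count_comparisons` is a hypothetical helper, and the input (a text of all a's searched for aaab) is chosen so that every shift matches three characters before failing on the fourth.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count the character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # mismatch: abandon this shift, keep nothing learned
    return comparisons


# n = 20, m = 4: each of the 17 shifts costs 4 comparisons.
print(count_comparisons("a" * 20, "aaab"))
```

Here the count equals (n-m+1)m, the worst-case bound quoted for the naive algorithm.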
The Rabin-Karp Algorithm
• Performs well in practice
• Generalizes to other algorithms for related problems,
such as 2D pattern matching
• Two steps:
• Preprocessing: Θ(m)
• Matching: Θ((n-m+1)m) in the worst case
• Under reasonable assumptions, the average-case
running time is better
• Makes use of number-theoretic notions, such as the
equivalence of two numbers modulo a third number
(Cont’d)
• Assume that ∑ = {0,1,2,…,9}, so each character is
a decimal digit
• In general, each character is a digit in radix-d
notation, where d = |∑|
• A string of k digits = a length-k decimal number
• Given pattern P[1..m], let p denote its decimal
value
• Let t_s denote the decimal value of the length-m
substring T[s+1..s+m], for s = 0,1,…,n-m
• t_s = p if and only if T[s+1..s+m] = P[1..m]
(Cont’d)
1. Compute p in Θ(m) time
2. Compute all t_s values in Θ(n-m+1) time
3. Then we can determine all valid shifts in
time Θ(m) + Θ(n-m+1) = Θ(n)
• For now, let's not worry that p and the t_s values might be
very large
• It takes about lg(a) bits to encode a number a
• Very large = does not fit in a computer word
(Cont’d)
• We can compute p in Θ(m) time using Horner’s
rule:
p = P[m] + 10(P[m-1] + 10(P[m-2] + … + 10(P[2] + 10P[1])…))
• e.g., P[1..m] = “3457”, here m = 4
• p = 7 + 10(5 + 10(4 + 10×3))
• p = 7 + 10(5 + 10(4 + 30))
• p = 7 + 10(5 + 340)
• p = 7 + 3450
• p = 3457
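Horner's rule as used above can be sketched in a few lines of Python. The function name `horner_value` and the `radix` parameter are assumptions for illustration; the sketch also assumes every character is a digit 0-9 that is valid in the chosen radix.

```python
def horner_value(digits: str, radix: int = 10) -> int:
    """Evaluate a digit string left to right with Horner's rule."""
    value = 0
    for ch in digits:
        # Shift the running value one position left, then add the new digit.
        value = radix * value + int(ch)
    return value


print(horner_value("3457"))  # 3457
```

The loop performs one multiplication and one addition per character, which is where the Θ(m) bound comes from.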
(Cont’d)
• Similarly, compute t_0 in Θ(m) time from T[1..m]
• Use t_0 to compute each of t_1, t_2, …, t_{n-m} in
constant time, which totals Θ(n-m) time
• t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]
• e.g., m = 5, s = 0, t_s = 31415 (so T[s+1] = 3) and T[s+5+1] = 2
• t_{s+1} = 10(31415 - 10000×3) + 2
• = 14152
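The constant-time update can be sketched as below, assuming a text made of decimal digits. The name `window_values` is hypothetical; with 0-based Python indexing, T[s+1] is `text[s]` and T[s+m+1] is `text[s + m]`.

```python
def window_values(text: str, m: int) -> list:
    """Return the decimal values t_0, t_1, ..., t_{n-m} of all length-m windows."""
    n = len(text)
    if m == 0 or m > n:
        return []
    high = 10 ** (m - 1)  # weight of the leading digit of a window
    t = 0
    for ch in text[:m]:   # t_0 via Horner's rule, Θ(m)
        t = 10 * t + int(ch)
    values = [t]
    for s in range(n - m):
        # Drop the leading digit, shift left, and append the next digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```

This version keeps the full values, so it ignores the word-size concern; it only illustrates the recurrence.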
(Cont’d)
• If p and t_s are too large to work with,
arithmetic operations on p do not take
“constant time”
• Simple cure: perform all operations modulo a
suitable modulus q
• q is chosen such that dq fits in a computer word,
where d = |∑| and ∑ = {0,1,…,d-1}
• We might get spurious hits, since t_s ≡ p (mod q)
does not imply that t_s = p
• Each hit must therefore be verified by checking
P[1..m] = T[s+1..s+m] explicitly
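Putting the pieces together, a Rabin-Karp matcher working modulo q might look like the sketch below. It assumes a text and pattern of decimal digits; the function name, the default modulus q = 13 (deliberately small, to make spurious hits likely) and the variable `h` (d^{m-1} mod q) are illustrative choices, not values fixed by the slides.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 10, q: int = 13) -> list:
    """Return all valid shifts (0-based), comparing window values modulo q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # d^(m-1) mod q, weight of the leading digit
    p = 0                 # pattern value mod q
    t = 0                 # current window value mod q
    for i in range(m):    # preprocessing: Horner's rule mod q, Θ(m)
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal residues are only a hit; verify to rule out a spurious one.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Rolling update mod q; Python's % always yields a value in [0, q).
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415"))
```

In this example the window 67399 has the same residue mod 13 as 31415, so it is a spurious hit that the explicit comparison rejects.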
(Cont’d)
• It takes Θ(m) preprocessing time and Θ((n-m+1)m)
matching time in the worst case
• In many applications we expect few valid shifts
(some constant c); then the running time is
O((n-m+1) + cm) plus the time spent on spurious hits
• If q is large enough (q ≥ m), the expected number of
spurious hits is small, which gives an expected
running time of O(n+m)
String Matching with Finite Automata
• These algorithms build a finite automaton that
scans the text string T for all occurrences of
pattern P
• Each text character is examined exactly once
• The time to build the automaton can be large if ∑ is
large
Finite Automata
• A finite automaton M is a 5-tuple (Q, q_0, A,
∑, δ), where
• Q is a finite set of states,
• q_0 ∈ Q is the start state,
• A ⊆ Q is a distinguished set of accepting
states,
• ∑ is a finite input alphabet,
• δ is a function from Q × ∑ into Q, called the
transition function of M
Finite Automata (Cont’d)
• M begins in the start state q_0
• When it is in state q and reads character a, it moves
to state δ(q,a)
• Whenever the current state q ∈ A, the machine
M has accepted the string read so far
• An input that is not accepted is said to be
rejected
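The behaviour just described can be simulated directly. In the sketch below the transition function is stored as a dictionary keyed by (state, character); the two-state automaton is an assumed example (it accepts exactly the strings ending in an odd number of a's) and does not come from the slides.

```python
def final_state(delta: dict, start, w: str):
    """Return the state reached after scanning w from the start state."""
    state = start
    for a in w:
        state = delta[(state, a)]  # one transition per input character
    return state


# Hypothetical automaton: Q = {0, 1}, start state 0, accepting set A = {1}.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 0}
accepting = {1}

print(final_state(delta, 0, "abaaa") in accepting)  # True: accepted
print(final_state(delta, 0, "abbaa") in accepting)  # False: rejected
```

Each character triggers exactly one dictionary lookup, so scanning a string of length n takes Θ(n) time once the table exists.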
Final-State Function
• The automaton M has a final-state function φ from
∑* to Q, such that
φ(w) is the state M ends up in after scanning the
string w
• M accepts string w if and only if φ(w) ∈ A
• It is defined recursively as follows:
φ(ε) = q_0
φ(wa) = δ(φ(w), a) for w ∈ ∑*, a ∈ ∑
String Matching Automata
• Every pattern P has a string-matching automaton
• It must be built in the preprocessing step
• To do so, we first define a function called the
suffix function σ, corresponding to P
• It is a mapping from ∑* to {0,1,…,m} such that
σ(x) = length of the longest prefix of P that is a
suffix of x
σ(x) = max{k : P_k ⊐ x}
String Matching Automata
• The suffix function is well defined, since P_0 = ε
is a suffix of every string
• If P = ab, then:
σ(ε) = 0
σ(x) = 0
σ(ccaca) = 1, σ(ccab) = 2
• The string-matching automaton corresponding
to a given pattern is defined as:
• State set Q = {0, 1, 2, …, m}
• Start state q_0 = 0; state m is the only accepting state
• Transition function
• δ(q, a) = σ(P_q a)
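The suffix function can be computed by brute force straight from its definition. The sketch below is for illustration only (the name `suffix_function` is an assumption, and this is not an efficient method): it tries every prefix length k and keeps the largest one whose prefix is a suffix of x.

```python
def suffix_function(pattern: str, x: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    longest = min(len(pattern), len(x))
    # k = 0 always qualifies, because the empty prefix is a suffix of any x.
    return max(k for k in range(longest + 1) if x.endswith(pattern[:k]))


print(suffix_function("ab", ""))       # 0
print(suffix_function("ab", "ccaca"))  # 1
print(suffix_function("ab", "ccab"))   # 2
```

The printed values agree with the P = ab examples listed above.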
Transition function
δ(q, a) = σ(P_q a), where σ is the suffix function of P
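A direct, unoptimized way to build this transition table and run the automaton over a text is sketched below. The function names and the dictionary representation are assumptions, and the sketch uses the standard construction in which the start state is 0 and state m is the only accepting state. Every text character must belong to the supplied alphabet, otherwise the lookup raises a KeyError.

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict:
    """Build delta[(q, a)] = length of the longest prefix of pattern
    that is a suffix of pattern[:q] + a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            candidate = pattern[:q] + a
            k = min(m, q + 1)
            # Shrink k until P_k is a suffix of P_q a (k = 0 always works).
            while not candidate.endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


def finite_automaton_matcher(text: str, delta: dict, m: int) -> list:
    """Scan text once and report every 0-based shift where state m is reached."""
    q = 0
    shifts = []
    for i, a in enumerate(text):
        q = delta[(q, a)]
        if q == m:  # the last m characters read spell the pattern
            shifts.append(i - m + 1)
    return shifts


delta = compute_transition_function("ababaca", "abc")
print(finite_automaton_matcher("abababacaba", delta, 7))
```

The matching loop is Θ(n), but this table construction is slow and grows with the alphabet size, which is the cost the slides warn about for large ∑.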
Knuth-Morris-Pratt Algorithm
• Matching runs in linear time, Θ(n)
• Avoids computing the transition function δ
altogether
• Uses an auxiliary function π[1..m], precomputed from
the pattern P[1..m] in Θ(m) time
• For any state q = 0,1,…,m and any character a ∈ ∑,
π[q] contains the information needed to compute δ(q,a)
that does not depend on a
Prefix Function π
• The prefix function π for a pattern encapsulates
knowledge about how the pattern matches against
shifts of itself
• This information can be used to avoid testing the useless
shifts that the naïve algorithm tests
• π contains only m entries, whereas δ is a table of
about m|∑| = md entries
Prefix function
If we precompute the prefix function of
P (by matching P against itself), then whenever
a mismatch occurs, the prefix function
tells us which shifts are invalid
and can be ruled out directly, so we move straight
to the next shift that is potentially valid.
At that shift there is no need to compare
the already-matched characters again, since they are
known to be equal.
Prefix Function π
• If P[1..q] = T[s+1..s+q], what is the least shift s’ > s
such that, for some k < q,
P[1..k] = T[s’+1..s’+k], where s’+k = s+q?
• Such a shift s’ is the first shift greater than s that is not
necessarily invalid, given our
knowledge of T[s+1..s+q]
• In the best case, s’ = s+q, ruling out shifts s+1, s+2,
…, s+q-1
Prefix Function π
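• This information can be precomputed by comparing the
pattern with itself
• Given P[1..m], the prefix function is
π : {1,2,…,m} → {0,1,…,m-1} such that
• π[q] = max{k : k < q and P_k is a suffix of P_q}
• i.e., π[q] is the length of the longest prefix of P
that is a proper suffix of P_q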
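The definition of π can be turned into code along the following lines. This is a sketch of the standard Knuth-Morris-Pratt procedure rather than anything given on the slides; the function names are assumptions, and because Python lists are 0-indexed, `pi[q - 1]` stores the slide's π[q].

```python
def compute_prefix_function(pattern: str) -> list:
    """Return pi where pi[q-1] is the length of the longest prefix of
    pattern that is a proper suffix of pattern[:q]."""
    m = len(pattern)
    pi = [0] * m
    k = 0  # length of the currently matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]  # fall back to the next shorter candidate prefix
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list:
    """Return all valid shifts (0-based) using the prefix function."""
    m = len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]  # skip shifts that cannot be valid
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = pi[q - 1]  # keep going to find overlapping occurrences
    return shifts


print(compute_prefix_function("ababaca"))
print(kmp_matcher("abababacaba", "ababaca"))
```

The text index never moves backwards, and each fallback through `pi` undoes an earlier increment of q, which is the intuition behind the linear running time.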
Running Time of String Matching Algorithms
