AOA Module 6 - String of Algorithms - Aeraxia - in


Q1) Knuth-Morris-Pratt Algorithm

KMP Algorithm for Pattern Matching


The KMP algorithm solves the pattern matching problem, which is the task of finding all the occurrences of a given pattern in a text. It is most useful when the text has to be scanned for every occurrence of the pattern. For instance, if the text is "aabbaaccaabbaadde" and the pattern is "aabbaa", then the pattern occurs twice in the text, at indices 0 and 8.
The naive solution to this problem is to compare the pattern with every possible substring of the text, starting from the leftmost position and moving rightwards. This takes O(n*m) time, where 'n' is the length of the text and 'm' is the length of the pattern.
When we work with long text documents, the brute force approach results in many redundant comparisons. To avoid such redundancy, Knuth, Morris, and Pratt developed a linear-time pattern matching algorithm, known as the KMP (Knuth-Morris-Pratt) pattern matching algorithm.

How does KMP Algorithm work?


The KMP algorithm scans the text from left to right. It uses the prefix function to avoid unnecessary comparisons while searching for the pattern. For each index i of the pattern, this function stores the length of the longest proper prefix of the pattern that is also a suffix of pattern[0..i], which is known as the LPS value. The following steps are involved in KMP Algorithm −
1. Compute the prefix function (LPS array) of the pattern.
2. Slide the pattern over the text for comparison.
3. If all the characters match, we have found a match.
4. If not, use the prefix function to skip the unnecessary comparisons. Suppose the mismatch occurs at pattern index j. If j is 0, move on to the next character in the text. Otherwise look up the LPS value of the character before the mismatch: if it is '0', restart the comparison from index 0 of the pattern against the same text character; if it is more than '0', resume the comparison from the pattern index equal to that LPS value.
The KMP algorithm takes O(n + m) time and O(m) space. It is faster than the naive solution because the text index never moves backwards: a text character may be compared more than once, but the total number of comparisons is at most 2n.
Input:
main String: "AAAABCAAAABCBAAAABC"
pattern: "AAABC"
Output:
Pattern found at position: 1
Pattern found at position: 7
Pattern found at position: 14
Example
The following example practically illustrates the KMP algorithm for pattern matching.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// function to fill the prefix (LPS) array of the pattern
void prefixSearch(const char* pat, int m, int* pps) {
    int length = 0; // length of the current prefix that is also a suffix
    pps[0] = 0;     // a single character has no proper prefix
    int i = 1;
    while (i < m) {
        // check if the current character extends the current prefix
        if (pat[i] == pat[length]) {
            length++;
            pps[i] = length; // store the length in the prefix array
            i++;
        } else if (length != 0) {
            // fall back to the next shorter prefix and retry the same i
            length = pps[length - 1];
        } else {
            // no prefix matches, store 0 in the prefix array
            pps[i] = 0;
            i++;
        }
    }
}

// function to search for pattern; match positions go to locArray,
// the number of matches goes to *loc
void patrnSearch(const char* orgnString, const char* patt, int m, int* locArray, int* loc) {
    int n = (int)strlen(orgnString), i = 0, j = 0;
    *loc = 0; // initialize the location index
    if (m == 0 || m > n)
        return; // nothing to search for
    // allocate memory for the prefix array
    int* prefixArray = (int*)malloc(m * sizeof(int));
    if (prefixArray == NULL)
        return;
    // calling prefix function to fill the prefix array
    prefixSearch(patt, m, prefixArray);
    while (i < n) {
        // if main string character matches pattern character, advance both
        if (orgnString[i] == patt[j]) {
            i++;
            j++;
        }
        if (j == m) {
            // whole pattern matched: store the location of the pattern
            locArray[*loc] = i - j;
            (*loc)++;
            // continue from the previous prefix value
            j = prefixArray[j - 1];
        } else if (i < n && patt[j] != orgnString[i]) {
            if (j != 0)
                j = prefixArray[j - 1]; // reuse the matched prefix
            else
                i++; // no prefix to reuse, move to the next text character
        }
    }
    free(prefixArray); // free the memory of the prefix array
}

int main(void) {
    // declare the original text
    const char* orgnStr = "AAAABCAEAAABCBDDAAAABC";
    // pattern to be found
    const char* patrn = "AAABC";
    // get the size of the pattern
    int m = (int)strlen(patrn);
    // array to store the locations of the pattern
    int locationArray[strlen(orgnStr)];
    // to store the number of locations
    int index;
    // calling pattern search function
    patrnSearch(orgnStr, patrn, m, locationArray, &index);
    // loop through location array and print each location
    for (int i = 0; i < index; i++) {
        printf("Pattern found at location: %d\n", locationArray[i]);
    }
    return 0;
}
Output
Pattern found at location: 1
Pattern found at location: 8
Pattern found at location: 17

Q2) Rewrite and Compare Rabin Karp and Knuth Morris Pratt Algorithms
Give the pseudo code for the KMP String Matching Algorithm
The Rabin-Karp-Algorithm

The Rabin-Karp string matching algorithm calculates a hash value for the pattern, as well as for each M-character subsequence of the text to be compared. If the hash values are unequal, the algorithm moves on and determines the hash value for the next M-character subsequence. If the hash values are equal, the algorithm compares the pattern and the M-character subsequence character by character. In this way, there is only one hash comparison per text subsequence, and character matching is only required when the hash values match.

RABIN-KARP-MATCHER (T, P, d, q)
1. n ← length [T]
2. m ← length [P]
3. h ← d^(m-1) mod q
4. p ← 0
5. t_0 ← 0
6. for i ← 1 to m
7.     do p ← (d·p + P[i]) mod q
8.        t_0 ← (d·t_0 + T[i]) mod q
9. for s ← 0 to n-m
10.     do if p = t_s
11.         then if P [1.....m] = T [s+1.....s + m]
12.             then "Pattern occurs with shift" s
13.        If s < n-m
14.         then t_(s+1) ← (d (t_s - T [s+1] h) + T [s+m+1]) mod q

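Example: For string matching with working modulus q = 11, how many spurious hits does the Rabin-Karp matcher encounter in the text T = 31415926535.......

T = 31415926535.......

P = 26

Here T.length = 11 and q = 11 (the modulus is given, it is not derived from the length of T)

And P mod q = 26 mod 11 = 4

Now find every 2-digit window of T whose value mod q equals 4. Each such window is either the exact match of P or a spurious hit.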

Solution:
Complexity:
The running time of RABIN-KARP-MATCHER is O((n-m+1)m) in the worst case, but it has a good average-case running time. If the expected number of valid shifts is small, O(1), and the prime q is chosen to be quite large, then the Rabin-Karp algorithm can be expected to run in time O(n+m) plus the time required to process spurious hits.

Knuth, Morris and Pratt introduced a linear time algorithm for the string matching problem. A matching time of O (n) is achieved by avoiding comparisons with elements of 'S' that have previously been involved in a comparison with some element of the pattern 'p' to be matched, i.e., backtracking on the string 'S' never occurs.

Components of KMP Algorithm:

1. The Prefix Function (Π): The prefix function Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern 'p.' In other words, it enables avoiding backtracking on the string 'S.'

2. The KMP Matcher: With string 'S,' pattern 'p' and prefix function 'Π' as inputs, it finds the occurrences of 'p' in 'S' and returns the number of shifts of 'p' after which the occurrences are found.

The Prefix Function (Π)

The following pseudo code computes the prefix function, Π:


COMPUTE-PREFIX-FUNCTION (P)
1. m ← length [P] // 'P' is the pattern to be matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5.     do while k > 0 and P [k + 1] ≠ P [q]
6.         do k ← Π [k]
7.        If P [k + 1] = P [q]
8.         then k ← k + 1
9.        Π [q] ← k
10. Return Π

Running Time Analysis:


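In the above pseudo code for calculating the prefix function, the for loop from step 4 to step 9 runs m - 1 times. Steps 1 to 3 take constant time. The inner while loop does not change the bound: k grows by at most 1 per iteration of the for loop and every execution of step 6 decreases it, so step 6 runs fewer than m times in total. Hence the running time of computing the prefix function is O (m).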

Example: Compute Π for the pattern 'p' below:

Solution:

Initially: m = length [p] = 7


Π [1] = 0
k=0
After 6 iterations (q = 2 to 7), the prefix function computation is complete.

The KMP Matcher:


The KMP Matcher, with the pattern 'P,' the text 'T' and the prefix function 'Π' as input, finds the matches of P in T. The following pseudo code computes the matching component of the KMP algorithm:

KMP-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. Π ← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0 // number of characters matched
5. for i ← 1 to n // scan T from left to right
6.     do while q > 0 and P [q + 1] ≠ T [i]
7.         do q ← Π [q] // next character does not match
8.        If P [q + 1] = T [i]
9.         then q ← q + 1 // next character matches
10.        If q = m // is all of P matched?
11.         then print "Pattern occurs with shift" i - m
12.              q ← Π [q] // look for the next match

Running Time Analysis:


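The for loop beginning in step 5 runs 'n' times, once per character of the text 'T.' Step 3 takes O (m) time and the other setup steps take constant time. By the same amortized argument as for the prefix function, the while loop of steps 6 and 7 executes at most n times in total. Thus the running time of the matching loop is O (n), and O (n + m) including the preprocessing.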

Example: Given a string 'T' and pattern 'P' as follows:

Let us execute the KMP Algorithm to find whether 'P' occurs in 'T.'

For 'p' the prefix function, Π was computed previously and is as follows:
Solution:

Initially: n = size of T = 15
m = size of P = 7
Pattern 'P' has been found to occur in the string 'T.' The match is completed at i = 13, so the total number of shifts that took place for the match to be found is i - m = 13 - 7 = 6 shifts.
Q3) The Naive String Matching Algorithm
The naïve approach tests all the possible placements of the pattern P [1.......m] relative to the text T [1......n]. We try each shift s = 0, 1.......n-m successively, and for each shift s we compare T [s+1.......s+m] to P [1......m].
The naïve algorithm finds all valid shifts using a loop that checks the condition P [1.......m] = T [s+1.......s+m] for each of the n - m + 1 possible values of s.

NAIVE-STRING-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. for s ← 0 to n - m
4.     do if P [1.....m] = T [s + 1....s + m]
5.         then print "Pattern occurs with shift" s
Analysis: The for loop in lines 3 to 5 executes n - m + 1 times (a shift is only possible while at least m characters of text remain), and each iteration performs up to m comparisons. So the total complexity is O ((n - m + 1) m).

Example:
o Suppose T = 1011101110
o P = 111
o Find all the Valid Shift
Solution: P = 111 occurs at T [3....5] and at T [7....9], so the valid shifts are s = 2 and s = 6.
Q4) Naive algorithm for Pattern Searching
Given a text string of length n and a pattern of length m, the task is to print all occurrences of the pattern in the text.
Note: You may assume that n > m.

Examples:

Input: text = "THIS IS A TEST TEXT", pattern = "TEST"


Output: Pattern found at index 10

Input: text = "AABAACAADAABAABA", pattern = "AABA"


Output: Pattern found at index 0, Pattern found at index 9, Pattern found at index 12
Pattern searching

Slide the pattern over text one by one and check for a match. If a match is found,
then slide by 1 again to check for subsequent matches

#include <stdio.h>
#include <string.h>

void search(const char* pat, const char* txt) {
    int M = (int)strlen(pat);
    int N = (int)strlen(txt);

    // A loop to slide pat[] one by one
    for (int i = 0; i <= N - M; i++) {
        int j;

        // For current index i, check for pattern match
        for (j = 0; j < M; j++) {
            if (txt[i + j] != pat[j]) {
                break;
            }
        }

        // If pattern matches at index i
        if (j == M) {
            printf("Pattern found at index %d\n", i);
        }
    }
}

int main(void) {
    // Example 1
    char txt1[] = "AABAACAADAABAABA";
    char pat1[] = "AABA";
    printf("Example 1:\n");
    search(pat1, txt1);

    // Example 2
    char txt2[] = "agd";
    char pat2[] = "g";
    printf("\nExample 2:\n");
    search(pat2, txt2);

    return 0;
}

Output

Example 1:
Pattern found at index 0
Pattern found at index 9
Pattern found at index 12

Example 2:
Pattern found at index 1

Time Complexity: O((N - M + 1) * M), which is O(N^2) in the worst case

Auxiliary Space: O(1)

Complexity Analysis of Naive algorithm for Pattern Searching:

Best Case: O(n)

• When the first character of the pattern does not match at any position of the text.

• The algorithm then performs a single comparison per shift, n - m + 1 comparisons in total, which is O(n), where n is the length of the text.

Worst Case: O(n^2)

• When almost every comparison succeeds, for example when all characters of the text and the pattern are the same, or when only the last character of the pattern differs.

• The algorithm will perform O((n-m+1)*m) comparisons, where n is the length of the text and m is the length of the pattern. This reaches O(n^2) when m is about n/2.
• In the worst case, for each position in the text, the algorithm may need to compare the entire pattern against the text.

Q5) String Matching with Finite Automata


The string-matching automaton is a very useful tool in string matching algorithms. Once it has been built from the pattern, it examines every character in the text exactly once and reports all the valid shifts in O (n) time. The goal of string matching is to find the location of a specific text pattern within a larger body of text (a sentence, a paragraph, a book, etc.).

Finite Automata:
A finite automaton M is a 5-tuple (Q, q0, A, ∑, δ), where

o Q is a finite set of states,
o q0 ∈ Q is the start state,
o A ⊆ Q is a distinguished set of accepting states,
o ∑ is a finite input alphabet,
o δ is a function from Q x ∑ into Q called the transition function of M.

The finite automaton starts in state q0 and reads the characters of its input string one at a time. If the automaton is in state q and reads input character a, it moves from state q to state δ (q, a). Whenever its current state q is a member of A, the machine M has accepted the string read so far. An input that is not accepted is rejected.

A finite automaton M induces a function φ, called the final-state function, from ∑* to Q such that φ(w) is the state M ends up in after scanning the string w. Thus, M accepts a string w if and only if φ(w) ∈ A.

The function φ is defined recursively as

φ(ε) = q0
φ(wa) = δ(φ(w), a) for w ∈ ∑*, a ∈ ∑

FINITE-AUTOMATON-MATCHER (T, δ, m)

1. n ← length [T]
2. q ← 0
3. for i ← 1 to n
4.     do q ← δ (q, T[i])
5.        If q = m
6.         then s ← i - m
7.              print "Pattern occurs with shift" s

The primary loop structure of FINITE- AUTOMATON-MATCHER implies that its running time on a
text string of length n is O (n).

Computing the Transition Function: The following procedure computes the transition function δ from a given pattern P [1......m]. Here P_k denotes the prefix P [1......k] of the pattern.

COMPUTE-TRANSITION-FUNCTION (P, ∑)
1. m ← length [P]
2. for q ← 0 to m
3.     do for each character a ∈ ∑
4.         do k ← min (m+1, q+2)
5.            repeat k ← k-1
6.            until P_k is a suffix of P_q a
7.            δ(q,a) ← k
8. Return δ

Example: Consider a finite automaton which accepts strings containing an even number of a's, where ∑ = {a, b, c}

Solution:

q0 is the initial state.


Q6) The Boyer-Moore Algorithm
Robert Boyer and J Strother Moore developed it in 1977. The B-M string search algorithm is a particularly efficient algorithm and has served as a standard benchmark for string search algorithms ever since.

The B-M algorithm takes a 'backward' approach: the pattern string (P) is aligned with the start of the text string (T), and the characters of the pattern are then compared from right to left, beginning with the rightmost character.

If the text character being compared does not occur in the pattern at all, no match can be found at any alignment that covers this character, so the pattern can be shifted entirely past the mismatching character.

For deciding the possible shifts, the B-M algorithm uses two preprocessing strategies simultaneously. Whenever a mismatch occurs, the algorithm calculates a shift using both approaches and selects the larger one. It thus makes use of the most effective strategy in each case.

The two strategies are called heuristics of B - M as they are used to reduce the search. They are:

o Bad Character Heuristics


o Good Suffix Heuristics

1. Bad Character Heuristics


This heuristic has two implications:

o Suppose there is a character in the text which does not occur in the pattern at all. When a mismatch happens at this character (called the bad character), the whole pattern can be shifted past it, and matching begins again from the substring next to this 'bad character.'
o On the other hand, the bad character might be present in the pattern. In this case, align its rightmost occurrence in the pattern with the bad character in the text.

Thus in either case the shift may be larger than one.

Example1: Let Text T = <nyoo nyoo> and pattern P = <noyo>


Example 2: If the bad character doesn't exist in the pattern, then the pattern is shifted entirely past it.

Problem in Bad-Character Heuristics:


In some cases, Bad-Character Heuristics produces some negative shifts.
For Example:

This means that we need some extra information to produce a shift on encountering a bad character. This information is the last position of every character in the pattern, together with the set of characters used in the pattern (often called the alphabet ∑ of the pattern).

COMPUTE-LAST-OCCURRENCE-FUNCTION (P, m, ∑)
1. for each character a ∈ ∑
2.     do λ [a] ← 0
3. for j ← 1 to m
4.     do λ [P [j]] ← j
5. Return λ

2. Good Suffix Heuristics:


A good suffix is a suffix of the pattern that has matched the text successfully. After a mismatch for which the bad character heuristic gives a negative shift, check whether the matched suffix occurs again elsewhere in the pattern. If it does, we can jump forward so that this other occurrence lines up with the part of the text that has already matched.

Example:
COMPUTE-GOOD-SUFFIX-FUNCTION (P, m)
1. Π ← COMPUTE-PREFIX-FUNCTION (P)
2. P' ← reverse (P)
3. Π' ← COMPUTE-PREFIX-FUNCTION (P')
4. for j ← 0 to m
5.     do ɣ [j] ← m - Π [m]
6. for l ← 1 to m
7.     do j ← m - Π' [l]
8.        If ɣ [j] > l - Π' [l]
9.         then ɣ [j] ← l - Π' [l]
10. Return ɣ

BOYER-MOORE-MATCHER (T, P, ∑)
1. n ← length [T]
2. m ← length [P]
3. λ ← COMPUTE-LAST-OCCURRENCE-FUNCTION (P, m, ∑)
4. ɣ ← COMPUTE-GOOD-SUFFIX-FUNCTION (P, m)
5. s ← 0
6. While s ≤ n - m
7.     do j ← m
8.        While j > 0 and P [j] = T [s + j]
9.         do j ← j-1
10.        If j = 0
11.         then print "Pattern occurs at shift" s
12.              s ← s + ɣ[0]
13.         else s ← s + max (ɣ [j], j - λ[T[s+j]])
Complexity Comparison of String Matching Algorithms:

Algorithm            Preprocessing Time    Matching Time (worst case)
Naive                0                     O((n - m + 1)m)
Rabin-Karp           O(m)                  O((n - m + 1)m)
Finite Automata      O(m|∑|)               O(n)
Knuth-Morris-Pratt   O(m)                  O(n)
Boyer-Moore          O(m + |∑|)            O((n - m + 1)m)
