Pattern Searching
Contents
10 Check if a string can be formed from another string using given constraints
32 Replace all occurrences of string AB with C without using extra space
36 Kasai’s Algorithm for Construction of LCP array from Suffix Array
Chapter 1
1, 8, 54, 384...
Examples:
Input : 3
Output : 54
For N = 3
Nth term = ( 3*3) * 3!
= 54
Input : 2
Output : 8
On observing carefully, the Nth term in the above series can be generalized as:
Nth term = ( N*N ) * N!
Chapter 1. Find Nth term of the series 1, 8, 54, 384…
#include <iostream>
using namespace std;
// calculate factorial of N
int fact(int N)
{
int i, product = 1;
for (i = 1; i <= N; i++)
product = product * i;
return product;
}
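// calculate Nth term of series
int nthTerm(int N)
{
return (N * N) * fact(N);
}
// Driver Function
int main()
{
int N = 4;
cout << nthTerm(N);
return 0;
}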
Java
import java.io.*;
return (N * N) * fact(N);
}
System.out.println(nthTerm(N));
}
}
Python 3
# Python 3 program to find
# N-th term of the series:
# 1, 8, 54, 384…
# calculate factorial of N
def fact(N):
    product = 1
    for i in range(1, N + 1):
        product = product * i
    return product
# calculate Nth term of series
def nthTerm(N):
    return (N * N) * fact(N)
# Driver Code
if __name__ == "__main__":
    N = 4
    print(nthTerm(N))
# This code is contributed
# by ChitraNayal
C#
class GFG
{
public static int fact(int N)
{
int i, product = 1;
// Calculate factorial of N
// Driver Code
public static void Main(String[] args)
{
int N = 4; // 4th term is 384
Console.WriteLine(nthTerm(N));
}
}
PHP
Output:
384
Source
https://www.geeksforgeeks.org/find-nth-term-of-the-series-1-8-54-384/
Chapter 2
0, 2, 4, 8, 12, 18…
Examples:
Input: 3
Output: 4
For N = 3
Nth term = ( 3 + ( 3 - 1 ) * 3 ) / 2
= 4
Input: 5
Output: 12
On observing carefully, the Nth term in the above series can be generalized as:
Nth term = ( N + ( N - 1 ) * N ) / 2
where the division is integer division (the fractional part is discarded).
Chapter 2. Find Nth term of the series 0, 2, 4, 8, 12, 18…
// Driver Function
int main()
{
int N = 5;
cout << nthTerm(N);
return 0;
}
Java
import java.io.*;
System.out.println(nthTerm(N));
}
}
Python 3
def nthTerm(N):
    return (N + N * (N - 1)) // 2
# Driver Code
if __name__ == "__main__":
    N = 5
    print(nthTerm(N))
PHP
<?php
// PHP program to find
// N-th term of the series:
// 0, 2, 4, 8, 12, 18...
// Driver Code
$N = 5;
echo nthTerm($N);
Output:
12
Source
https://www.geeksforgeeks.org/find-nth-term-of-the-series-0-2-4-8-12-18/
Chapter 3
Input : s1 = aab
s2 = aaaab
Output : 6
Substrings of s1 are ["a", "a", "b", "aa",
"ab", "aab"]. All of these are present in s2.
Hence, the answer is 6.
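The counting idea in this example can be sketched in Python as follows. This is a minimal sketch, assuming that every substring of s1 is counted by position (so the two "a" substrings count separately, as in the example); the name count_substrs is a hypothetical helper and is not taken from the original listings.

```python
def count_substrs(s1, s2):
    """Count substrings of s1 (by position) that occur somewhere in s2."""
    ans = 0
    for i in range(len(s1)):
        for j in range(i + 1, len(s1) + 1):
            # s1[i:j] is the substring starting at i and ending before j
            if s1[i:j] in s2:
                ans += 1
    return ans


print(count_substrs("aab", "aaaab"))  # prints 6
```

The sketch relies on Python's built-in substring test (the in operator) rather than a hand-written search, which keeps it short at the cost of repeated scanning of s2.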
Chapter 3. Number of substrings of one string present in other
{
int ans = 0;
// Driver code
int main()
{
string s1 = "aab", s2 = "aaaab";
cout << countSubstrs(s1, s2);
return 0;
}
Java
class GFG
{
s3 += s4[j];
// Driver code
public static void main(String[] args)
{
String s1 = "aab", s2 = "aaaab";
System.out.println(countSubstrs(s1, s2));
}
}
Python 3
# Driver code
if __name__ == "__main__":
    s1 = "aab"
    s2 = "aaaab"
    # function calling
    print(countSubstrs(s1, s2))
C#
// C# program to count number of
// substrings of s1 present in s2.
using System;
class GFG
{
static int countSubstrs(String s1,
String s2)
{
int ans = 0;
for (int i = 0; i < s1.Length; i++)
{
// s3 stores all substrings of s1
String s3 = "";
char[] s4 = s1.ToCharArray();
for (int j = i; j < s1.Length; j++)
{
s3 += s4[j];
// check the presence of s3 in s2
if (s2.IndexOf(s3) != -1)
ans++;
}
}
return ans;
}
// Driver code
public static void Main(String[] args)
{
String s1 = "aab", s2 = "aaaab";
Console.WriteLine(countSubstrs(s1, s2));
}
}
// This code is contributed
// by Kirti_Mangal
PHP
<?php
// PHP program to count number of
// substrings of s1 present in s2.
// Driver code
$s1 = "aab";
$s2 = "aaaab";
echo countSubstrs($s1, $s2);
Output:
6
Source
https://www.geeksforgeeks.org/number-of-substrings-of-one-string-present-in-other/
Chapter 4
Examples:
Input: 3
Output: 15
For N = 3, we know that the factorial of 3 is 6
Nth term = 6*(3+2)/2
= 15
Input: 6
Output: 2880
For N = 6, we know that the factorial of 6 is 720
Nth term = 720*(6+2)/2
= 2880
The idea is to first find the factorial of the given number N, that is N!. The N-th term
in the above series is then:
Nth term = N! * (N + 2) / 2
C++
Chapter 4. Find Nth term of series 1, 4, 15, 72, 420…
// return factorial of N
return fact;
}
// Driver Function
int main()
{
int N = 6;
cout << nthTerm(N);
return 0;
}
Java
class GFG
{
int fact = 1;
// return factorial of N
return fact;
}
// Driver Code
public static void main(String args[])
{
int N = 6;
System.out.println(nthTerm(N));
}
}
Python3
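# Function to find factorial of N
def factorial(N):
    fact = 1
    for i in range(1, N + 1):
        fact = fact * i
    # return factorial of N
    return fact
# calculate Nth term of series
def nthTerm(N):
    return factorial(N) * (N + 2) // 2
# Driver code
if __name__ == "__main__":
    N = 6
    print(nthTerm(N))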
C#
// C# program to find N-th
// term of the series:
// 1, 4, 15, 72, 420
using System;
class GFG
{
// Function to find factorial of N
static int factorial(int N)
{
int fact = 1;
for (int i = 1; i <= N; i++)
fact = fact * i;
// return factorial of N
return fact;
}
// calculate Nth term of series
static int nthTerm(int N)
{
return (factorial(N) * (N + 2) / 2);
}
// Driver Code
public static void Main()
{
int N = 6;
Console.Write(nthTerm(N));
}
}
// This code is contributed by ChitraNayal
PHP
<?php
// PHP program to find
// N-th term of the series:
// 1, 4, 15, 72, 420…
// return factorial of N
return $fact;
}
// Driver code
$N = 6;
// Function Calling
echo nthTerm($N);
Output:
2880
Source
https://www.geeksforgeeks.org/find-nth-term-of-series-1-4-15-72-420/
Chapter 5
Input : forxxorfxdofr
for
Output : 3
Explanation : Anagrams of the word for - for, orf,
ofr appear in the text and hence the count is 3.
Input : aabaabaa
aaba
Output : 4
Explanation : Anagrams of the word aaba - aaba,
abaa each appear twice in the text and hence the
count is 4.
A simple approach is to traverse from start of the string considering substrings of length
equal to the length of the given word and then check if this substring has all the characters
of word.
Chapter 5. Count Occurences of Anagrams
// Initialize result
int res = 0;
return res;
}
// Driver code
public static void main(String args[])
{
String text = "forxxorfxdofr";
String word = "for";
System.out.print(countAnagrams(text, word));
}
}
Output:
An efficient solution is to use a count array to check for anagrams: the counts for the current
window can be constructed from the previous window in O(1) time using the sliding window concept.
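The sliding window idea can be sketched in Python as follows. This is a minimal sketch, not the original listing: it uses collections.Counter in place of a fixed-size count array, and the name count_anagrams is a hypothetical helper chosen for illustration.

```python
from collections import Counter


def count_anagrams(text, word):
    """Count windows of text that are anagrams of word."""
    n, k = len(text), len(word)
    if k == 0 or k > n:
        return 0
    target = Counter(word)
    window = Counter(text[:k])          # counts for the first window
    res = int(window == target)
    for i in range(k, n):
        window[text[i]] += 1            # character entering the window
        left = text[i - k]
        window[left] -= 1               # character leaving the window
        if window[left] == 0:
            del window[left]            # keep the counters comparable
        res += int(window == target)
    return res


print(count_anagrams("forxxorfxdofr", "for"))  # prints 3
print(count_anagrams("aabaabaa", "aaba"))      # prints 4
```

Each step updates two counts instead of recounting the whole window; comparing the two counters costs time proportional to the number of distinct characters.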
count[text.charAt(i)]--;
// Driver code
public static void main(String args[])
{
String text = "forxxorfxdofr";
String word = "for";
System.out.print(countAnagrams(text, word));
}
}
Output:
Source
https://www.geeksforgeeks.org/count-occurences-of-anagrams/
Chapter 6
Minimum characters to be
replaced to remove the given
substring
C++
// mismatch occurs
if (A[i + j] != B[j])
break;
}
}
}
return count;
}
// Driver Code
int main()
{
string str1 = "aaaaaaaa";
string str2 = "aaa";
cout << replace(str1, str2);
return 0;
}
Java
// Java implementation of
// above approach
import java.io.*;
// mismatch occurs
if(i + j >= n)
break;
else if (A.charAt(i + j) != B.charAt(j))
break;
}
}
}
return count;
}
// Driver Code
public static void main(String args[])
{
String str1 = "aaaaaaaa";
String str2 = "aaa";
System.out.println(replace(str1 , str2));
}
}
Output:
Time Complexity: O(len1 * len2), where len1 is the length of the first string and len2 is the
length of the second string.
This problem can also be solved directly with Python’s built-in string method
string1.count(string2), which counts non-overlapping occurrences:
str1 = "aaaaaaaa"
str2 = "aaa"
# inbuilt function
answer = str1.count(str2)
print(answer)
Output:
Improved By : tufan_gupta2000
Source
https://www.geeksforgeeks.org/minimum-characters-to-be-replaced-to-remove-the-given-substring/
Chapter 7
Cryptography deals with manipulating strings (or text) to make them unreadable to an
intermediate person. It provides effective ways to encrypt or decrypt text exchanged with
other parties. Some examples are the Caesar Cipher, Vigenère Cipher, Columnar Cipher, DES,
and AES, and the list continues.
To develop a custom cryptography algorithm, hybrid encryption algorithms can be used.
Hybrid Encryption is a concept in cryptography which combines or merges
cryptography algorithms to generate more effective encrypted text.
Example:
FibBil Cryptography Algorithm
Problem Statement:
Program to generate an encrypted text, by computing Fibonacci Series, adding the terms
of Fibonacci Series with each plaintext letter, until the length of the key.
Algorithm:
For Encryption: Take an input plain text and key from the user, reverse the plain text,
and concatenate the reversed plain text with the key. Copy the resulting string into an array.
After copying, separate the array elements into two parts, EvenArray and OddArray, in which
the elements at even indices are placed in EvenArray and those at odd indices in OddArray.
Start generating the Fibonacci Series F(i) up to the length of the key j such that c = i + j,
where c is the cipher text
Chapter 7. Custom Building Cryptography Algorithms (Hybrid Cryptography)
with mod 26. Append all the cth elements to a CipherString, and the encryption is done.
Because the Fibonacci terms are summed with the letters, this step works like a Caesar Cipher.
For Decryption: Apply the steps of the encryption algorithm in reverse.
Example for the Algorithm:
Input: hello
Key: abcd
Output: riobkxezg
Reverse the input to get olleh, then append the key, i.e. ollehabcd.
EvenString: leac
OddString: olhbd
As the key length is 4, the loop runs 4 times, including FibNum 0, which is
ignored.
For EvenArray Ciphers:
FibNum: 1
In Even Array for l and FibNum 1 cip is k
In Even Array for e and FibNum 1 cip is d
In Even Array for a and FibNum 1 cip is z
In Even Array for c and FibNum 1 cip is b
FibNum: 2
In Even Array for l and FibNum 2 cip is j
In Even Array for e and FibNum 2 cip is c
In Even Array for a and FibNum 2 cip is y
In Even Array for c and FibNum 2 cip is a
FibNum: 3 (Final Computed letters)
In Even Array for l and FibNum 3 cip is i
In Even Array for e and FibNum 3 cip is b
In Even Array for a and FibNum 3 cip is x
In Even Array for c and FibNum 3 cip is z
For OddArray Ciphers
FibNum: 1
In Odd Array for o and FibNum 1 cip is p
In Odd Array for l and FibNum 1 cip is m
In Odd Array for h and FibNum 1 cip is i
In Odd Array for b and FibNum 1 cip is c
In Odd Array for d and FibNum 1 cip is e
FibNum: 2
In Odd Array for o and FibNum 2 cip is q
In Odd Array for l and FibNum 2 cip is n
In Odd Array for h and FibNum 2 cip is j
In Odd Array for b and FibNum 2 cip is d
In Odd Array for d and FibNum 2 cip is f
FibNum: 3 (Final Computed letters)
In Odd Array for o and FibNum 3 cip is r
In Odd Array for l and FibNum 3 cip is o
In Odd Array for h and FibNum 3 cip is k
Program:
import java.util.*;
import java.lang.*;
class GFG {
}
else {
evenString = evenString + Character.toString(stringArray[i]);
}
}
evenArray = new char[evenString.length()];
oddArray = new char[oddString.length()];
else {
cip = p + c;
if (cip > '9')
cip = cip - 9;
}
else {
cip = p + c;
if (cip > 'z') {
cip = cip - 26;
}
}
oddArray[i] = (char)cip;
// Caesar Cipher Algorithm End
}
m++;
}
}
Output:
riobkxezg
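The worked example can be reproduced with a short Python sketch. It is only an illustration under assumptions read off the trace, not the original program: the final shift is taken to be the last Fibonacci term generated (3 for a 4-character key), the letters the example places in OddArray (o, l, h, b, d) are shifted forward, the EvenArray letters (l, e, a, c) are shifted backward, and all input is assumed to be lower-case a-z. The names last_fib_term, shift_letter and fibbil_encrypt are hypothetical.

```python
def last_fib_term(key_length):
    # Assumption: iterate the Fibonacci recurrence key_length - 1 times
    # starting from (0, 1); a 4-character key then gives 3, as in the trace.
    a, b = 0, 1
    for _ in range(key_length - 1):
        a, b = b, a + b
    return b


def shift_letter(ch, shift):
    # Caesar-style shift within 'a'..'z', wrapping around mod 26.
    return chr((ord(ch) - ord("a") + shift) % 26 + ord("a"))


def fibbil_encrypt(plain, key):
    combined = plain[::-1] + key      # reverse the plain text, append the key
    shift = last_fib_term(len(key))
    out = []
    for pos, ch in enumerate(combined):
        # 0-based even positions are the example's OddArray (shift forward);
        # 0-based odd positions are its EvenArray (shift backward).
        out.append(shift_letter(ch, shift if pos % 2 == 0 else -shift))
    return "".join(out)


print(fibbil_encrypt("hello", "abcd"))  # prints riobkxezg
```

Under the same assumptions, decryption would apply the opposite shifts, strip the key from the end, and reverse the remaining text.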
Conclusion:
Hybrid algorithms for cryptography are effective, so it is not easy to detect
the pattern and decode the message. Here, the algorithm combines a mathematical
function with the Caesar Cipher to implement a hybrid cryptography algorithm.
Source
https://www.geeksforgeeks.org/custom-building-cryptography-algorithms-hybrid-cryptography/
Chapter 8
Chapter 8. Program to find all match of a regex in a string
return 0;
}
Output:
Note: The above code runs correctly, but the input string is lost in the
process.
• Using iterator:
An iterator object can be constructed by calling the constructor with three parameters: a string
iterator indicating the starting position of the search, a string iterator indicating the
ending position of the search, and the regex object. A second iterator object constructed
with the default constructor serves as the end-of-sequence iterator.
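#include <bits/stdc++.h>
using namespace std;
int main()
{
string subject("geeksforgeeksabcdefghg"
"eeksforgeeksabcdgeeksforgeeks");
// regex object.
regex re("geeks(for)geeks");
// iterate over every match; the default-constructed
// sregex_iterator is the end-of-sequence iterator
for (sregex_iterator it(subject.begin(), subject.end(), re), end;
it != end; ++it)
cout << it->str() << endl;
return 0;
}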
Output:
Source
https://www.geeksforgeeks.org/program-to-find-all-match-of-a-regex-in-a-string/
Chapter 9
Chapter 9. KLA Tencor Interview Experience | Set 3
cddwww
abc
output:
(0, 0, horizontal)
(2, 0, horizontal)
(2, 0, vertical)
(1, 0, diagonal)
Round 2
This is a 1-hour telephonic interview.
1. Tell me about yourself.
2. Discussion on constructor/destructor
3. Memory leak.
4. Virtual functions and some more C++ questions.
5. Discussion on core dump, corruption, and how to handle all these scenarios.
6. Discussion on malloc() and new.
7. Find the k most frequent words from a file.
https://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Round 3
This is a 1-hour telephonic interview.
1. Discussion on my current project in depth.
2. Leaders in an array
https://www.geeksforgeeks.org/leaders-in-an-array/
3. Find the Number Occurring Odd Number of Times
https://www.geeksforgeeks.org/find-the-number-occurring-odd-number-of-times/
4. Design a contact app for android.(mostly focus on efficient algorithm).
Round 4
Face To Face Interview on Programming Questions Explain on Paper (1-hour)
1. Discussion on my current project in depth.
2. Clone a linked list with next and random pointer (all possible approach).
https://www.geeksforgeeks.org/a-linked-list-with-next-and-arbit-pointer/
3. Find the middle of a given linked list
https://www.geeksforgeeks.org/write-a-c-function-to-print-the-middle-of-the-linked-list/
4. Reverse a linked list
https://www.geeksforgeeks.org/reverse-a-linked-list/
5. Given a number N, you need to make it 0 by subtracting K or 1 from N, with the condition
that K must be selected in such a way that the result after subtracting K is a
factor of N.
Example: for N = 10, first K = 5; after subtracting K from N, 10 - 5 = 5, and 5 is a factor of 10.
Find the minimum number of subtraction operations to make it 0.
6. Some more questions on array and linked list.
Round 5
Face To Face Interview on Programming Questions Explain on Paper (1-hour)
This round was taken by a senior folk.
1. Given an image in the form of a 2D array of numbers, you need to corrupt that image and
return it.
The condition for corruption is that the element at index [x][y] should contain the average of
the surrounding numbers.
Example.
1234
6789 –here in place of 7 –> 4
2345
2. Next Greater Element
https://www.geeksforgeeks.org/next-greater-element/
Round 6
Face To Face Interview on Design and OOPS(1- hour).
This round was taken by a manager.
1. Design class diagram for coffee machine.(with all possible classes object, functions and
type of data)
Most focus on object interaction.
Round 7
This round was taken by a senior manager.
1. Tell me about yourself, your family, and so on.
2. What is the one thing that can make you stay with KLA Tencor?
3. Design ola/uber cost estimation functionality Focus on Factory design pattern.
4. More HR related questions.
Round 8
This round was based on Behavioral Skills taken by senior HR Manager.
1. Tell me about yourself.
2. Why KLA Tencor?
3. More HR related questions.
I prepared mostly from GeeksforGeeks and I want to say thanks to the content writers of
Geeks for providing the best solutions. This is one of the best sites.
Note-
1. Focus mostly on algorithms and on how efficiently you can write them.
2. Also focus on system design questions. To get an idea of, or to get started with, system
design questions, refer to the link given below.
https://www.youtube.com/watch?v=UzLMhqg3_Wc&list=PLrmLmBdmIlps7GJJWW9I7N0P0rB0C3eY2
3. To begin with design patterns, refer to the link given below.
https://www.youtube.com/watch?v=rI4kdGLaUiQ&list=PL6n9fhu94yhUbctIoxoVTrklN3LMwTCmd
4. Ask the interviewers questions at the end of your interview.
5. Before you start writing code, try to explain your algorithm.
Source
https://www.geeksforgeeks.org/kla-tencor-interview-experience-set-3/
Chapter 10
Check if a string can be formed from another string using given constraints
Given two strings S1 and S2 (all characters are in lower-case), the task is to check if S2 can
be formed from S1 using the given constraints:
1. Every character of S2 must be present in S1; if there are two ‘a’ in S2, then S1 should also have two ‘a’.
2. If any character of S2 is not present in S1, check if the previous two ASCII characters
are present in S1, e.g., if ‘e’ is in S2 and not in S1, then ‘c’ and ‘d’ from S1 can be used
to make ‘e’.
Note: Each character of S1 can be used only once.
Examples:
Approach: The above problem can be solved using hashing. The count of every character
in S1 is stored in a hash table. Traverse S2 and, for each character, check whether it is
in the hash table; if so, reduce the count of that character. If
the character is not in the hash table, check whether the previous two ASCII characters are
in the hash table and, if so, reduce the counts of both of them. If all the characters
can be formed from S1 under the given constraints, the
string S2 can be formed from S1; otherwise it cannot be formed.
Below is the implementation of the above approach:
mp[S2[i] - 1]--;
mp[S2[i] - 2]--;
}
else {
return false;
}
}
return true;
}
// Driver Code
int main()
{
string S1 = "abbat";
string S2 = "cat";
Output:
YES
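For comparison, the hashing approach can be sketched in Python. This is a minimal sketch under the constraints stated in this chapter; it uses collections.Counter as the hash table, and can_form is a hypothetical name introduced for illustration.

```python
from collections import Counter


def can_form(s1, s2):
    counts = Counter(s1)                  # available characters of S1
    for ch in s2:
        prev1, prev2 = chr(ord(ch) - 1), chr(ord(ch) - 2)
        if counts[ch] > 0:
            counts[ch] -= 1               # use the character directly
        elif counts[prev1] > 0 and counts[prev2] > 0:
            counts[prev1] -= 1            # build ch from the previous
            counts[prev2] -= 1            # two ASCII characters
        else:
            return False
    return True


print("YES" if can_form("abbat", "cat") else "NO")  # prints YES
```

Here ‘c’ is not in S1, so one ‘a’ and one ‘b’ are consumed to make it; the remaining ‘a’ and the ‘t’ are then used directly.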
Source
https://www.geeksforgeeks.org/check-if-a-string-can-be-formed-from-another-string-using-given-constraints/
Chapter 11
Input :
Chapter 11. Largest connected component on a grid
Approach :
The approach is to visualize the given grid as a graph, with each cell representing a separate
node of the graph and each node connected to the four nodes immediately
up, down, left, and right of it. A BFS from every node of the graph then
finds all the nodes connected to the current node that have the same color value as the current node.
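The BFS idea can be sketched in Python as follows. This is a minimal sketch rather than the chapter's listing: it shares a single visited matrix across all searches (each cell belongs to exactly one component) instead of resetting it for every start cell, it assumes a non-empty rectangular grid, and largest_component is a hypothetical name. The grid is the one used in the driver code of this chapter.

```python
from collections import deque


def largest_component(grid):
    rows, cols = len(grid), len(grid[0])
    visited = [[False] * cols for _ in range(rows)]
    best = 0
    for r in range(rows):
        for c in range(cols):
            if visited[r][c]:
                continue
            # BFS over same-valued cells reachable from (r, c)
            visited[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                cr, cc = queue.popleft()
                size += 1
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not visited[nr][nc]
                            and grid[nr][nc] == grid[r][c]):
                        visited[nr][nc] = True
                        queue.append((nr, nc))
            best = max(best, size)
    return best


grid = [[1, 4, 4, 4, 4, 3, 3, 1],
        [2, 1, 1, 4, 3, 3, 1, 1],
        [3, 2, 1, 1, 2, 3, 2, 1],
        [3, 3, 2, 1, 2, 2, 2, 2],
        [3, 1, 3, 1, 1, 4, 4, 4],
        [1, 1, 3, 1, 1, 4, 4, 4]]
print(largest_component(grid))  # prints 9
```

The sketch returns only the size of the largest component; recording which cells belong to it would need an extra result matrix.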
Here is the graph for above example :
C++
const int n = 6;
const int m = 8;
visited[i][j] = 1;
COUNT++;
// updating result
if (COUNT >= current_max) {
current_max = COUNT;
reset_result(input[i][j], input);
}
reset_visited();
COUNT = 0;
// updating result
if (COUNT >= current_max) {
current_max = COUNT;
reset_result(input[i][j], input);
}
}
}
print_result(current_max);
}
// Drivers Code
int main()
{
int input[n][m] = { { 1, 4, 4, 4, 4, 3, 3, 1 },
{ 2, 1, 1, 4, 3, 3, 1, 1 },
{ 3, 2, 1, 1, 2, 3, 2, 1 },
{ 3, 3, 2, 1, 2, 2, 2, 2 },
{ 3, 1, 3, 1, 1, 4, 4, 4 },
{ 1, 1, 3, 1, 1, 4, 4, 4 } };
Java
class GFG
{
static final int n = 6;
static final int m = 8;
visited[i][j] = 1;
COUNT++;
}
}
}
// updating result
if (COUNT >= current_max)
{
current_max = COUNT;
reset_result(input[i][j], input);
}
reset_visited();
COUNT = 0;
// updating result
if (COUNT >= current_max)
{
current_max = COUNT;
reset_result(input[i][j], input);
}
}
}
print_result(current_max);
}
// Driver Code
public static void main(String args[])
{
int input[][] = {{1, 4, 4, 4, 4, 3, 3, 1},
{2, 1, 1, 4, 3, 3, 1, 1},
{3, 2, 1, 1, 2, 3, 2, 1},
{3, 3, 2, 1, 2, 2, 2, 2},
{3, 1, 3, 1, 1, 4, 4, 4},
{1, 1, 3, 1, 1, 4, 4, 4}};
Output:
Improved By : tufan_gupta2000
Source
https://www.geeksforgeeks.org/largest-connected-component-on-a-grid/
Chapter 12
Input : N = 3, M = 4
Output : 8
Explanation:
The array is {4, 0, 0}.
In 4 steps the element at index 1 is increased, so the array becomes {4, 4, 0}. In
the next 4 steps the element at index 2 is increased, so the array becomes {4, 4, 4}.
Thus, 4 + 4 = 8 operations are required to make all the array elements equal.
Input : N = 4, M = 4
Output : 9
Explanation:
The steps are shown in the flowchart given below.
Chapter 12. Make array elements equal in Minimum Steps
Approach:
To maximise the number of increments per step, as many unbalances as possible are created
(array[i] > array[i+1]):
Step 1: element 0 > element 1, so element 1 is incremented.
Step 2: element 1 > element 2, so element 2 is incremented by 1.
Step 3: element 0 > element 1 and element 2 > element 3, so elements 1 and 3 are incremented
by 1.
Step 4: element 1 > element 2 and element 3 > element 4, so elements 2 and 4 are incremented.
Step 5: element 0 > element 1, element 2 > element 3, and element 4 > element 5, so elements 1, 3,
and 5 are incremented.
And so on…
Consider the following array,
5 0 0 0 0 0
1) 5 1 0 0 0 0
2) 5 1 1 0 0 0
3) 5 2 1 1 0 0
4) 5 2 2 1 1 0
5) 5 3 2 2 1 1
6) 5 3 3 2 2 1
7) 5 4 3 3 2 2
8) 5 4 4 3 3 2
9) 5 5 4 4 3 3
10) 5 5 5 4 4 3
11) 5 5 5 5 4 4
12) 5 5 5 5 5 4
13) 5 5 5 5 5 5
Notice that after an unbalance is created (i.e. array[i] > array[i+1]) the element gets
incremented by one in alternate steps. In step 1 element 1 gets incremented to 1, in step 2
element 2 gets incremented to 1, in step 3 element 3 gets incremented to 1, so in step N – 1
the (N – 1)th element becomes 1. After that, the (N – 1)th element is increased by 1 on alternate
steps until it reaches the value at element 0. Then the entire array becomes equal.
So the pattern followed by the last element is
(0, 0, 0, …, 0) until the (N – 4)th element becomes 1, which takes N – 4 steps,
and after that,
(0, 0, 1, 1, 2, 2, 3, 3, 4, 4, …, M – 1, M – 1, M), which takes 2 * M + 1 steps.
So the final result becomes (N – 3) + 2 * M.
There are a few corner cases which need to be handled: when N = 1, the array has only
a single element, so the number of steps required is 0, and when N = 2, the number of steps
required equals M.
C++
#include <bits/stdc++.h>
using namespace std;
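// Returns the minimum steps required to make all
// N array elements equal to M
int steps(int N, int M)
{
// corner case 1: single element
if (N == 1)
return 0;
// corner case 2: two elements
else if (N == 2)
return M;
return 2 * M + (N - 3);
}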
// Driver Code
int main()
{
int N = 4, M = 4;
cout << steps(N, M);
return 0;
}
Java
import java.io.*;
class GFG {
return M;
return 2 * M + (N - 3);
}
// Driver Code
public static void main (String[] args)
{
int N = 4, M = 4;
System.out.print( steps(N, M));
}
}
Python3
def steps(N, M):
    if N == 1:  # single element
        return 0
    if N == 2:
        return M
    return 2 * M + (N - 3)
# Driver Code
N = 4
M = 4
print(steps(N, M))
C#
class GFG
{
return 2 * M + (N - 3);
}
// Driver Code
public static void Main ()
{
int N = 4, M = 4;
Console.WriteLine(steps(N, M));
}
}
Output:
9
Source
https://www.geeksforgeeks.org/make-array-elements-equal-in-minimum-steps/
Chapter 13
Parameters Used: This function accepts three mandatory parameters and all of these
parameters are described below.
Return Value: This function returns the modified string or array if matches are found. If
no matches are found in the original string, it returns the original string or
array unchanged.
Note: The ereg_replace() function is case sensitive in PHP. This function was deprecated
in PHP 5.3.0, and removed in PHP 7.0.0.
Examples:
Chapter 13. PHP | ereg_replace() Function
<?php
// Pattern to be searched
$string_pattern = "(.*)any(.*)";
// Replace string
$replace_string = " own yours own \\1biography\\2";
?>
Output:
Note: When an integer value is used as the replacement parameter, the result is not as
expected, because the function interprets the number as the ordinal value of a character.
Program 2:
<?php
// Replace string
$replace_string = 5;
// This function call will not show the expected output as the
// function interprets the number as the ordinal value of a character.
echo ereg_replace('Fifth',$replace_string, $original_string);
// Replace String
$replace_string = '5';
?>
Output:
Reference: http://php.net/manual/en/function.ereg-replace.php
Source
https://www.geeksforgeeks.org/php-ereg_replace-function/
Chapter 14
Examples:
We have discussed an approach in an earlier post which handles substring match as a pattern.
In this post, we use the KMP algorithm’s lps (longest proper prefix which is
also a suffix) construction, which helps in finding the longest match between a prefix of string
b and a suffix of string a. This gives the rotation point; from this point, match
the remaining characters. If all the characters match, then it is a rotation, else not.
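The lps idea can be sketched in Python as follows. This is an illustrative sketch, not the chapter's listing: it computes the prefix function of b + separator + a to find prefixes of b that are suffixes of a, and, as a cautious design choice, it tries every such overlap along the failure chain instead of only the longest one. The separator "\0" is assumed not to occur in either string, and the names prefix_function and is_rotation are hypothetical.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def is_rotation(a, b):
    if len(a) != len(b):
        return False
    if not a:
        return True
    # Assumption: "\0" does not appear in a or b.
    pi = prefix_function(b + "\0" + a)
    k = pi[-1]                      # longest prefix of b that is a suffix of a
    while True:
        # a ends with b[:k]; the rest of b must equal the start of a
        if b[k:] == a[:len(a) - k]:
            return True
        if k == 0:
            return False
        k = pi[k - 1]               # next shorter overlap


print(is_rotation("ABACD", "CDABA"))  # prints True
```

Walking the failure chain can repeat the comparison several times, so this sketch favours simplicity over the tightest running time.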
Below is the basic implementation of the above approach.
Java
Chapter 14. Check if strings are rotations of each other or not | Set 2
i = 0;
// Driver code
public static void main(String[] args)
{
String s1 = "ABACD";
String s2 = "CDABA";
C#
// C# program to check if
// two strings are rotations
// of each other.
using System;
class GFG
{
public static bool isRotation(string a,
string b)
{
int n = a.Length;
int m = b.Length;
if (n != m)
return false;
// lps[0] is always 0
lps[0] = 0;
lps[i] = 0;
++i;
}
else
{
len = lps[len - 1];
}
}
}
i = 0;
// Driver code
public static void Main()
{
string s1 = "ABACD";
string s2 = "CDABA";
Console.WriteLine(isRotation(s1, s2) ?
"1" : "0");
}
}
Output:
Source
https://www.geeksforgeeks.org/check-strings-rotations-not-set-2/
Chapter 15
All strings that end with any of the above-mentioned forms of “the” are not
accepted.
Deterministic finite automaton (DFA) for strings not ending with “THE” –
The initial state in this DFA is Q0.
Chapter 15. DFA for Strings not ending with “THE”
Input : XYzabCthe
Output : NOT ACCEPTED
Input : Themaliktth
Output : ACCEPTED
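The automaton can be sketched compactly in Python. This is a minimal sketch, assuming that the rejected forms of “the” are all upper/lower-case combinations of the three letters; the state numbers 0 to 3 stand for how many characters of “the” have just been matched, and is_accepted is a hypothetical name.

```python
def is_accepted(text):
    state = 0                       # start state Q0
    for ch in text:
        c = ch.lower()              # assumption: case is ignored
        if c == "t":
            state = 1               # a 't' always starts a new candidate
        elif c == "h" and state == 1:
            state = 2
        elif c == "e" and state == 2:
            state = 3               # the input so far ends with "the"
        else:
            state = 0
    return state != 3               # state 3 is the only rejecting state


for s in ("XYzabCthe", "Themaliktth", "forTHEgeeks"):
    print(s, "ACCEPTED" if is_accepted(s) else "NOT ACCEPTED")
```

Run on the three strings above, the sketch reports NOT ACCEPTED, ACCEPTED and ACCEPTED respectively.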
C++
else if (dfa == 1)
state1(str[i]);
else if (dfa == 2)
state2(str[i]);
else
state3(str[i]);
}
// driver code
int main()
{
char str[] = "forTHEgeeks";
if (isAccepted(str) == true)
printf("ACCEPTED\n");
else
printf("NOT ACCEPTED\n");
return 0;
}
PHP
<?php
$dfa = 3;
else
$dfa = 0;
}
function isAccepted($str)
{
global $dfa;
// store length of string
$len = strlen($str);
else if ($dfa == 1)
state1($str[$i]);
else if ($dfa == 2)
state2($str[$i]);
else
state3($str[$i]);
}
// Driver Code
$str = "forTHEgeeks";
if (isAccepted($str) == true)
echo "ACCEPTED\n";
else
echo "NOT ACCEPTED\n";
?>
Output :
ACCEPTED
Source
https://www.geeksforgeeks.org/dfa-for-strings-not-ending-with-the/
Chapter 16
A simple solution is to check every index of s2 one by one. For every index, check if s1 is
present starting at that index.
C++
Chapter 16. Check if a string is substring of another
int N = s2.length();
if (j == M)
return i;
}
return -1;
}
Java
pattern match */
for (j = 0; j < M; j++)
if (s2.charAt(i + j) != s1.charAt(j))
break;
if (j == M)
return i;
}
return -1;
}
if (res == -1)
System.out.println("Not present");
else
System.out.println("Present at index "
+ res);
}
}
Python 3
# Python 3 program to check if
# a string is substring of other.
# Returns the index at which s1 occurs in s2,
# or -1 if s1 is not a substring of s2
def isSubstring(s1, s2):
    M = len(s1)
    N = len(s2)
    # A loop to slide pat[] one by one
    for i in range(N - M + 1):
        # For current index i,
        # check for pattern match
        for j in range(M):
            if s2[i + j] != s1[j]:
                break
        else:
            # no mismatch was found at index i
            return i
    return -1
# Driver Code
if __name__ == "__main__":
    s1 = "for"
    s2 = "geeksforgeeks"
    res = isSubstring(s1, s2)
    if res == -1:
        print("Not present")
    else:
        print("Present at index " + str(res))
# This code is contributed by ChitraNayal
C#
if (j == M)
return i;
}
return -1;
}
if (res == -1)
Console.Write("Not present");
else
Console.Write("Present at index "
+ res);
}
}
PHP
<?php
// PHP program to check if a
// string is substring of other.
// Returns the index at which s1 occurs in s2,
// or -1 if s1 is not a substring of s2
function isSubstring($s1, $s2)
{
$M = strlen($s1);
$N = strlen($s2);
// A loop to slide
// pat[] one by one
for ($i = 0; $i <= $N - $M; $i++)
{
$j = 0;
if ($j == $M)
return $i;
}
return -1;
}
// Driver Code
$s1 = "for";
$s2 = "geeksforgeeks";
Output:
Present at index 5
• Java Substring
• substr in C++
• Python find
Source
https://www.geeksforgeeks.org/check-string-substring-another/
Chapter 17
The idea is to first split the given sentence into words. Then traverse the word list.
For every word in the word list, check if it matches the given word. If yes, replace
the word with asterisks in the list. Finally, merge the words of the list and print.
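These steps can be sketched in a few lines of Python. This is a minimal sketch, assuming that words are separated by whitespace and that only exact, case-sensitive matches are censored; censor is a hypothetical name and the sample sentence is made up for illustration.

```python
def censor(text, word):
    words = text.split()                 # split the sentence into words
    stars = "*" * len(word)              # one asterisk per character
    # replace exact matches only; all other words are kept unchanged
    return " ".join(stars if w == word else w for w in words)


print(censor("GeeksforGeeks is a computer science portal for geeks", "computer"))
# prints: GeeksforGeeks is a ******** science portal for geeks
```

Because split() discards the original spacing, runs of several spaces collapse to a single space in the result.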
Chapter 17. Program to replace a word with asterisks in a sentence
# count variable to
# access our word_list
count = 0
if i == word:
return result
# Driver code
if __name__== '__main__':
Output :
Source
https://www.geeksforgeeks.org/program-censor-word-asterisks-sentence/
Chapter 18
Dynamic Programming |
Wildcard Pattern Matching |
Linear Time and Constant
Space
Given a text and a wildcard pattern, find if wildcard pattern is matched with text. The
matching should cover the entire text (not partial text).
The wildcard pattern can include the characters ‘?’ and ‘*’
‘?’ – matches any single character
‘*’ – Matches any sequence of characters (including the empty sequence)
Prerequisite : Dynamic Programming | Wildcard Pattern Matching
Examples:
Text = "baaabab",
Pattern = "*****ba*****ab", output : true
Pattern = "baaa?ab", output : true
Pattern = "ba*a?", output : true
Pattern = "a*ab", output : false
Each occurrence of ‘?’ in the wildcard pattern can be replaced with any single character,
and each occurrence of ‘*’ with any sequence of characters, such that the wildcard pattern
becomes identical to the input string after replacement.
We have discussed a solution here which has O(m x n) time and O(m x n) space complexity.
Before applying the optimization, we first note the BASE CASE:
If the length of the pattern is zero, the answer is true only if the length of the text
to be matched against the pattern is also zero.
ALGORITHM | (STEP BY STEP)
Step – (1) : Let i be the marker pointing at the current character of the text.
Let j be the marker pointing at the current character of the pattern.
Let index_txt be the marker pointing at the character of the text at which we encountered
‘*’ in the pattern.
Let index_pat be the marker pointing at the position of ‘*’ in the pattern.
NOTE : WE WILL TRAVERSE THE GIVEN STRING AND PATTERN USING A WHILE LOOP
Step – (2) : If at any point txt[i] == pat[j], we increment both i and j, as no other
operation is needed in this case.
Step – (3) : If pat[j] == ‘?’, we proceed as in step – (2), since ‘?’ matches any single
character.
Step – (4) : If pat[j] == ‘*’, we set index_txt = i and index_pat = j, since ‘*’ can match
any sequence of characters (including the empty sequence), and we increment j to compare the
next character of the pattern with the current character of the text (the character at i has
not been matched yet).
Step – (5) : If txt[i] == pat[j] and we have encountered a ‘*’ before, then ‘*’ does not need
to absorb this character and we proceed as in step – (2). Otherwise, if txt[i] != pat[j], the
most recent ‘*’ has to absorb one more character of the text, so we backtrack:
j = index_pat + 1 (the character at j still needs to be matched), i = index_txt + 1 (as ‘*’
can capture other characters as well), and index_txt++ (as one more character of the text is
now matched by ‘*’).
Step – (6) : If step – (5) does not apply, that is, txt[i] != pat[j] and we have not
encountered a ‘*’, the pattern cannot match the string (return false).
Step – (7) : Once the text is exhausted, check whether j has reached its final value and
return the answer accordingly.
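Steps (1) to (7) can be sketched in Python as below. This is a hedged sketch rather than the chapter's C++ listing: the function name wildcard_match is hypothetical, and skipping any trailing ‘*’ characters before the final check is an assumption about how step (7) treats a pattern that ends in ‘*’.

```python
def wildcard_match(txt, pat):
    n, m = len(txt), len(pat)
    # Base case: an empty pattern matches only an empty text.
    if m == 0:
        return n == 0

    i = j = 0                    # step (1): markers into text and pattern
    index_txt = index_pat = -1   # position of the last '*' and the text index then

    while i < n:
        if j < m and pat[j] == '*':
            # Step (4): remember where the '*' was seen, let it match nothing yet.
            index_txt, index_pat = i, j
            j += 1
        elif j < m and (pat[j] == '?' or pat[j] == txt[i]):
            # Steps (2) and (3): direct match or single-character wildcard.
            i += 1
            j += 1
        elif index_pat != -1:
            # Step (5): backtrack, letting the last '*' absorb one more character.
            j = index_pat + 1
            i = index_txt + 1
            index_txt += 1
        else:
            # Step (6): mismatch with no '*' to fall back on.
            return False

    # Assumed handling of trailing '*': they can match the empty sequence.
    while j < m and pat[j] == '*':
        j += 1
    # Step (7): success only if the whole pattern was consumed.
    return j == m


print(wildcard_match("baaabab", "*****ba*****ab"))
```

Only the four markers are stored, so the extra space is constant regardless of the input sizes.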
Let us see the above algorithm in action, then we will move to the coding section:
text = “baaabab”
pattern = “*****ba*****ab”
NOW APPLYING THE ALGORITHM
Step – (1) : i = 0 (i –> ‘b’)
j = 0 (j –> ‘*’)
index_txt = -1
index_pat = -1
NOTE : LOOP WILL RUN TILL i REACHES ITS FINAL
VALUE OR THE ANSWER BECOMES FALSE MIDWAY.
FIRST COMPARISON :-
As we see here that pat[j] == ‘*’, therefore directly jumping on to step – (4).
Step – (4) : index_txt = i (index_txt –> ‘b’)
index_pat = j (index_pat –> ‘*’)
j++ (j –> ‘*’)
After four more comparisons : i = 0 (i –> ‘b’)
j = 5 (j –> ‘b’)
index_txt = 0 (index_txt –> ‘b’)
index_pat = 4 (index_pat –> ‘*’)
SIXTH COMPARISON :-
As we see here that txt[i] == pat[j], but we already encountered ‘*’ therefore using step –
(5).
Step – (5) : i = 1 (i –> ‘a’)
j = 6 (j –> ‘a’)
index_txt = 0 (index_txt –> ‘b’)
index_pat = 4 (index_pat –> ‘*’)
SEVENTH COMPARISON :-
Step – (5) : i = 2 (i –> ‘a’)
j = 7 (j –> ‘*’)
index_txt = 0 (index_txt –> ‘b’)
index_pat = 4 (index_pat –> ‘*’)
EIGHTH COMPARISON :-
Step – (4) : i = 2 (i –> ‘a’)
j = 8 (j –> ‘*’)
index_txt = 2 (index_txt –> ‘a’)
index_pat = 7 (index_pat –> ‘*’)
After four more comparisons : i = 2 (i –> ‘a’)
j = 12 (j –> ‘a’)
index_txt = 2 (index_txt –> ‘a’)
index_pat = 11 (index_pat –> ‘*’)
THIRTEENTH COMPARISON :-
Step – (5) : i = 3 (i –> ‘a’)
j = 13 (j –> ‘b’)
index_txt = 2 (index_txt –> ‘a’)
index_pat = 11 (index_pat –> ‘*’)
FOURTEENTH COMPARISON :-
Step – (5) : i = 3 (i –> ‘a’)
j = 12 (j –> ‘a’)
index_txt = 3 (index_txt –> ‘a’)
index_pat = 11 (index_pat –> ‘*’)
FIFTEENTH COMPARISON :-
Step – (5) : i = 4 (i –> ‘b’)
j = 13 (j –> ‘b’)
index_txt = 3 (index_txt –> ‘a’)
index_pat = 11 (index_pat –> ‘*’)
SIXTEENTH COMPARISON :-
Step – (5) : i = 5 (i –> ‘a’)
j = 14 (j –> end)
index_txt = 3 (index_txt –> ‘a’)
index_pat = 11 (index_pat –> ‘*’)
SEVENTEENTH COMPARISON :-
Step – (5) : i = 4 (i –> ‘b’)
j = 12 (j –> ‘a’)
index_txt = 4 (index_txt –> ‘b’)
index_pat = 11 (index_pat –> ‘*’)
EIGHTEENTH COMPARISON :-
// step-1 :
// initialize markers :
int i = 0, j = 0, index_txt = -1,
index_pat = -1;
while (i < n) {
j++;
}
// Final Check
if (j == m) {
return true;
}
return false;
}
// Driver code
int main()
{
char str[] = "baaabab";
char pattern[] = "*****ba*****ab";
// char pattern[] = "ba*****ab";
if (strmatch(str, pattern,
strlen(str), strlen(pattern)))
cout << "Yes" << endl;
else
cout << "No" << endl;
return 0;
}
Output:
Yes
No
Source
https://www.geeksforgeeks.org/dynamic-programming-wildcard-pattern-matching-linear-time-constant-space/
Chapter 19
Chapter 19. Pattern Searching using C++ library
int main()
{
string txt = "aaaa", pat = "aa";
printOccurrences(txt, pat);
return 0;
}
Output:
Source
https://www.geeksforgeeks.org/pattern-searching-using-c-library/
Chapter 20
Input : aabcdaabc
Output : 4
The string "aabc" is the longest
prefix which is also suffix.
Input : abcab
Output : 2
Input : aaaa
Output : 2
Simple Solution : Since the prefix and suffix are not allowed to overlap, we split the string
at the middle and match the left part against the right part. If they are equal, return the
size of either part; otherwise try shorter lengths on both sides.
Below is an implementation of the above approach.
C++
Chapter 20. Longest prefix which is also suffix
int n = str.length();
if(n < 2) {
return 0;
}
int len = 0;
int i = n/2;
while(i < n) {
if(str[i] == str[len]) {
++len;
++i;
} else {
if(len == 0) { // no prefix
++i;
} else { // search for shorter prefixes
--len;
}
}
}
return len;
// Driver code
int main() {
string s = "blablabla";
return 0;
}
Java
int n = s.length();
if(n < 2) {
return 0;
}
int len = 0;
int i = n/2;
while(i < n) {
if(s.charAt(i) == s.charAt(len)) {
++len;
++i;
}
else
{
if(len == 0) { // no prefix
++i;
}
else
{
// search for shorter prefixes
--len;
}
}
}
return len;
// Driver code
public static void main (String[] args)
{
String s = "blablabla";
System.out.println(longestPrefixSuffix(s));
}
}
Python3
def longestPrefixSuffix(s) :
n = len(s)
if (prefix == suffix) :
return res
s = "blablabla"
print(longestPrefixSuffix(s))
C#
class GFG {
if(n < 2)
return 0;
int len = 0;
int i = n / 2;
while(i < n) {
if(s[i] == s[len]) {
++len;
++i;
}
else {
if(len == 0) {
// no prefix
++i;
}
else {
return len;
}
// Driver code
public static void Main ()
{
String s = "blablabla";
Console.WriteLine(longestPrefixSuffix(s));
}
}
Output:
Efficient Solution : The idea is to use the preprocessing algorithm of KMP search. In the
preprocessing algorithm, we build an lps array which stores the following values.
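The lps-based idea can be sketched in Python as follows. This is an illustrative sketch, not the chapter's own listing: the function name longest_prefix_suffix is hypothetical, and it assumes the standard KMP meaning of lps[i] (the length of the longest proper prefix of s[0..i] that is also a suffix of s[0..i]). To respect the no-overlap requirement, the sketch follows the lps chain down until the length is at most n/2; this is a design choice of the sketch.

```python
def longest_prefix_suffix(s):
    n = len(s)
    if n < 2:
        return 0

    # Standard KMP preprocessing: lps[0] is always 0.
    lps = [0] * n
    length = 0
    i = 1
    while i < n:
        if s[i] == s[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            # Fall back to the next shorter candidate border.
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1

    # lps[n-1] may describe an overlapping prefix/suffix pair, so walk
    # down to shorter borders until the two halves no longer overlap.
    res = lps[n - 1]
    while res > n // 2:
        res = lps[res - 1]
    return res


print(longest_prefix_suffix("aabcdaabc"))
```

Every border of the whole string appears on the lps chain, so the first value not exceeding n/2 is the longest non-overlapping one.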
C++
int lps[n];
lps[0] = 0; // lps[0] is always 0
Java
class GFG
{
// Returns length of the longest prefix
// which is also suffix and the two do
// not overlap. This function is mainly a
// copy of computeLPSArray() from the post below
// https://www.geeksforgeeks.org/searching-
// for-patterns-set-2-kmp-algorithm/
static int longestPrefixSuffix(String s)
{
int n = s.length();
// lps[0] is always 0
lps[0] = 0;
// (pat[i] != pat[len])
else
{
// This is tricky. Consider
// the example. AAACAAAA
// and i = 7. The idea is
// similar to search step.
if (len != 0)
{
len = lps[len-1];
// if (len == 0)
else
{
lps[i] = 0;
i++;
}
}
}
// Driver program
public static void main (String[] args)
{
String s = "abcab";
System.out.println(longestPrefixSuffix(s));
}
}
Python3
else :
# (pat[i] != pat[len])
# This is tricky. Consider
# the example. AAACAAAA
# and i = 7. The idea is
# similar to search step.
if (l != 0) :
l = lps[l-1]
else :
# if (len == 0)
lps[i] = 0
i = i + 1
res = lps[n-1]
return n//2
else :
return res
C#
class GFG {
// lps[0] is always 0
lps[0] = 0;
lps[i] = len;
i++;
}
// (pat[i] != pat[len])
else
{
// if (len == 0)
else
{
lps[i] = 0;
i++;
}
}
}
// Driver program
public static void Main ()
{
string s = "abcab";
Console.WriteLine(longestPrefixSuffix(s));
}
}
PHP
<?php
// Efficient PHP program to find length of
// the longest prefix which is also suffix
$lps[$n] = NULL;
// lps[0] is always 0
$lps[0] = 0;
// (pat[i] != pat[len])
else
{
// if (len == 0)
else
{
$lps[$i] = 0;
$i++;
}
}
}
$res = $lps[$n-1];
// Driver Code
$s = "abcab";
echo longestPrefixSuffix($s);
Output:
Source
https://www.geeksforgeeks.org/longest-prefix-also-suffix/
Chapter 21
Input : 1234
Output : Possible 1
Explanation: String can be split as "1", "2",
"3", "4"
Input : 99100
Output :Possible 99
Explanation: String can be split as "99",
"100"
Input : 101103
Output : Not Possible
Explanation: It is not possible to split this
string under given constraint.
Approach : The idea is to take a substring from index 0 to any index i (i starting from 1)
of the numeric string and convert it to long data type. Add 1 to it and convert the increased
number back to string. Check if the next occurring substring is equal to the increased one.
If yes, then carry on the procedure else increase the value of i and repeat the steps.
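The approach can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name split_consecutive is hypothetical, it returns the first number of the split as a string (or None when no split exists), and it assumes the input consists of decimal digits only. Python integers are arbitrary precision, so no conversion to a long type is needed.

```python
def split_consecutive(s):
    n = len(s)
    # The first number can use at most half of the string,
    # otherwise there is no room for a second number.
    for i in range(1, n // 2 + 1):
        first = s[:i]
        num = int(first)
        pos = i
        ok = True
        while pos < n:
            # Add 1 and convert the increased number back to a string.
            num += 1
            nxt = str(num)
            # Check that the next substring equals the increased number.
            if s.startswith(nxt, pos):
                pos += len(nxt)
            else:
                ok = False
                break
        if ok:
            return first
    return None


print(split_consecutive("99100"))
```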
Java
Chapter 21. Splitting a Numeric String
class GFG {
int flag = 0;
// if s2 is not a substring
// of number then not possible
if (k + l > len) {
flag = 1;
break;
}
else
flag = 1;
}
// Driver Code
public static void main(String args[])
{
Scanner in = new Scanner(System.in);
String str = "99100";
C#
class GFG
{
// Function accepts a
// string and checks if
// string can be split.
static void split(string str)
{
int len = str.Length;
// if there is only 1
// number in the string
// then it is not possible
// to split it
if (len == 1)
{
Console.WriteLine("Not Possible");
return;
}
int flag = 0;
// if s2 is not a substring
// of number then not possible
if (k + l > len)
{
flag = 1;
break;
}
// if s2 is the next
// substring of the
// numeric string
if ((str.Substring(k, l).Equals(s2)))
{
flag = 0;
else
flag = 1;
}
// if conditions
// failed to hold
else if (flag == 1 &&
i > len / 2 - 1)
{
Console.WriteLine("Not Possible");
break;
}
}
}
// Driver Code
static void Main()
{
string str = "99100";
Output:
Possible 99
Improved By : manishshaw1
Source
https://www.geeksforgeeks.org/splitting-numeric-string/
Chapter 22
Input : a ={
{D,D,D,G,D,D},
{B,B,D,E,B,S},
{B,S,K,E,B,K},
{D,D,D,D,D,E},
{D,D,D,D,D,E},
{D,D,D,D,D,G}
}
str= "GEEKS"
Output :2
Input : a = {
{B,B,M,B,B,B},
{C,B,A,B,B,B},
{I,B,G,B,B,B},
{G,B,I,B,B,B},
{A,B,C,B,B,B},
{M,C,I,G,A,M}
}
str= "MAGIC"
Chapter 22. Count of number of given string in 2D character array
Output :3
if (row >= 0 && row <= row_max && col >= 0 &&
col <= col_max && *needle == hay[row][col])
{
hay[row][col] = 0;
if (*needle == 0) {
found = 1;
} else {
hay[row][col] = match;
}
return found;
}
return found;
}
// Driver code
int main(void){
"MCIGAM"
};
char *str[ARRAY_SIZE(input)];
int i;
for (i = 0; i < ARRAY_SIZE(input); ++i) {
str[i] = malloc(strlen(input[i]) + 1); // +1 for the terminating '\0'
strcpy(str[i], input[i]);
}
return 0;
}
Output:
Source
https://www.geeksforgeeks.org/find-count-number-given-string-present-2d-character-array/
Chapter 23
Naive Approach : Shift the second string one position at a time and keep track of the length
of the longest common prefix for each shift. There are n shifts in total, and finding the
length of the common prefix for each shift takes O(n) time. Hence, the overall time complexity
of this approach is O(n^2).
Better Approach : If we append the second string to itself, that is str2 = str2 + str2, there
is no need to find the prefix for each shift separately. After appending str2 to itself, we
only have to find the longest prefix of str1 present in str2, and the starting position of
that prefix in str2 gives the number of shifts required. To find the longest prefix we can use
the KMP pattern search algorithm.
In this way the time complexity reduces to O(n).
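The better approach can be sketched in Python as follows. This is an illustrative sketch, not the chapter's C++ listing: the names build_lps and min_shift_longest_prefix are hypothetical, and returning a (shift, prefix) pair is a choice made for this example.

```python
def build_lps(pat):
    # lps[i]: length of the longest proper prefix of pat[:i+1]
    # that is also a suffix of it (standard KMP preprocessing).
    lps = [0] * len(pat)
    length = 0
    for i in range(1, len(pat)):
        while length > 0 and pat[i] != pat[length]:
            length = lps[length - 1]
        if pat[i] == pat[length]:
            length += 1
        lps[i] = length
    return lps


def min_shift_longest_prefix(str1, str2):
    n = len(str1)
    if n == 0:
        return 0, ""
    lps = build_lps(str1)
    text = str2 + str2          # every shift of str2 is a window of this text
    best_len = best_pos = 0
    j = 0                       # number of characters of str1 currently matched
    for i, ch in enumerate(text):
        # On a full match or a mismatch, fall back along the lps chain.
        while j > 0 and (j == n or ch != str1[j]):
            j = lps[j - 1]
        if ch == str1[j]:
            j += 1
        # Strict '>' keeps the earliest position, i.e. the minimum shift.
        if j > best_len:
            best_len = j
            best_pos = i - j + 1
    return best_pos, str1[:best_len]


print(min_shift_longest_prefix("geeksforgeeks", "forgeeksgeeks"))
```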
Chapter 23. Find minimum shift for longest common prefix
// print result
cout << "Shift = " << pos << endl;
cout << "Prefix = " << str1.substr(0, len);
}
// driver function
int main()
{
string str1 = "geeksforgeeks";
string str2 = "forgeeksgeeks";
int n = str1.size();
str2 = str2 + str2;
KMP(2 * n, n, str2, str1);
return 0;
}
Output:
Shift = 8
Prefix = geeksforgeeks
Source
https://www.geeksforgeeks.org/find-minimum-shift-longest-common-prefix/
Chapter 24
Frequency of a substring in a
string
Input : nn (pattern)
Banana (String)
Output : 0
Input : aa (pattern)
aaaaa (String)
Output : 4
A simple solution is to match characters one by one and increment the count whenever we see
a complete match. Below is a simple solution based on Naive pattern searching.
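The naive counting idea can be sketched in Python as follows. This is a minimal sketch: the function name count_freq is hypothetical, overlapping occurrences are counted (as in the "aa" in "aaaaa" example above), and returning 0 for an empty pattern is an assumption of this sketch.

```python
def count_freq(pat, txt):
    m, n = len(pat), len(txt)
    if m == 0:
        return 0  # assumption: an empty pattern is not counted
    count = 0
    # Slide the pattern over the text one position at a time.
    for i in range(n - m + 1):
        j = 0
        # Match characters one by one at the current alignment.
        while j < m and txt[i + j] == pat[j]:
            j += 1
        # A complete match increments the count.
        if j == m:
            count += 1
    return count


print(count_freq("aa", "aaaaa"))
```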
#include<bits/stdc++.h>
using namespace std;
Output :
// in a text.
class KMP_String_Matching
{
int KMPSearch(String pat, String txt)
{
int M = pat.length();
int N = txt.length();
while (i < N)
{
if (pat.charAt(j) == txt.charAt(i))
{
j++;
i++;
}
if (j == M)
{
// When we find pattern first time,
// we iterate again to check if there
// exists more pattern
j = lps[j-1];
res++;
Output:
Source
https://www.geeksforgeeks.org/frequency-substring-string/
Chapter 25
Count of occurrences of a
“1(0+)1” pattern in a string
Input : 1001010001
Output : 3
First sequence is in between 0th and 3rd index.
Second sequence is in between 3rd and 5th index.
Third sequence is in between 5th and 9th index.
So total number of sequences comes out to be 3.
Input : 1001ab010abc01001
Output : 2
First sequence is in between 0th and 3rd index.
Second valid sequence is in between 13th and 16th
index. So total number of sequences comes out to
be 2.
The idea is to first find a ‘1’, then keep moving forward in the string and check as
described below:
1. If any character other than ‘0’ and ‘1’ is encountered, the pattern is not valid.
We then search for the next ‘1’ from this index and repeat these steps.
2. If a ‘1’ is seen, check for the presence of a ‘0’ at the previous position to confirm
the validity of the sequence.
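The two checks above can be sketched in Python as follows. This is one possible single-pass formulation, not the chapter's own listing: the function name count_pattern and the two state variables are hypothetical.

```python
def count_pattern(s):
    count = 0
    seen_one = False   # True while we are inside a candidate that started with '1'
    zeros = 0          # number of '0' characters since that '1'
    for ch in s:
        if ch == '1':
            # A '1' closes a pattern only if at least one '0' came before it.
            if seen_one and zeros > 0:
                count += 1
            # This '1' may also start the next pattern.
            seen_one = True
            zeros = 0
        elif ch == '0':
            if seen_one:
                zeros += 1
        else:
            # Any other character invalidates the current candidate.
            seen_one = False
            zeros = 0
    return count


print(count_pattern("1001010001"))
```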
C++
return count;
}
Java
Python
return count
# Driver code
s = "100001abc101"
print(countPattern(s))
Output:
Source
https://www.geeksforgeeks.org/count-of-occurrences-of-a-101-pattern-in-a-string/
Chapter 26
Find all the patterns of “1(0+)1” in a given string | SET 2(Regular Expression Approach)
In Set 1, we have discussed a general approach for counting the patterns of the form 1(0+)1,
where (0+) represents any non-empty consecutive sequence of 0’s. In this post, we will discuss
a regular expression approach to count the same.
Examples:
Input : 1101001
Output : 2
Input : 100001abc101
Output : 2
10+1
Hence, whenever we find a match, we increment the counter. As the last character of a match
is always ‘1’, we have to start searching again from that index.
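The restart-from-the-last-‘1’ idea can be sketched in Python with the standard re module as follows. This is an illustrative sketch, not the chapter's Java listing; the function name pattern_count is hypothetical.

```python
import re


def pattern_count(s):
    pattern = re.compile(r"10+1")
    count = 0
    pos = 0
    while True:
        # Search for the next match starting at index pos.
        m = pattern.search(s, pos)
        if m is None:
            break
        count += 1
        # The match ends with '1'; resume from that '1' so it can
        # also serve as the first character of the next match.
        pos = m.end() - 1
    return count


print(pattern_count("1101001"))
```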
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class GFG
{
static int patternCount(String str)
{
// regular expression for the pattern
String regex = "10+1";
// compiling regex
Pattern p = Pattern.compile(regex);
// Matcher object
Matcher m = p.matcher(str);
// counter
int counter = 0;
counter++;
}
return counter;
}
// Driver Method
public static void main (String[] args)
{
String str = "1001ab010abc01001";
System.out.println(patternCount(str));
}
}
Output:
Source
https://www.geeksforgeeks.org/find-patterns-101-given-string-set-2regular-expression-approach/
Chapter 27
Find all the patterns of “1(0+)1” in a given string | SET 1(General Approach)
A string contains patterns of the form 1(0+)1 where (0+) represents any non-empty consecutive
sequence of 0’s. Count all such patterns. The patterns are allowed to overlap.
Note : The string contains digits and lowercase characters only. It is not necessarily
binary. 100201 is not a valid pattern.
One approach to solve the problem is discussed here; another, using regular expressions, is
given in Set 2.
Examples:
Input : 1101001
Output : 2
Input : 100001abc101
Output : 2
C++
int i = 1, counter = 0;
while (i < str.size())
{
/* We found 0 and last character was '1',
state change*/
if (str[i] == '0' && last == '1')
{
while (str[i] == '0')
i++;
return counter;
}
/* Driver Code */
int main()
{
string str = "1001ab010abc01001";
cout << patternCount(str) << endl;
return 0;
}
Java
class GFG
{
// Function to count patterns
static int patternCount(String str)
{
/* Variable to store the last character*/
char last = str.charAt(0);
int i = 1, counter = 0;
while (i < str.length())
{
/* We found 0 and last character was '1',
state change*/
if (str.charAt(i) == '0' && last == '1')
{
while (str.charAt(i) == '0')
i++;
return counter;
}
// Driver Code
public static void main (String[] args)
{
String str = "1001ab010abc01001";
System.out.println(patternCount(str));
}
}
Python3
def patternCount(str):
i = 1; counter = 0
while (i < len(str)):
return counter
# Driver Code
str = "1001ab010abc01001"
ans = patternCount(str)
print (ans)
C#
class GFG
{
int i = 1, counter = 0;
while (i < str.Length)
{
// We found 0 and last
// character was '1',
// state change
if (str[i] == '0' && last == '1')
{
while (str[i] == '0')
i++;
return counter;
}
// Driver Code
public static void Main ()
{
String str = "1001ab010abc01001";
Console.Write(patternCount(str));
}
}
PHP
<?php
// PHP Code to count 1(0+)1 patterns
// in a string
// last character
$last = $str[0];
$i = 1;
$counter = 0;
while ($i < strlen($str))
{
return $counter;
}
// Driver Code
$str = "1001ab010abc01001";
echo patternCount($str) ;
Output :
Source
https://www.geeksforgeeks.org/find-patterns-101-given-string/
Chapter 28
Chapter 28. Boyer Moore Algorithm | Good Suffix heuristic
Explanation: In the above example, a substring t of text T matched with pattern P (in green)
before the mismatch at index 2. We now search for an occurrence of t (“AB”) in P. We find an
occurrence starting at position 1 (in yellow background), so we shift the pattern right 2
times to align t in P with t in T. This is the weak rule of the original Boyer Moore and is
not very effective; we will discuss the Strong Good Suffix rule shortly.
Case 2: A prefix of P, which matches with suffix of t in T
We will not always find an occurrence of t in P. Sometimes there is no occurrence at all; in
such cases we can search for some suffix of t matching some prefix of P and try to align them
by shifting P. For example –
Explanation: In the above example, t (“BAB”) matched with P (in green) at index 2-4 before
the mismatch. But because there exists no occurrence of t in P, we search for some prefix of
P which matches some suffix of t. We find the prefix “AB” (in yellow background) starting at
index 0, which matches not the whole of t but the suffix “AB” of t starting at index 3. So we
shift the pattern 3 times to align the prefix with the suffix.
Case 3: P moves past t
If the above two cases are not satisfied, we shift the pattern past t. For example –
Explanation: In the above example, there exists no occurrence of t (“AB”) in P and there is
also no prefix of P which matches a suffix of t. In that case we can never find a perfect
match before index 4, so we shift P past t, i.e. to index 5.
Strong Good suffix Heuristic
Suppose substring q = P[i to n] matched with t in T and c = P[i-1] is the mismatching
character. Now, unlike case 1, we search for an occurrence of t in P which is not preceded by
character c. The closest such occurrence is then aligned with t in T by shifting pattern P.
For example –
The shift position is obtained by the borders which cannot be extended to the left. Following
is the code for preprocessing –
i--;
j--;
bpos[ i ] = j
But if character # at position i-1 does not match character ? at position j-1, we
continue our search to the right. Now we know that –
a. The border width will be smaller than the border starting at position j, i.e. smaller
than x…φ
b. The border has to begin with # and end with φ, or it could be empty (no border exists).
With the above two facts we continue our search in the substring x…φ from position j to m.
The next border should be at j = bpos[j]. After updating j, we again compare the character at
position j-1 (?) with #; if they are equal we have found our border, otherwise we continue
our search to the right until j>m. This process is shown by the code –
pat[i-1] != pat[j-1]
This is the condition which we discussed in case 2. When the character preceding the
occurrence of t in pattern P is different from the mismatching character in P, we stop
skipping the occurrences and shift the pattern. So here P[i] == P[j] but P[i-1] != P[j-1],
so we shift the pattern from i to j. So shift[j] = j-i is recorded for j. Whenever a mismatch
occurs at position j, we shift the pattern shift[j+1] positions to the right.
In the above code the following condition is very important –
if (shift[j] == 0 )
This condition prevents the shift[j] value from being modified by a suffix having the same
border. For example, consider pattern P = “addbddcdd”. When we calculate bpos[ i-1 ] for
i = 4, then j = 7, and we eventually set shift[ 7 ] = 3. If we later calculate bpos[ i-1 ]
for i = 1, then j = 7 again, and without the test shift[ j ] == 0 we would overwrite
shift[ 7 ] with 6. This means that if we have a mismatch at position 6, we shift pattern P
3 positions to the right, not 6 positions.
2) Preprocessing for Case 2
In the preprocessing for case 2, for each suffix the widest border of the whole pattern
that is contained in that suffix is determined.
The starting position of the widest border of the whole pattern is stored in bpos[0].
In the following preprocessing algorithm, this value bpos[0] is initially stored in all free
entries of array shift. But when the suffix of the pattern becomes shorter than bpos[0], the
algorithm continues with the next-wider border of the pattern, i.e. with bpos[j].
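The two preprocessing passes and the resulting search can be sketched in Python as follows. This is a hedged sketch that mirrors the bpos and shift arrays described above; the function names good_suffix_shifts and search, and returning a list of 0-based match positions, are choices made for this example rather than the chapter's C code.

```python
def good_suffix_shifts(pat):
    m = len(pat)
    bpos = [0] * (m + 1)    # bpos[i]: start of the widest border of pat[i:]
    shift = [0] * (m + 1)   # shift[j+1]: shift after a mismatch at position j

    # Strong good suffix preprocessing.
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        # The border cannot be extended to the left: record the shift
        # (only if still unset) and continue with the next narrower border.
        while j <= m and pat[i - 1] != pat[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j

    # Case 2 preprocessing: fill the free entries with the widest border
    # of the pattern, moving to the next border once the suffix gets shorter.
    j = bpos[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:
            j = bpos[j]
    return shift


def search(text, pat):
    n, m = len(text), len(pat)
    if m == 0 or m > n:
        return []
    shift = good_suffix_shifts(pat)
    found = []
    s = 0                       # current alignment of pat against text
    while s <= n - m:
        j = m - 1
        # Compare from right to left.
        while j >= 0 and pat[j] == text[s + j]:
            j -= 1
        if j < 0:
            found.append(s)
            s += shift[0]
        else:
            s += shift[j + 1]
    return found


print(search("ABAAAABAACD", "ABA"))
```

For the pattern “addbddcdd” discussed above, good_suffix_shifts returns a table whose entry at index 7 is 3, in line with the shift[ j ] == 0 guard.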
Following is the C implementation of the search algorithm –
#include <stdio.h>
#include <string.h>
while(i>0)
{
/*if character at position i-1 is not equivalent to
character at j-1, then continue searching to right
of the pattern for border */
while(j<=m && pat[i-1] != pat[j-1])
{
/* the character preceding the occurrence of t in
pattern P is different from the mismatching character in P,
we stop skipping the occurrences and shift the pattern
from i to j */
if (shift[j]==0)
shift[j] = j-i;
j = bpos[j];
}
}
//do preprocessing
preprocess_strong_suffix(shift, bpos, pat, m);
preprocess_case2(shift, bpos, pat, m);
j = m-1;
//Driver
int main()
{
char text[] = "ABAAAABAACD";
char pat[] = "ABA";
search(text, pat);
return 0;
}
Output:
References
• http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm
Source
https://www.geeksforgeeks.org/boyer-moore-algorithm-good-suffix-heuristic/
Chapter 29
Chapter 29. is_permutation() in C++ and its application for anagram search
else
cout << "False\n";
return 0;
}
Output :
True
False
Application :
Given a pattern and a text, find all occurrences of pattern and its anagrams in text.
Examples:
We have discussed an O(n) solution here. In this post it is done using is_permutation().
Although its complexity is higher than that of the previously discussed method, the purpose
is to explain an application of is_permutation().
Let the size of the pattern to be searched be pat_len. The idea is to traverse the given text
and, for every window of size pat_len, check whether it is a permutation of the given pattern.
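The window-by-window check can be sketched in Python as follows. Python has no is_permutation(), so this sketch substitutes a comparison of character counts via collections.Counter; that substitution and the function name count_anagrams are assumptions of the example, not the chapter's C++ code.

```python
from collections import Counter


def count_anagrams(text, pat):
    pat_len = len(pat)
    if pat_len == 0 or pat_len > len(text):
        return 0
    target = Counter(pat)
    count = 0
    # Check every window of size pat_len in the text.
    for i in range(len(text) - pat_len + 1):
        # Equal character counts mean the window is a permutation of pat.
        if Counter(text[i:i + pat_len]) == target:
            count += 1
    return count


print(count_anagrams("forxxorfxdofr", "for"))
```

Each window is recounted from scratch, which keeps the sketch close to the is_permutation() idea at the cost of extra work per window.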
return count;
}
// Driver code
int main()
{
string str = "forxxorfxdofr";
string pat = "for";
cout << countAnagrams(str, pat) << endl;
return 0;
}
Output:
Source
https://www.geeksforgeeks.org/is_permutation-c-application-anagram-search/
Chapter 30
Match Expression where a single special character in pattern can match one or more characters
Given two strings, one of which is a pattern (Pattern) and the other a searching expression.
The searching expression contains ‘#’.
The # works in the following way:
Examples :
pat = "A#C#"
Output : yes
We can observe that whenever we encounter ‘#’, we have to consume characters of the given
string until the current character of the string equals the next character of the pattern.
First, we check whether the current character of the pattern is ‘#’ –
a) If not, we check whether the current characters of the string and the pattern are the
same. If they are, increment both counters; otherwise return false immediately, as no further
checking is needed.
b) If yes, we have to find the position of a character in the text that matches the next
character of the pattern.
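Checks (a) and (b) can be sketched in Python as follows. This is a sketch under stated assumptions: the function name hash_match is hypothetical, ‘#’ is assumed to consume at least one character and then extend up to the first occurrence of the next pattern character, and a trailing ‘#’ is assumed to consume the rest of the text.

```python
def hash_match(text, pat):
    n, m = len(text), len(pat)
    i = j = 0   # i indexes the pattern, j indexes the text
    while i < m:
        if pat[i] != '#':
            # (a) Ordinary character: it must equal the current text character.
            if j >= n or pat[i] != text[j]:
                return False
            i += 1
            j += 1
        else:
            # (b) '#': assumed to consume at least one character.
            if j >= n:
                return False
            j += 1
            if i + 1 == m:
                # Assumption: a trailing '#' consumes the remaining text.
                j = n
            else:
                # Advance to the text character matching the next pattern character.
                while j < n and text[j] != pat[i + 1]:
                    j += 1
            i += 1
    # The pattern matches only if the whole text was consumed.
    return j == n


print(hash_match("ABABABA", "A#B#A"))
```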
C++
i++;
j++;
}
return (j == lenText);
}
// Driver code
int main()
{
string str = "ABABABA";
string pat = "A#B#A";
if (regexMatch(str, pat))
cout << "yes";
else
cout << "no";
return 0;
}
Java
import java.util.*;
import java.lang.*;
import java.io.*;
class GFG
{
// Returns true if pat
// matches with text.
public static boolean regexMatch
(String text, String pat)
{
int lenText = text.length();
int lenPat = pat.length();
return (j == lenText);
}
// Driver code
public static void main (String[] args)
{
String str = "ABABABA";
String pat = "A#B#A";
if (regexMatch(str, pat))
System.out.println("yes");
else
System.out.println("no");
}
}
C#
class GFG
{
// Returns true if pat
// matches with text.
public static bool regexMatch
(String text, String pat)
{
int lenText = text.Length;
int lenPat = pat.Length;
if (Pat[i] != '#')
{
// If does not match with text.
if (Pat[i] != Text[j])
return false;
return (j == lenText);
}
// Driver code
public static void Main ()
{
String str = "ABABABA";
String pat = "A#B#A";
if (regexMatch(str, pat))
Console.Write("yes");
else
Console.Write("no");
}
}
PHP
<?php
// PHP program for pattern matching
// where a single special character
// can match one or more characters
// If current character of
// pattern is not '#'
if ($pat[$i] != '#')
{
// Driver code
$str = "ABABABA";
$pat = "A#B#A";
if (regexMatch($str, $pat))
echo "yes";
else
echo "no";
Output:
yes
Source
https://www.geeksforgeeks.org/match-expression-where-a-single-special-character-in-pattern-can-match-one-or-mo
Chapter 31
Maximum length prefix of one string that occurs as subsequence in another
Given two strings s and t, the task is to find the maximum length of a prefix of string s
which occurs in string t as a subsequence.
Examples :
Input : s = "digger"
t = "biggerdiagram"
Output : 3
digger
biggerdiagram
Prefix "dig" of s is longest subsequence in t.
Input : s = "geeksforgeeks"
t = "agbcedfeitk"
Output : 4
A simple solution is to consider all prefixes one by one and check whether the current prefix
of s[] is a subsequence of t[] or not. Finally return the length of the largest prefix.
An efficient solution is based on the fact that to find a prefix of length n, we must first
find the prefix of length n – 1 and then look for s[n-1] in t. Similarly, to find a prefix of
length n – 1, we must first find the prefix of length n – 2 and then look for s[n – 2] and so
on.
Thus, we keep a counter which stores the current length of the prefix found. We initialize it
with 0, begin with the first letter of s, and keep iterating over t to find the occurrence
of that first letter. As soon as we encounter the first letter of s we update the counter
and look for the second letter. We keep updating the counter and looking for the next letter,
until either the whole string s is found or there are no more letters in t.
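The counter-based scan can be sketched in Python as follows. This is a minimal sketch; the function name max_prefix is chosen for the example.

```python
def max_prefix(s, t):
    count = 0   # length of the prefix of s found so far as a subsequence of t
    for ch in t:
        # Stop once the whole of s has been found.
        if count == len(s):
            break
        # The next needed letter of s appears in t: extend the prefix.
        if ch == s[count]:
            count += 1
    return count


print(max_prefix("digger", "biggerdiagram"))
```

Note that the early-exit check compares the counter against the length of s, not of t; otherwise s[count] could be read past the end of s.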
Below is the implementation of this approach:
C++
// Iterating string T.
for (int i = 0; i < strlen(t); i++)
{
// If end of string S.
if (count == strlen(s))
break;
// If character match,
// increment counter.
if (t[i] == s[count])
count++;
}
return count;
}
// Driver Code
int main()
{
char S[] = "digger";
char T[] = "biggerdiagram";
return 0;
}
Java
// Iterating string T.
for (int i = 0; i < t.length(); i++)
{
// If end of string S.
if (count == s.length())
break;
// If character match,
// increment counter.
if (t.charAt(i) == s.charAt(count))
count++;
}
return count;
}
// Driver Code
public static void main(String args[])
{
String S = "digger";
String T = "biggerdiagram";
System.out.println(maxPrefix(S, T));
}
}
// This code is contributed by Sumit Ghosh
Python 3
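# Python 3 program to find the maximum length prefix of
# one string that occurs as a subsequence in another.

# Returns the length of the longest prefix of s
# that is a subsequence of t.
def maxPrefix(s, t) :
    count = 0
    # Iterating string T.
    for i in range(0,len(t)) :
        # If end of string S.
        if (count == len(s)) :
            break
        # If character match,
        # increment counter.
        if (t[i] == s[count]) :
            count = count + 1
    return count

# Driver Code
S = "digger"
T = "biggerdiagram"
print(maxPrefix(S, T))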
C#
using System;

class GFG
{
    // Returns the length of the longest prefix of s
    // that is a subsequence of t.
    static int maxPrefix(String s, String t)
    {
        int count = 0;

        // Iterating string T.
        for (int i = 0; i < t.Length; i++)
        {
            // If end of string S.
            if (count == s.Length)
                break;

            // If character match,
            // increment counter.
            if (t[i] == s[count])
                count++;
        }
        return count;
    }

    // Driver Code
    public static void Main()
    {
        String S = "digger";
        String T = "biggerdiagram";
        Console.Write(maxPrefix(S, T));
    }
}
PHP
<?php
// PHP program to find maximum
// length prefix of one string
// that occurs as subsequence in
// another string.

// Returns the length of the longest prefix
// of $s that is a subsequence of $t.
function maxPrefix($s, $t)
{
    $count = 0;

    // Iterating string T.
    for ($i = 0; $i < strlen($t); $i++)
    {
        // If end of string S.
        if ($count == strlen($s))
            break;

        // If character match,
        // increment counter.
        if ($t[$i] == $s[$count])
            $count++;
    }
    return $count;
}

// Driver Code
$S = "digger";
$T = "biggerdiagram";
echo maxPrefix($S, $T);
?>
Output :
3
Source
https://www.geeksforgeeks.org/maximum-length-prefix-one-string-occurs-subsequence-another/
Chapter 32
Replace all occurrences of string AB with C without using extra space
Given a string str that may contain one or more occurrences of “AB”, replace all occurrences
of “AB” with “C” in str.
Examples:
A simple solution is to find all occurrences of “AB”. For every occurrence, replace it with
“C” and move all following characters one position back.
C++
// Driver code
int main()
{
char str[] = "helloABworldABGfG";
translate(str);
printf("The modified string is :\n");
printf("%s", str);
}
Java
class GFG {
str[i - 1] = 'C';
int j;
for (j = i; j < str.length - 1; j++)
str[j] = str[j + 1];
str[j] = ' ';
}
}
return;
}
// Driver code
public static void main(String args[])
{
String st = "helloABworldABGfG";
char str[] = st.toCharArray();
translate(str);
System.out.println("The modified string is :");
System.out.println(str);
}
}
Python3
def translate(st) :
return
# Driver code
st = list("helloABworldABGfG")
translate(st)
C#
class GFG {
}
}
return;
}
// Driver code
public static void Main()
{
String st = "helloABworldABGfG";
Output :
C++
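// Efficient C++ program to replace all occurrences
// of "AB" with "C" in a single traversal
#include <cstdio>
#include <cstring>

void translate(char* str)
{
    int len = strlen(str);
    if (len < 2)
        return;

    int i = 0; // Index in the modified string
    int j = 0; // Index in the original string

    // Traverse string
    while (j < len-1)
    {
        // Replace occurrence of "AB" with "C"
        if (str[j] == 'A' && str[j+1] == 'B')
        {
            // Increment j by 2
            j = j + 2;
            str[i++] = 'C';
            continue;
        }
        str[i++] = str[j++];
    }

    // Copy the last character if it was not consumed
    if (j == len-1)
        str[i++] = str[j];

    // Terminate the shortened string
    str[i] = '\0';
}

// Driver code
int main()
{
    char str[] = "helloABworldABGfG";
    translate(str);
    printf("The modified string is :\n");
    printf("%s", str);
    return 0;
}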
Java
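class GFG {
    // Replaces every "AB" in str with "C" in place and
    // blanks out the unused tail of the array.
    static void translate(char str[])
    {
        int len = str.length;
        if (len < 2)
            return;

        int i = 0; // Index in the modified string
        int j = 0; // Index in the original string

        // Traverse string
        while (j < len - 1)
        {
            // Replace occurrence of "AB" with "C"
            if (str[j] == 'A' && str[j + 1] == 'B')
            {
                // Increment j by 2
                j = j + 2;
                str[i++] = 'C';
                continue;
            }
            str[i++] = str[j++];
        }

        // Copy the last character if it was not consumed
        if (j == len - 1)
            str[i++] = str[j];

        // Blank out the leftover characters
        while (i < len)
            str[i++] = ' ';
    }

    // Driver code
    public static void main(String args[])
    {
        String st="helloABworldABGfG";
        char str[] = st.toCharArray();
        translate(str);
        System.out.println("The modified string is :");
        System.out.println(str);
    }
}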
Python3
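def translate(st) :
    l = len(st)
    if (l < 2) :
        return

    i = 0 # Index in the modified string
    j = 0 # Index in the original string

    # Traverse string
    while (j < l - 1) :
        # Replace occurrence of "AB" with "C"
        if (st[j] == 'A' and st[j + 1] == 'B') :
            # Increment j by 2
            j += 2
            st[i] = 'C'
            i += 1
            continue
        st[i] = st[j]
        i += 1
        j += 1

    # Copy the last character if it was not consumed
    if (j == l - 1) :
        st[i] = st[j]
        i += 1

    # Drop the leftover characters
    del st[i:]

# Driver code
st = list("helloABworldABGfG")
translate(st)
print("The modified string is :")
print(''.join(st))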
C#
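using System;

class GFG {
    static void translate(char []str)
    {
        int len = str.Length;
        if (len < 2)
            return;
        int i = 0; // Index in the modified string
        int j = 0; // Index in the original string
        // Traverse string
        while (j < len - 1)
        {
            // Replace occurrence of "AB" with "C"
            if (str[j] == 'A' && str[j + 1] == 'B')
            {
                j = j + 2;
                str[i++] = 'C';
                continue;
            }
            str[i++] = str[j++];
        }
        if (j == len - 1)
            str[i++] = str[j];
        // Blank out the leftover characters
        while (i < len)
            str[i++] = ' ';
    }
    // Driver code
    public static void Main()
    {
        String st="helloABworldABGfG";
        char []str = st.ToCharArray();
        translate(str);
        Console.WriteLine("The modified string is :");
        Console.Write(str);
    }
}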
Output:
Source
https://www.geeksforgeeks.org/replace-occurrences-string-ab-c-without-using-extra-space/
Chapter 33
Text = "baaabab",
Pattern = "*****ba*****ab", output : true
Pattern = "baaa?ab", output : true
Pattern = "ba*a?", output : true
Pattern = "a*ab", output : false
Chapter 33. Wildcard Pattern Matching
Each occurrence of the ‘?’ character in the wildcard pattern can be replaced with any single
character, and each occurrence of ‘*’ with any sequence of characters (including the empty
sequence), such that the wildcard pattern becomes identical to the input string after replacement.
Let’s consider any character in the pattern.
Case 1: The character is ‘*’
Here two cases arise
1. We can ignore ‘*’ character and move to next character in the Pattern.
2. ‘*’ character matches with one or more characters in Text. Here we will move to next
character in the string.
// both text and pattern are null
T[0][0] = true;
// pattern is null
T[i][0] = false;
// text is null
T[0][j] = T[0][j - 1] if pattern[j - 1] is '*'
DP relation :
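A minimal Python sketch of the table T described above, assuming the standard recurrence for this problem: a ‘*’ either matches nothing (T[i][j - 1]) or absorbs one more text character (T[i - 1][j]), while ‘?’ or an equal character defers to T[i - 1][j - 1]. The function name `wildcard_match` is illustrative and not taken from the article.

```python
def wildcard_match(text, pattern):
    """Return True if pattern (with '?' and '*') matches the whole text."""
    n, m = len(text), len(pattern)
    # T[i][j]: do the first i characters of text match the first j of pattern?
    T = [[False] * (m + 1) for _ in range(n + 1)]
    T[0][0] = True  # empty pattern matches empty text
    for j in range(1, m + 1):
        # only a run of leading '*' can match the empty text
        if pattern[j - 1] == '*':
            T[0][j] = T[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = pattern[j - 1]
            if p == '*':
                # '*' matches nothing, or one more character of text
                T[i][j] = T[i][j - 1] or T[i - 1][j]
            elif p == '?' or p == text[i - 1]:
                # single-character match: fall back to the shorter prefixes
                T[i][j] = T[i - 1][j - 1]
    return T[n][m]


if __name__ == "__main__":
    for pat in ["*****ba*****ab", "baaa?ab", "ba*a?", "a*ab"]:
        print(pat, wildcard_match("baaabab", pat))
```

The four patterns in the driver are the ones listed at the start of the chapter, so the printed results can be compared against them.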
C++
#include <bits/stdc++.h>
using namespace std;
return lookup[n][m];
}
int main()
{
char str[] = "baaabab";
char pattern[] = "*****ba*****ab";
// char pattern[] = "ba*****ab";
// char pattern[] = "ba*ab";
// char pattern[] = "a*ab";
// char pattern[] = "a*****ab";
// char pattern[] = "*a*****ab";
// char pattern[] = "ba*ab****";
// char pattern[] = "****";
// char pattern[] = "*";
// char pattern[] = "aa?ab";
// char pattern[] = "b*b";
// char pattern[] = "a*a";
// char pattern[] = "baaabab";
// char pattern[] = "?baaabab";
// char pattern[] = "*baaaba*";
return 0;
}
Java
return lookup[n][m];
}
}
}
// This code is contributed by Sumit Ghosh
Output :
Yes
Time complexity of above solution is O(m x n). Auxiliary space used is also O(m x n).
Further Improvements:
We can improve space complexity by making use of the fact that we only use the results
from the previous row.
One more improvement is to merge consecutive ‘*’ in the pattern into a single ‘*’, as they mean
the same thing. For example, for the pattern “*****ba*****ab”, if we merge consecutive stars,
the resultant string will be “*ba*ab”. So, the value of m is reduced from 14 to 6.
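The star-merging step can be sketched in a few lines of Python. The helper name `collapse_stars` is hypothetical; it only preprocesses the pattern and leaves the matching itself to whichever routine is used afterwards.

```python
def collapse_stars(pattern):
    """Replace every run of consecutive '*' with a single '*'."""
    out = []
    for ch in pattern:
        # skip a '*' that directly follows another '*'
        if ch == '*' and out and out[-1] == '*':
            continue
        out.append(ch)
    return ''.join(out)


if __name__ == "__main__":
    merged = collapse_stars("*****ba*****ab")
    print(merged, len(merged))
```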
Source
https://www.geeksforgeeks.org/wildcard-pattern-matching/
Chapter 34
Input:
mat[ROW][COL]= { {'B', 'N', 'E', 'Y', 'S'},
{'H', 'E', 'D', 'E', 'S'},
{'S', 'G', 'N', 'D', 'E'}
};
Word = “DES”
Output:
D(1, 2) E(1, 1) S(2, 0)
D(1, 2) E(1, 3) S(0, 4)
D(1, 2) E(1, 3) S(1, 4)
D(2, 3) E(1, 3) S(0, 4)
D(2, 3) E(1, 3) S(1, 4)
D(2, 3) E(2, 4) S(1, 4)
Input:
char mat[ROW][COL] = { {'B', 'N', 'E', 'Y', 'S'},
{'H', 'E', 'D', 'E', 'S'},
{'S', 'G', 'N', 'D', 'E'}};
char word[] ="BNEGSHBN";
Chapter 34. Find all occurrences of a given word in a matrix
Output:
B(0, 0) N(0, 1) E(1, 1) G(2, 1) S(2, 0) H(1, 0)
B(0, 0) N(0, 1)
This is mainly an extension of an earlier word-search post; here the path of matching locations is also printed.
The problem can be easily solved by applying DFS() on each occurrence of the first character
of the word in the matrix. A cell in a 2D matrix can be connected to 8 neighbours. So, unlike
standard DFS(), where we recursively call for all adjacent vertices, here we recursively
call for the 8 neighbours only.
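A minimal Python sketch of this DFS, under two assumptions that are not confirmed by the text: paths are returned as lists of (row, column) pairs instead of being printed, and a cell may be revisited within a path (the second example above, where B(0, 0) and N(0, 1) appear twice, suggests this). The name `find_word_paths` is illustrative.

```python
def find_word_paths(mat, word):
    """Return every path (list of (row, col) cells) that spells word."""
    rows, cols = len(mat), len(mat[0])
    paths = []

    def dfs(r, c, idx, path):
        # (r, c) must lie inside the grid and hold word[idx]
        if not (0 <= r < rows and 0 <= c < cols) or mat[r][c] != word[idx]:
            return
        path.append((r, c))
        if idx == len(word) - 1:
            paths.append(list(path))      # whole word matched: record the path
        else:
            # recurse into the 8 neighbours of the current cell
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr or dc:
                        dfs(r + dr, c + dc, idx + 1, path)
        path.pop()

    # start a DFS from every cell; cells not holding word[0] return at once
    for r in range(rows):
        for c in range(cols):
            dfs(r, c, 0, [])
    return paths


if __name__ == "__main__":
    mat = ["BNEYS", "HEDES", "SGNDE"]
    for p in find_word_paths(mat, "DES"):
        print(p)
```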
#include <bits/stdc++.h>
using namespace std;
#define ROW 3
#define COL 5
return 0;
}
Output :
Source
https://www.geeksforgeeks.org/find-all-occurrences-of-the-word-in-a-matrix/
Chapter 35
Output:
Word his appears from 1 to 3
Word he appears from 4 to 5
Word she appears from 3 to 5
Word hers appears from 4 to 7
If we use a linear time searching algorithm like KMP, then we need to search all words in
text[] one by one. This gives a total time complexity of O(n + length(word[0])) + O(n +
length(word[1])) + O(n + length(word[2])) + … + O(n + length(word[k-1])). This time
complexity can be written as O(n*k + m), where n is the length of text[], k is the number
of words and m is the total number of characters in all words.
Aho-Corasick Algorithm finds all words in O(n + m + z) time where z is total number
of occurrences of words in text. The Aho–Corasick string matching algorithm formed the
basis of the original Unix command fgrep.
1. Preprocessing : Build an automaton of all words in arr[]. The automaton has mainly
three functions:
Chapter 35. Aho-Corasick Algorithm for Pattern Searching
2. Matching : Traverse the given text over built automaton to find all matching words.
Preprocessing:
Go to :
We build Trie. And for all characters which don’t have an edge at root, we add an edge
back to root.
Failure :
For a state s, we find the longest proper suffix which is a proper prefix of some pattern. This
is done using Breadth First Traversal of Trie.
Output :
For a state s, indexes of all words ending at s are stored. These indexes are stored as a bitwise
map (by doing bitwise OR of values). This is also computed using Breadth First Traversal
with Failure.
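The three functions can be sketched compactly in Python. This is an illustrative sketch, not the chapter's implementation: it assumes dictionaries for the Go to edges and Python sets for the Output function (in place of a fixed-size table and a bitwise map), and the names `build_automaton` and `search_words` are chosen here for illustration.

```python
from collections import deque


def build_automaton(words):
    """Build the Go to, Failure and Output functions for the given words."""
    goto = [{}]      # goto[state][char] -> next state (trie edges)
    out = [set()]    # out[state] -> indexes of words recognised at state
    for idx, word in enumerate(words):
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(idx)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:                      # breadth first traversal of the trie
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]   # inherit words ending at the failure state
            queue.append(nxt)
    return goto, fail, out


def search_words(words, text):
    """Return (word, start, end) for every occurrence of a word in text."""
    goto, fail, out = build_automaton(words)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]          # follow failure links on a mismatch
        state = goto[state].get(ch, 0)   # missing edge at the root stays at the root
        for idx in sorted(out[state]):
            matches.append((words[idx], i - len(words[idx]) + 1, i))
    return matches


if __name__ == "__main__":
    for word, start, end in search_words(["he", "she", "hers", "his"], "ahishers"):
        print("Word", word, "appears from", start, "to", end)
```

The driver uses the words from the example output at the start of the chapter; the text "ahishers" is an assumption chosen to be consistent with those positions.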
Below is C++ implementation of Aho-Corasick Algorithm
currentState = g[currentState][ch];
}
failure = g[failure][ch];
f[g[state][ch]] = failure;
return states;
}
// Returns the next state the machine will transition to using goto
// and failure functions.
// currentState - The current state of the machine. Must be between
// 0 and the number of states - 1, inclusive.
// nextInput - The next character that enters into the machine.
int findNextState(int currentState, char nextInput)
{
int answer = currentState;
int ch = nextInput - 'a';
return g[answer][ch];
}
searchWords(arr, k, text);
return 0;
}
Output:
Source:
http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
This article is contributed by Ayush Govil. Please write comments if you find anything
incorrect, or you want to share more information about the topic discussed above
Improved By : PawelWolowiec
Source
https://www.geeksforgeeks.org/aho-corasick-algorithm-pattern-searching/
Chapter 36
Kasai’s Algorithm for Construction of LCP array from Suffix Array
Background
Suffix Array : A suffix array is a sorted array of all suffixes of a given string.
Let the given string be “banana”.
0 banana 5 a
1 anana Sort the Suffixes 3 ana
2 nana ----------------> 1 anana
3 ana alphabetically 0 banana
4 na 4 na
5 a 2 nana
LCP Array is an array of size n (like Suffix Array). A value lcp[i] indicates the
length of the longest common prefix of the suffixes indexed by suffix[i] and
suffix[i+1]. lcp[n-1] is not defined, as there is no suffix after suffix[n-1]; it is stored as 0.
txt[0..n-1] = "banana"
suffix[] = {5, 3, 1, 0, 4, 2}
lcp[] = {1, 3, 0, 0, 2, 0}
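A minimal Python sketch of Kasai's algorithm for the definition above. The suffix array is built naively by sorting suffixes, which is an assumption made only to keep the example short; the function names are illustrative.

```python
def build_suffix_array(txt):
    # Naive construction by sorting the suffixes; fine for a small example
    return sorted(range(len(txt)), key=lambda i: txt[i:])


def kasai(txt, suffix_arr):
    """Return lcp[], where lcp[i] is the LCP of suffix_arr[i] and suffix_arr[i+1]."""
    n = len(txt)
    lcp = [0] * n
    # rank[i] = position of the suffix starting at i within suffix_arr
    rank = [0] * n
    for pos, start in enumerate(suffix_arr):
        rank[start] = pos
    k = 0  # LCP length carried over from the previous suffix in text order
    for i in range(n):
        if rank[i] == n - 1:
            k = 0          # last suffix in sorted order has no successor
            continue
        j = suffix_arr[rank[i] + 1]   # next suffix in sorted order
        while i + k < n and j + k < n and txt[i + k] == txt[j + k]:
            k += 1
        lcp[rank[i]] = k
        if k > 0:
            k -= 1         # the next suffix shares at least k - 1 characters
    return lcp


if __name__ == "__main__":
    sa = build_suffix_array("banana")
    print(sa)
    print(kasai("banana", sa))
```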
{
// If first rank and next ranks are same as that of previous
// suffix in array, assign the same new rank to this suffix
if (suffixes[i].rank[0] == prev_rank &&
suffixes[i].rank[1] == suffixes[i-1].rank[1])
{
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = rank;
}
else // Otherwise increment rank and assign
{
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = ++rank;
}
ind[suffixes[i].index] = i;
}
// Driver program
int main()
{
string str = "banana";
Output:
Suffix Array :
5 3 1 0 4 2
LCP Array :
1 3 0 0 2 0
Illustration:
Next we compute the LCP of the second suffix, which is “anana”. The next suffix after “anana” in
the suffix array is “banana”. Since there is no common prefix, the LCP value for “anana” is 0;
it is at index 2 in the suffix array, so we fill lcp[2] as 0.
Next we compute the LCP of the third suffix, which is “nana”. Since there is no next suffix, the
LCP value for “nana” is not defined. We fill lcp[5] as 0.
The next suffix in the text is “ana”. The next suffix after “ana” in the suffix array is “anana”.
Since there is a common prefix of length 3, the LCP value for “ana” is 3. We fill lcp[1] as 3.
Now we compute the LCP for the next suffix in the text, which is “na”. This is where Kasai’s
algorithm uses the trick that the LCP value must be at least 2, because the previous LCP value
was 3. Since there is no character after “na”, the final LCP value is 2. We fill lcp[4] as 2.
The next suffix in the text is “a”. The LCP value must be at least 1, because the previous value
was 2. Since there is no character after “a”, the final LCP value is 1. We fill lcp[0] as 1.
We will soon be discussing implementation of search with the help of LCP array and how
LCP array helps in reducing time complexity to O(m + Log n).
References:
http://web.stanford.edu/class/cs97si/suffix-array.pdf
http://www.mi.fu-berlin.de/wiki/pub/ABI/RnaSeqP4/suffix-array.pdf
http://codeforces.com/blog/entry/12796
This article is contributed by Prakhar Agrawal. Please write comments if you find any-
thing incorrect, or you want to share more information about the topic discussed above
Source
https://www.geeksforgeeks.org/%c2%ad%c2%adkasais-algorithm-for-construction-of-lcp-array-from-suffix-array/
Chapter 37
Chapter 37. Search a Word in a 2D Grid of characters
Below diagram shows a bigger grid and presence of different words in it.
// Driver program
int main()
{
char grid[R][C] = {"GEEKSFORGEEKS",
"GEEKSQUIZGEEK",
"IDEQAPRACTICE"
};
patternSearch(grid, "GEEKS");
cout << endl;
patternSearch(grid, "EEE");
return 0;
}
Output:
pattern found at 0, 0
pattern found at 0, 8
pattern found at 1, 0
pattern found at 0, 2
pattern found at 0, 10
pattern found at 2, 2
pattern found at 2, 12
Exercise: The above solution only prints the locations of the word. Extend it to print the
direction in which the word is present.
See this for solution of exercise.
This article is contributed by Utkarsh Trivedi. Please write comments if you find anything
incorrect, or you want to share more information about the topic discussed above
Source
https://www.geeksforgeeks.org/search-a-word-in-a-2d-grid-of-characters/
Chapter 38
Example:
Index     0  1  2  3  4  5  6  7  8  9  10 11
Text      a  a  b  c  a  a  b  x  a  a  a  z
Z values  X  1  0  0  3  1  0  0  2  2  1  0
More Examples:
str = "aaaaaa"
Z[] = {x, 5, 4, 3, 2, 1}
str = "aabaacd"
Z[] = {x, 1, 0, 2, 1, 0, 0}
str = "abababab"
Z[] = {x, 0, 6, 0, 4, 0, 2, 0}
Chapter 38. Z algorithm (Linear time pattern searching Algorithm)
Example:
Pattern P = "aab", Text T = "baabaa"
The idea is to maintain an interval [L, R] which is the interval with max R
such that [L,R] is prefix substring (substring which is also prefix).
1) If i > R then there is no prefix substring that starts before i and
ends after i, so we reset L and R and compute new [L,R] by comparing
str[0..] to str[i..] and get Z[i] (= R-L+1).
2) If i <= R then let K = i-L, now Z[i] >= min(Z[K], R-i+1) because
str[i..] matches with str[K..] for at least R-i+1 characters (they are in
[L,R] interval which we know is a prefix substring).
Now two sub cases arise –
a) If Z[K] < R-i+1 then there is no longer prefix substring starting at
str[i] (otherwise Z[K] would be larger) so Z[i] = Z[K] and
interval [L,R] remains same.
b) If Z[K] >= R-i+1 then it is possible to extend the [L,R] interval
thus we will set L as i and start matching from str[R] onwards and
get new R then we will update interval [L,R] and calculate Z[i] (=R-L+1).
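The steps above can be sketched in Python as follows. This is a compact variant that folds both cases into one loop: it seeds Z[i] with min(Z[K], R-i+1) when i <= R and then extends by direct comparison. The separator character "$" and the function names are assumptions for illustration; the separator must not occur in the pattern or the text.

```python
def z_array(s):
    """Return the Z array of s; z[0] is left as 0 (the 'x' entry in the examples)."""
    n = len(s)
    z = [0] * n
    left = right = 0   # [left, right] is the prefix substring with the largest right end
    for i in range(1, n):
        if i <= right:
            # reuse the value computed for the mirrored position K = i - left
            z[i] = min(z[i - left], right - i + 1)
        # extend the match by direct comparison
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > right:
            left, right = i, i + z[i] - 1
    return z


def search(text, pattern, sep="$"):
    """Return the start indexes of pattern in text using pattern + sep + text."""
    concat = pattern + sep + text
    z = z_array(concat)
    m = len(pattern)
    return [i - m - 1 for i in range(m + 1, len(concat)) if z[i] == m]


if __name__ == "__main__":
    print(z_array("aabcaabxaaaz"))
    print(search("baabaa", "aab"))
```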
For better understanding of above step by step procedure please check this animation –
http://www.utdallas.edu/~besp/demo/John2010/z-algorithm.htm
The algorithm runs in linear time because we never compare characters at positions less than R,
and every successful match increases R by one, so there are at most T matching comparisons,
where T is the length of the string. A mismatch happens only once for each i (it is what stops
R), which adds at most T more comparisons, making the overall complexity linear.
Below is the implementation of Z algorithm for pattern searching.
C++
// Construct Z array
int Z[l];
getZarr(concat, Z);
if (i > R)
{
L = R = i;
// Driver program
int main()
{
string text = "GEEKS FOR GEEKS";
string pattern = "GEEK";
search(text, pattern);
return 0;
}
Java
int l = concat.length();
// Construct Z array
getZarr(concat, Z);
if(Z[i] == pattern.length()){
System.out.println("Pattern found at index "
+ (i - pattern.length() - 1));
}
}
}
int n = str.length();
L = R = i;
Z[i] = R - L;
R--;
}
else{
Z[i] = R - L;
R--;
}
}
}
}
// Driver program
public static void main(String[] args)
{
String text = "GEEKS FOR GEEKS";
String pattern = "GEEK";
search(text, pattern);
}
}
Output:
This article is contributed by Utkarsh Trivedi. Please write comments if you find anything
incorrect, or you want to share more information about the topic discussed above
Improved By : pkoli
Source
https://www.geeksforgeeks.org/z-algorithm-linear-time-pattern-searching-algorithm/
Chapter 39
Let input string be str[0..n-1]. A Simple Solution is to do following for every character
str[i] in input string. Check if substring str[0…i] is palindrome, then print yes, else print no.
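The simple solution can be sketched in Python as below. The function name is illustrative, and the result is returned as a list of (character, is_palindrome) pairs instead of being printed, which is an assumption made to keep the sketch testable. Each prefix check costs O(i), so the whole run is O(n*n).

```python
def check_palindromes_naive(s):
    """For each i, report whether the prefix s[0..i] is a palindrome."""
    results = []
    for i in range(len(s)):
        prefix = s[:i + 1]
        # compare the prefix with its reverse
        results.append((s[i], prefix == prefix[::-1]))
    return results


if __name__ == "__main__":
    for ch, ok in check_palindromes_naive("aabaacaabaa"):
        print(ch, "Yes" if ok else "No")
```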
Chapter 39. Online algorithm for checking palindrome in a stream
A Better Solution is to use the idea of Rolling Hash used in Rabin Karp algorithm. The
idea is to keep track of reverse of first half and second half (we also use first half and reverse
of second half) for every index. Below is complete algorithm.
i=4
a) ‘firstr’ and ‘second’ match, compare the whole strings, they match, so print yes
b) We don’t need to calculate next hash values as this is last index
The idea of using rolling hashes is, next hash value can be calculated from previous in O(1)
time by just doing some constant number of arithmetic operations.
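A hedged Python 3 sketch of the rolling-hash idea: it keeps a hash of the reverse of the first half and a hash of the second half, updates both in O(1) per character, and falls back to a full comparison only when the two hashes agree. The base d = 256 and modulus q = 103 are assumed values for illustration, and the function returns a list of booleans instead of printing.

```python
def check_palindromes(s, d=256, q=103):
    """Return a list whose i-th entry tells whether s[0..i] is a palindrome."""
    n = len(s)
    if n == 0:
        return []
    result = [True]                 # a single character is always a palindrome
    if n == 1:
        return result
    first_rev = ord(s[0]) % q       # hash of the reverse of the first half
    second = ord(s[1]) % q          # hash of the second half
    h = 1                           # d^(half length - 1) mod q
    for i in range(1, n):
        if first_rev == second:
            # hashes agree: confirm with a real comparison
            prefix = s[:i + 1]
            result.append(prefix == prefix[::-1])
        else:
            result.append(False)
        if i != n - 1:
            if i % 2 == 0:
                # odd length -> even length: the old middle joins the first half
                h = (h * d) % q
                first_rev = (first_rev + h * ord(s[i // 2])) % q
                second = (second * d + ord(s[i + 1])) % q
            else:
                # even length -> odd length: the second half drops its leading
                # character (the new middle) and appends s[i + 1]
                lead = ord(s[(i + 1) // 2])
                second = (d * (second - lead * h) + ord(s[i + 1])) % q
    return result


if __name__ == "__main__":
    txt = "aabaacaabaa"
    for ch, ok in zip(txt, check_palindromes(txt)):
        print(ch, "Yes" if ok else "No")
```

Because a hash match is always confirmed by a direct comparison, a collision can cost time but cannot produce a wrong answer.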
Below are the implementations of above approach.
C/C++
int h = 1, i, j;
Java
int h = 1, i, j;
if (str.charAt(j) != str.charAt(i
- j))
break;
}
System.out.println((j == i/2) ?
str.charAt(i) + " Yes": str.charAt(i)+
" No");
}
else System.out.println(str.charAt(i)+ " No");
checkPalindromes(txt);
}
}
// This code is contributed by Sumit Ghosh
Python
def checkPalindromes(string):
h = 1
i = 0
j = 0
for j in xrange(0,i/2):
if string[j] != string[i-j]:
break
j += 1
if j == i/2:
print string[i] + " Yes"
else:
print string[i] + " No"
else:
print string[i] + " No"
# Driver program
txt = "aabaacaabaa"
checkPalindromes(txt)
# This code is contributed by Bhavya Jain
Output:
a Yes
a Yes
b No
a No
a Yes
c No
a No
a No
b No
a No
a Yes
The worst case time complexity of the above solution remains O(n*n), but in general, it
works much better than simple approach as we avoid complete substring comparison most
of the time by first comparing hash values. The worst case occurs for input strings with all
same characters like “aaaaaa”.
This article is contributed by Ajay. Please write comments if you find anything incorrect,
or you want to share more information about the topic discussed above.
Source
https://www.geeksforgeeks.org/online-algorithm-for-checking-palindrome-in-a-stream/
Chapter 40
Can you see why we say that LCS in R and S must be from same position in S ?
Let’s look at following examples:
• For S = xababayz and R = zyababax, LCS and LPS both are ababa (SAME)
• For S = abacdfgdcaba and R = abacdgfdcaba, LCS is abacd and LPS is aba (DIFFERENT)
• For S = pqrqpabcdfgdcba and R = abcdgfdcbapqrqp, LCS and LPS both are pqrqp (SAME)
• For S = pqqpabcdfghfdcba and R = abcdfhgfdcbapqqp, LCS is abcdf and LPS is pqqp (DIFFERENT)
We can see that LCS and LPS are not always the same. When are they different?
When S contains a reversed copy of a non-palindromic substring that is as long as or longer
than the LPS of S, the LCS and LPS will be different.
Chapter 40. Suffix Tree Application 6 – Longest Palindromic Substring
In the 2nd example above (S = abacdfgdcaba), the substring abacd has a reversed copy
dcaba in S, which is longer than the LPS aba, so LPS and LCS are different here.
The same holds for the 4th example.
To handle this scenario we say that LPS in S is same as LCS in S and R given that LCS
in R and S must be from same position in S.
If we look at 2nd example again, substring aba in R comes from exactly same position in S
as substring aba in S which is ZERO (0th index) and so this is LPS.
The Position Constraint:
We will refer string S index as forward index (Si ) and string R index as reverse index (Ri ).
Based on above figure, a character with index i (forward index) in a string S of length N,
will be at index N-1-i (reverse index) in its reversed string R.
If we take a substring of length L in string S with starting index i and ending index j (j =
i+L-1), then in its reversed string R, the reversed substring of the same will start at index
N-1-j and will end at index N-1-i.
If there is a common substring of length L at indices Si (forward index) and Ri (reverse
index) in S and R, then these will come from same position in S if Ri = (N – 1) – (Si +
L – 1) where N is string length.
So to find LPS of string S, we find longest common string of S and R where both substrings
satisfy above constraint, i.e. if substring in S is at index Si , then same substring should be
in R at index (N – 1) – (Si + L – 1). If this is not the case, then this substring is not LPS
candidate.
Naive [O(N*M^2)] and Dynamic Programming [O(N*M)] approaches to find the LCS of two
strings have already been discussed; they can be extended with the position constraint to
give the LPS of a given string.
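As an illustration of the position constraint, here is a hedged Python sketch of the dynamic programming extension (not the suffix tree approach). It computes common-suffix lengths of S and R row by row and accepts a candidate only when Ri = (N – 1) – (Si + L – 1) holds; in terms of the 1-based end positions i in S and j in R this means L = i + j – N. The function name is an assumption, and the running time is O(N*N).

```python
def lps_by_position_constraint(s):
    """Longest palindromic substring via LCS of s and reverse(s) with the position check."""
    n = len(s)
    r = s[::-1]
    prev = [0] * (n + 1)   # prev[j]: longest common suffix of s[:i-1] and r[:j]
    best_len, best_start = 0, 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if s[i - 1] == r[j - 1]:
                cur[j] = prev[j - 1] + 1
            # A substring of length L ending at s[i-1] starts at Si = i - L and
            # must start in r at Ri = (n - 1) - (Si + L - 1) = n - i, so it ends
            # at j = n - i + L. Hence the only admissible length is L = i + j - n.
            length = i + j - n
            if 0 < length <= cur[j] and length > best_len:
                best_len, best_start = length, i - length
        prev = cur
    return s[best_start:best_start + best_len]


if __name__ == "__main__":
    for text in ["xababayz", "abacdfgdcaba", "pqrqpabcdfgdcba", "pqqpabcdfghfdcba"]:
        print(text, "->", lps_by_position_constraint(text))
```

For abacdfgdcaba the common substring abacd is rejected because its copy in R does not sit at the required reverse index, while aba passes the check.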
Now we will discuss suffix tree approach which is nothing but an extension to Suffix Tree
LCS approach where we will add the position constraint.
While finding LCS of two strings X and Y, we just take deepest node marked as XY (i.e.
the node which has suffixes from both strings as its children).
While finding LPS of string S, we will again find LCS of S and R with a condition that
the common substring should satisfy the position constraint (the common substring should
come from same position in S). To verify position constraint, we need to know all forward
and reverse indices on each internal node (i.e. the suffix indices of all leaf children below
the internal nodes).
In Generalized Suffix Tree of S#R$, a substring on the path from root to an internal node is
a common substring if the internal node has suffixes from both strings S and R. The index
of the common substring in S and R can be found by looking at suffix index at respective
leaf node.
If string S# is of length N then:
• If the suffix index of a leaf is less than N, then that suffix belongs to S, and the same
suffix index becomes a forward index of all its ancestor nodes
• If the suffix index of a leaf is greater than or equal to N, then that suffix belongs to R,
and the reverse index for all its ancestor nodes is (suffix index – N)
Let’s take string S = cabbaabb. The figure below is the Generalized Suffix Tree for
cabbaabb#bbaabbac$ where we have shown forward and reverse indices of all children suffixes
on all internal nodes (except root).
Forward indices are in Parentheses () and reverse indices are in square bracket [].
In above figure, all leaf nodes will have one forward or reverse index depending on which
string (S or R) they belong to. Then children’s forward or reverse indices propagate to the
parent.
Look at the figure to understand what would be the forward or reverse index on a leaf with
a given suffix index. At the bottom of figure, it is shown that leaves with suffix indices from
0 to 8 will get same values (0 to 8) as their forward index in S and leaves with suffix indices
9 to 17 will get reverse index in R from 0 to 8.
For example, the highlighted internal node has two children with suffix indices 2 and 9. Leaf
with suffix index 2 is from position 2 in S and so its forward index is 2 and shown in ().
Leaf with suffix index 9 is from position 0 in R and so its reverse index is 0 and shown in [].
These indices propagate to parent and the parent has one leaf with suffix index 14 for which
reverse index is 4. So on this parent node forward index is (2) and reverse index is [0,4].
In the same way, we should be able to understand how forward and reverse indices are
calculated on all nodes.
In above figure, all internal nodes have suffixes from both strings S and R, i.e. all of them
represent a common substring on the path from root to themselves. Now we need to find
deepest node satisfying position constraint. For this, we need to check if there is a forward
index Si on a node, then there must be a reverse index Ri with value (N – 2) – (Si + L –
1) where N is length of string S# and L is node depth (or substring length). If yes, then
consider this node as a LPS candidate, else ignore it. In above figure, deepest node is
highlighted which represents LPS as bbaabb.
We have not shown forward and reverse indices on root node in figure. Because root node
itself doesn’t represent any common substring (In code implementation also, forward and
reverse indices will not be calculated on root node)
How to implement this approach to find LPS? Here are the things that we need:
• If we store indices in array, it will require linear search which will make overall approach
non-linear in time.
• If we store indices in tree (set in C++, TreeSet in Java), we may use binary search
but still overall approach will be non-linear in time.
• If we store indices in hash function based set (unordered_set in C++, HashSet in
Java), it will provide a constant search on average and this will make overall approach
linear in time. A hash function based set may take more space depending on values
being stored.
We will use two unordered_set (one for forward and the other for reverse indices) in our
implementation, added as member variables in the SuffixTreeNode structure.
#include <iostream>
#include <unordered_set>
#define MAX_CHAR 256
using namespace std;
struct SuffixTreeNode {
struct SuffixTreeNode *children[MAX_CHAR];
if (activeLength == 0)
activeEdge = pos; //APCFALZ
next->start += activeLength;
split->children[text[next->start]] = next;
//Print the suffix tree as well along with setting suffix index
//So tree will be printed in DFS manner
//Each edge along with it's suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
if (n == NULL) return;
*(n->end) = i;
}
}
n->suffixIndex = size - labelHeight;
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
size = strlen(text);
int i;
rootEnd = (int*) malloc(sizeof(int));
*rootEnd = - 1;
extendSuffixTree(i);
int labelHeight = 0;
setSuffixIndexByDFS(root, labelHeight);
}
}
}
}
void getLongestPalindromicSubstring()
{
int maxHeight = 0;
int substringStartIndex = 0;
doTraversal(root, 0, &maxHeight, &substringStartIndex);
int k;
for (k=0; k<maxHeight; k++)
printf("%c", text[k + substringStartIndex]);
if(k == 0)
printf("No palindromic substring");
else
printf(", of length: %d",maxHeight);
printf("\n");
}
size1 = 17;
printf("Longest Palindromic Substring in forgeeksskeegfor is: ");
strcpy(text, "forgeeksskeegfor#rofgeeksskeegrof$"); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 6;
printf("Longest Palindromic Substring in abcde is: ");
strcpy(text, "abcde#edcba$"); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 7;
printf("Longest Palindromic Substring in abcdae is: ");
strcpy(text, "abcdae#eadcba$"); buildSuffixTree();
getLongestPalindromicSubstring();
size1 = 6;
printf("Longest Palindromic Substring in abacd is: ");
strcpy(text, "abacd#dcaba$"); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 6;
printf("Longest Palindromic Substring in abcdc is: ");
strcpy(text, "abcdc#cdcba$"); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 13;
printf("Longest Palindromic Substring in abacdfgdcaba is: ");
strcpy(text, "abacdfgdcaba#abacdgfdcaba$"); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 15;
printf("Longest Palindromic Substring in xyabacdfgdcaba is: ");
strcpy(text, "xyabacdfgdcaba#abacdgfdcabayx$"); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 9;
printf("Longest Palindromic Substring in xababayz is: ");
strcpy(text, "xababayz#zyababax$"); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 6;
printf("Longest Palindromic Substring in xabax is: ");
strcpy(text, "xabax#xabax$"); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
return 0;
}
Output:
Followup:
Detect ALL palindromes in a given string.
e.g. For string abcddcbefgf, all possible palindromes are a, b, c, d, e, f, g, dd, fgf, cddc,
bcddcb.
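As a rough cross-check for the followup above, the distinct palindromes of a short string can be enumerated by expanding around every center. This is a quadratic sketch and not a suffix tree solution; the function name `all_palindromes` is an assumption made for illustration.

```python
def all_palindromes(s):
    """Return the set of distinct palindromic substrings of s (quadratic time)."""
    found = set()
    n = len(s)
    # 2*n - 1 centers: every character and every gap between two characters.
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2
        # Grow outwards while the two ends match.
        while left >= 0 and right < n and s[left] == s[right]:
            found.add(s[left:right + 1])
            left -= 1
            right += 1
    return found


print(sorted(all_palindromes("abcddcbefgf"), key=lambda p: (len(p), p)))
```

For the example string this prints the same eleven palindromes listed above, ordered by length.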
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/suffix-tree-application-6-longest-palindromic-substring/
Chapter 41
Looking at all four cases, we see that we first set L[i] to the minimum of L[iMirror] and R-i,
and then try to expand the palindrome in whichever case it can expand.
This observation may be more intuitive and easier to understand and implement, provided
that one understands the LPS length array, position, index, symmetry property etc.
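The observation above can be sketched in Python as follows. This is a minimal illustration that works on positions of the given string, as the article does; the function name and the variables `best_len` and `best_center` are assumptions, and the bounds check is written so that no character index past the end of the string is read.

```python
def longest_palindromic_substring(text):
    n = len(text)
    if n == 0:
        return ""
    N = 2 * n + 1              # position count
    L = [0] * N                # LPS length array
    L[1] = 1
    C, R = 1, 2                # centerPosition, centerRightPosition
    best_len, best_center = 1, 1
    for i in range(2, N):
        diff = R - i
        # Inside the center palindrome: start from min(L[iMirror], R - i).
        L[i] = min(L[2 * C - i], diff) if diff > 0 else 0
        # Expand. Even positions need no comparison; odd positions
        # compare the two characters they represent.
        while (i + L[i] + 1 < N and i - L[i] > 0 and
               ((i + L[i] + 1) % 2 == 0 or
                text[(i + L[i] + 1) // 2] == text[(i - L[i] - 1) // 2])):
            L[i] += 1
        if L[i] > best_len:
            best_len, best_center = L[i], i
        # The palindrome at i reaches beyond R: it becomes the new center.
        if i + L[i] > R:
            C, R = i, i + L[i]
    start = (best_center - best_len) // 2
    return text[start:start + best_len]


print(longest_palindromic_substring("babcbabcbaccba"))
```

When several palindromes share the maximum length, this sketch returns the leftmost one.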
Chapter 41. Manacher’s Algorithm – Linear Time Longest Palindromic Substring – Part 4
C/C++
#include <stdio.h>
#include <string.h>
char text[100];
int min(int a, int b)
{
int res = a;
if(b < a)
res = b;
return res;
}
void findLongestPalindromicString()
{
int N = strlen(text);
if(N == 0)
return;
N = 2*N + 1; //Position count
int L[N]; //LPS Length Array
L[0] = 0;
L[1] = 1;
int C = 1; //centerPosition
int R = 2; //centerRightPosition
int i = 0; //currentRightPosition
int iMirror; //currentLeftPosition
int maxLPSLength = 0;
int maxLPSCenterPosition = 0;
int start = -1;
int end = -1;
int diff = -1;
strcpy(text, "babcbabcbaccba");
findLongestPalindromicString();
strcpy(text, "abaaba");
findLongestPalindromicString();
strcpy(text, "abababa");
findLongestPalindromicString();
strcpy(text, "abcbabcbabcba");
findLongestPalindromicString();
strcpy(text, "forgeeksskeegfor");
findLongestPalindromicString();
strcpy(text, "caba");
findLongestPalindromicString();
strcpy(text, "abacdfgdcaba");
findLongestPalindromicString();
strcpy(text, "abacdfgdcabba");
findLongestPalindromicString();
strcpy(text, "abacdedcaba");
findLongestPalindromicString();
return 0;
}
Python
def findLongestPalindromicString(text):
N = len(text)
if N == 0:
return
N = 2*N+1 # Position count
L = [0] * N
L[0] = 0
L[1] = 1
C = 1 # centerPosition
R = 2 # centerRightPosition
i = 0 # currentRightPosition
iMirror = 0 # currentLeftPosition
maxLPSLength = 0
maxLPSCenterPosition = 0
start = -1
end = -1
diff = -1
iMirror = 2*C-i
L[i] = 0
diff = R - i
# If currentRightPosition i is within centerRightPosition R
if diff > 0:
L[i] = min(L[iMirror], diff)
# Driver program
text1 = "babcbabcbaccba"
findLongestPalindromicString(text1)
text2 = "abaaba"
findLongestPalindromicString(text2)
text3 = "abababa"
findLongestPalindromicString(text3)
text4 = "abcbabcbabcba"
findLongestPalindromicString(text4)
text5 = "forgeeksskeegfor"
findLongestPalindromicString(text5)
text6 = "caba"
findLongestPalindromicString(text6)
text7 = "abacdfgdcaba"
findLongestPalindromicString(text7)
text8 = "abacdfgdcabba"
findLongestPalindromicString(text8)
text9 = "abacdedcaba"
findLongestPalindromicString(text9)
Output:
Other Approaches
We have discussed two approaches: one in Part 3 and the other in the current article. In
both approaches, we worked on the given string itself, so we had to handle even and odd
positions differently while comparing characters for expansion (because even positions do
not represent any character in the string).
To avoid this different handling of even and odd positions, we need to make even positions
also represent some character (in fact, all even positions should represent the SAME
character, because they MUST match during character comparison). One way to do this is to
place some character at all even positions, either by modifying the given string or by
creating a new copy of it. For example, if the input string is “abcb”, the new string should
be “#a#b#c#b#” if we add # as the unique character at even positions.
The two approaches already discussed can be modified slightly to work on the modified string,
where different handling of even and odd positions is not needed.
We may also add two DIFFERENT characters (not used anywhere else in the string, at even
or odd positions) at the start and end of the string as sentinels to avoid bounds checks. With
these changes, string “abcb” will look like “^#a#b#c#b#$”, where ^ and $ are sentinels.
This implementation may look cleaner, at the cost of more memory.
We are not implementing these here, as it is a simple change to the given implementations.
An implementation of the approach discussed in the current article on a modified string can
be found at Longest Palindromic Substring Part II, with a Java translation of the same by Princeton.
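A minimal sketch of the modified-string idea is shown below. It assumes that the sentinel characters ^ and $ do not occur in the input; the helper names `with_separators`, `with_sentinels` and `longest_palindrome` are made up for this illustration and are not taken from the linked implementations.

```python
def with_separators(s, sep="#"):
    """'abcb' -> '#a#b#c#b#': every even position now holds the same character."""
    return sep + "".join(ch + sep for ch in s)


def with_sentinels(s, sep="#", left="^", right="$"):
    """'abcb' -> '^#a#b#c#b#$': two distinct sentinels guard both ends."""
    return left + with_separators(s, sep) + right


def longest_palindrome(s):
    t = with_sentinels(s)
    n = len(t)
    p = [0] * n                # p[i] = palindrome radius around t[i]
    c = r = 0                  # center and right edge of the current palindrome
    for i in range(1, n - 1):
        if i < r:
            p[i] = min(p[2 * c - i], r - i)
        # No bounds check: the sentinels never match any other character.
        while t[i + p[i] + 1] == t[i - p[i] - 1]:
            p[i] += 1
        if i + p[i] > r:
            c, r = i, i + p[i]
    length, center = max((v, i) for i, v in enumerate(p))
    start = (center - 1 - length) // 2
    return s[start:start + length]


print(with_sentinels("abcb"))
print(longest_palindrome("forgeeksskeegfor"))
```

Every position is handled the same way here, at the cost of building a second string of roughly twice the length.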
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/manachers-algorithm-linear-time-longest-palindromic-substring-part-4/
Chapter 42
If a comparison is needed at all, we only compare actual characters, which are at “odd”
positions like 1, 3, 5, 7, etc.
Even positions do not represent a character in the string, so no comparison is performed
for even positions.
If two characters at different odd positions match, they increase the LPS length by 2.
There are many ways to implement this, depending on how even and odd positions are
handled. One way is to first create a new string in which some unique character
(say #, $ etc.) is inserted at all even positions, and then run the algorithm on that (to avoid
handling even and odd positions differently). Another way is to work on the given string
itself, but then even and odd positions must be handled appropriately.
Chapter 42. Manacher’s Algorithm – Linear Time Longest Palindromic Substring – Part 3
Here we start with the given string itself. When expansion and character comparison are
required, we expand to the left and right positions one by one. When an odd position is
found, the comparison is done and the LPS length is incremented by ONE. When an even
position is found, no comparison is done and the LPS length is incremented by ONE (so
overall, one odd and one even position on both the left and right sides increase the
LPS length by TWO).
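The expansion rule just described can be isolated in a small Python helper. This is only a sketch of that one step: the name `expand_lps` and its `length` parameter (the LPS length already known before expansion) are assumptions for illustration.

```python
def expand_lps(text, i, length=0):
    """Expand the palindrome centered at position i, starting from `length`."""
    N = 2 * len(text) + 1      # position count
    while i + length + 1 < N and i - length > 0:
        nxt = i + length + 1   # next position on the right
        if nxt % 2 == 0:
            # Even position: no character there, just grow by ONE.
            length += 1
        elif text[nxt // 2] == text[(i - length - 1) // 2]:
            # Odd position: the two characters match, grow by ONE.
            length += 1
        else:
            break
    return length


print(expand_lps("abababa", 3))   # palindrome "aba" around position 3
```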
C/C++
#include <stdio.h>
#include <string.h>
char text[100];
void findLongestPalindromicString()
{
int N = strlen(text);
if(N == 0)
return;
N = 2*N + 1; //Position count
int L[N]; //LPS Length Array
L[0] = 0;
L[1] = 1;
int C = 1; //centerPosition
int R = 2; //centerRightPosition
int i = 0; //currentRightPosition
int iMirror; //currentLeftPosition
int expand = -1;
int diff = -1;
int maxLPSLength = 0;
int maxLPSCenterPosition = 0;
int start = -1;
int end = -1;
L[i] = L[iMirror];
expand = 1; // expansion required
}
else if(L[iMirror] > diff) // Case 4
{
L[i] = diff;
expand = 1; // expansion required
}
}
else
{
L[i] = 0;
expand = 1; // expansion required
}
if (expand == 1)
{
//Attempt to expand palindrome centered at currentRightPosition i
//Here for odd positions, we compare characters and
//if match then increment LPS Length by ONE
//If even position, we just increment LPS by ONE without
//any character comparison
while (((i + L[i]) < N && (i - L[i]) > 0) &&
( ((i + L[i] + 1) % 2 == 0) ||
(text[(i + L[i] + 1)/2] == text[(i-L[i]-1)/2] )))
{
L[i]++;
}
}
strcpy(text, "babcbabcbaccba");
findLongestPalindromicString();
strcpy(text, "abaaba");
findLongestPalindromicString();
strcpy(text, "abababa");
findLongestPalindromicString();
strcpy(text, "abcbabcbabcba");
findLongestPalindromicString();
strcpy(text, "forgeeksskeegfor");
findLongestPalindromicString();
strcpy(text, "caba");
findLongestPalindromicString();
strcpy(text, "abacdfgdcaba");
findLongestPalindromicString();
strcpy(text, "abacdfgdcabba");
findLongestPalindromicString();
strcpy(text, "abacdedcaba");
findLongestPalindromicString();
return 0;
}
Python
def findLongestPalindromicString(text):
N = len(text)
if N == 0:
return
N = 2*N+1 # Position count
L = [0] * N
L[0] = 0
L[1] = 1
C = 1 # centerPosition
R = 2 # centerRightPosition
i = 0 # currentRightPosition
iMirror = 0 # currentLeftPosition
maxLPSLength = 0
maxLPSCenterPosition = 0
start = -1
end = -1
diff = -1
# Driver program
text1 = "babcbabcbaccba"
findLongestPalindromicString(text1)
text2 = "abaaba"
findLongestPalindromicString(text2)
text3 = "abababa"
findLongestPalindromicString(text3)
text4 = "abcbabcbabcba"
findLongestPalindromicString(text4)
text5 = "forgeeksskeegfor"
findLongestPalindromicString(text5)
text6 = "caba"
findLongestPalindromicString(text6)
text7 = "abacdfgdcaba"
findLongestPalindromicString(text7)
text8 = "abacdfgdcabba"
findLongestPalindromicString(text8)
text9 = "abacdedcaba"
findLongestPalindromicString(text9)
Output:
This is the implementation based on the four cases discussed in Part 2. In Part 4, we
discuss a different way to look at these four cases and a few other approaches.
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/manachers-algorithm-linear-time-longest-palindromic-substring-part-3-2/
Chapter 43
We calculate LPS length values from left to right starting from position 0, so if we already
know the LPS length values at positions 1, 2 and 3, then we may not need to calculate the
LPS length at positions 4 and 5, because they are equal to the LPS length values at the
corresponding positions on the left side of position 3.
If we look around position 6:
Chapter 43. Manacher’s Algorithm – Linear Time Longest Palindromic Substring – Part 2
If we already know the LPS length values at positions 1, 2, 3, 4, 5, 6 and 7, then we may
not need to calculate the LPS length at positions 8, 9, 10, 11, 12 and 13, because they are
equal to the LPS length values at the corresponding positions on the left side of position 7.
Can you see why LPS length values are symmetric around positions 3, 6, 9 in string
“abaaba”? That’s because there is a palindromic substring around these positions. The same
is the case in string “abababa” around position 7.
Is it always true that LPS length values around a palindromic center position are
symmetric (same)?
Answer is NO.
Look at positions 3 and 11 in string “abababa”. Both positions have LPS length 3. The
immediate left and right positions are symmetric (with value 0), but not the next ones.
Positions 1 and 5 (around position 3) are not symmetric. Similarly, positions 9 and 13
(around position 11) are not symmetric.
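These symmetry claims can be checked with a short Python sketch. It computes the LPS length array naively (quadratic time, not Manacher's algorithm) and lists the offsets strictly inside a palindrome at which the left and right values differ. The helper names and the choice to look only strictly inside the palindrome are assumptions made for this illustration.

```python
def lps_lengths(s):
    """Naive LPS length at each of the 2*len(s) + 1 positions."""
    N = 2 * len(s) + 1
    L = [0] * N
    for i in range(N):
        d = 0
        while (i + d + 1 < N and i - d > 0 and
               ((i + d + 1) % 2 == 0 or
                s[(i + d + 1) // 2] == s[(i - d - 1) // 2])):
            d += 1
        L[i] = d
    return L


def asymmetric_offsets(L, center):
    """Offsets k (strictly inside the palindrome) with L[center-k] != L[center+k]."""
    return [k for k in range(1, L[center]) if L[center - k] != L[center + k]]


L = lps_lengths("abababa")
print(L)
print(asymmetric_offsets(L, 7))    # fully symmetric around position 7
print(asymmetric_offsets(L, 3))    # positions 1 and 5 differ
print(asymmetric_offsets(L, 11))   # positions 9 and 13 differ
```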
At this point, we can see that if there is a palindrome in a string centered at some position,
then the LPS length values around the center position may or may not be symmetric, depending
on the situation. If we can identify the situations in which left and right positions WILL BE
SYMMETRIC around the center position, we NEED NOT calculate the LPS length of the right
position, because it will be exactly the same as the LPS value of the corresponding position
on the left side, which is already known. Avoiding LPS length computation at a few positions
in this way is what makes Manacher’s Algorithm linear.
In situations where left and right positions WILL NOT BE SYMMETRIC around the center
position, we compare characters on the left and right sides to find the palindrome, but here
too the algorithm tries to avoid a certain number of comparisons. We will see all these
scenarios soon.
Let’s introduce a few terms to proceed further:
• centerPosition – This is the position for which LPS length is calculated and let’s say
LPS length at centerPosition is d (i.e. L[centerPosition] = d)
When we are at centerPosition, for which the LPS length is known, we also know the LPS
length of all positions smaller than centerPosition. Let’s say the LPS length at
centerPosition is d, i.e.
L[centerPosition] = d
It means that the substring between positions “centerPosition-d” and “centerPosition+d” is a
palindrome.
Now we proceed to calculate the LPS length of positions greater than centerPosition.
Let’s say we are at currentRightPosition ( > centerPosition), where we need to find the LPS
length.
For this, we look at the LPS length of currentLeftPosition, which is already calculated.
If the LPS length of currentLeftPosition is less than “centerRightPosition – currentRightPosition”,
then the LPS length of currentRightPosition will be equal to the LPS length of
currentLeftPosition. So
L[currentRightPosition] = L[currentLeftPosition] if L[currentLeftPosition] <
centerRightPosition – currentRightPosition. This is Case 1.
Let’s consider the scenario below for string “abababa”:
We have calculated LPS lengths up to position 7, where L[7] = 7. If we consider position 7
as centerPosition, then centerLeftPosition will be 0 and centerRightPosition will be 14.
Now we need to calculate the LPS length of the other positions on the right of centerPosition.
For currentRightPosition = 8, currentLeftPosition is 6 and L[currentLeftPosition] = 0
Also centerRightPosition – currentRightPosition = 14 – 8 = 6
Case 1 applies here and so L[currentRightPosition] = L[8] = 0
Case 1 also applies to positions 10 and 12, so,
L[10] = L[4] = 0
L[12] = L[2] = 0
If we look at position 9, then:
currentRightPosition = 9
currentLeftPosition = 2* centerPosition – currentRightPosition = 2*7 – 9 = 5
centerRightPosition – currentRightPosition = 14 – 9 = 5
Here L[currentLeftPosition] = centerRightPosition – currentRightPosition, so Case 1 doesn’t
apply here. Also note that centerRightPosition is the extreme end position of the string.
That means center palindrome is suffix of input string. In that case, L[currentRightPosition]
= L[currentLeftPosition]. This is Case 2.
Case 2 applies to positions 9, 11, 13 and 14, so:
L[9] = L[5] = 5
L[11] = L[3] = 3
L[13] = L[1] = 1
L[14] = L[0] = 0
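The walk-through above for “abababa” can be replayed with a small Python sketch. It covers only Case 1 and Case 2 as defined in the text; any other situation is reported as needing expansion. The function name `classify` and its return labels are assumptions made for illustration.

```python
def classify(L, center, i, N):
    """Classify currentRightPosition i against a known center palindrome."""
    center_right = center + L[center]    # centerRightPosition
    mirror = 2 * center - i              # currentLeftPosition
    diff = center_right - i
    if L[mirror] < diff:
        return "Case 1"
    if L[mirror] == diff and center_right == N - 1:
        return "Case 2"
    return "needs expansion"


L = [0, 1, 0, 3, 0, 5, 0, 7]   # known values for "abababa" up to centerPosition 7
N = 15                          # 2*7 + 1 positions
for i in range(8, N):
    case = classify(L, 7, i, N)
    if case == "needs expansion":
        break
    # Case 1 and Case 2 both copy the value from the mirror position.
    L.append(L[2 * 7 - i])
    print(i, case, L[i])
```

No character is compared in this loop; all seven remaining values come from the mirror positions.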
What is really happening in Case 1 and Case 2? They simply use the palindromic
symmetry property to find the LPS length of new positions without any character
comparison.
When a bigger palindrome contains a smaller palindrome centered to the left of
its own center, then by the symmetry property there will be another identical smaller
palindrome centered to the right of the bigger palindrome’s center. If the left-side smaller
palindrome is not a prefix of the bigger palindrome, then Case 1 applies; if it is a prefix AND
the bigger palindrome is a suffix of the input string itself, then Case 2 applies.
The longest palindrome i places to the right of the current center (the i-right palindrome)
is as long as the longest palindrome i places to the left of the current center (the i-left
palindrome) if the i-left palindrome is completely contained in the longest palindrome around
the current center (the center palindrome) and the i-left palindrome is not a prefix of the
center palindrome (Case 1) or (i.e. when i-left palindrome is a prefix of center palindrome)
if the center palindrome is a suffix of the entire string (Case 2).
In Case 1 and Case 2, i-right palindrome can’t expand more than corresponding i-left
palindrome (can you visualize why it can’t expand more?), and so LPS length of i-right
palindrome is exactly same as LPS length of i-left palindrome.
Here both i-left and i-right palindromes are completely contained in center palindrome (i.e.
L[currentLeftPosition] <= centerRightPosition – currentRightPosition)
Now if i-left palindrome is not a prefix of center palindrome (L[currentLeftPosition] <
centerRightPosition – currentRightPosition), that means that i-left palindrome was not able
to expand up to position centerLeftPosition.
If we look at following with centerPosition = 11, then
In this case, there is a possibility of i-right palindrome expansion and so length of i-right
palindrome is at least as long as length of i-left palindrome.
Case 4: L[currentRightPosition] > = centerRightPosition – currentRightPosition applies
when:
Source
https://www.geeksforgeeks.org/manachers-algorithm-linear-time-longest-palindromic-substring-part-2/
Chapter 44
We have already discussed the naïve [O(n^3)] and quadratic [O(n^2)] approaches in Set 1 and
Set 2.
In this article, we will talk about Manacher’s algorithm which finds Longest Palindromic
Substring in linear time.
One way (Set 2) to find a palindrome is to start from the center of the string and compare
characters in both directions one by one. If corresponding characters on both sides (left and
right of the center) match, then they will make a palindrome.
Let’s consider string “abababa”.
Here center of the string is 4th character (with index 3) b. If we match characters in left
and right of the center, all characters match and so string “abababa” is a palindrome.
Here center position is not only the actual string character position but it could be the
position between two characters also.
Consider string “abaaba” of even length. This string is palindrome around the position
between 3rd and 4th characters a and a respectively.
Chapter 44. Manacher’s Algorithm – Linear Time Longest Palindromic Substring – Part 1
To find the Longest Palindromic Substring of a string of length N, one way is to take each of
the 2*N + 1 possible centers (the N character positions, the N-1 positions between two
characters and the 2 positions at the left and right ends), match characters in both the left
and right directions at each of these centers, and keep track of the LPS. This approach takes
O(N^2) time and that’s what we are doing in Set 2.
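The quadratic approach just described might be sketched as follows. The function name and variable names are assumptions; the code visits all 2*N + 1 positions, including the two end positions where nothing can match.

```python
def longest_palindrome_naive(s):
    """Try all 2*N + 1 centers and keep the longest palindrome found."""
    n = len(s)
    best_start, best_len = 0, 0
    for pos in range(2 * n + 1):
        # Odd positions sit on a character, even positions sit in a gap.
        left = (pos - 1) // 2
        right = pos // 2
        while left >= 0 and right < n and s[left] == s[right]:
            left -= 1
            right += 1
        length = right - left - 1
        if length > best_len:
            best_start, best_len = left + 1, length
    return s[best_start:best_start + best_len]


print(longest_palindrome_naive("abababa"))
print(longest_palindrome_naive("abaaba"))
```

Each center starts its comparisons from scratch, which is the repeated work that Manacher's algorithm avoids.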
Let’s consider two strings “abababa” and “abaaba” as shown below:
In these two strings, left and right side of the center positions (position 7 in 1st string and
position 6 in 2nd string) are symmetric. Why? Because the whole string is palindrome
around the center position.
If we need to calculate the Longest Palindromic Substring at each of the 2*N+1 positions from
left to right, then the palindrome’s symmetric property can help to avoid some unnecessary
computations (i.e. character comparisons). If there is a palindrome of some length L centered
at any position P, then we may not need to compare all characters on the left and right sides
at position P+1. We have already calculated the LPS at positions before P, and they can help
to avoid some of the comparisons after position P.
This reuse of information from previous positions at a later point is what makes Manacher’s
algorithm linear. In Set 2, there is no reuse of previous information, and so that approach
is quadratic.
Manacher’s algorithm is often considered complex to understand, so here we will discuss
it in as much detail as we can. Some of its portions may require multiple readings to
understand properly.
Let’s look at string “abababa”. In 3rd figure above, 15 center positions are shown. We need
to calculate length of longest palindromic string at each of these positions.
• At position 0, there is no LPS at all (no character on left side to compare), so length
of LPS will be 0.
• At position 1, LPS is a, so length of LPS will be 1.
• At position 2, there is no LPS at all (left and right characters a and b don’t match),
so length of LPS will be 0.
• At position 3, LPS is aba, so length of LPS will be 3.
• At position 4, there is no LPS at all (left and right characters b and a don’t match),
so length of LPS will be 0.
• At position 5, LPS is ababa, so length of LPS will be 5.
…… and so on
We store all these palindromic lengths in an array, say L. Then string S and LPS Length L
look like below:
In LPS Array L:
• LPS length value at odd positions (the actual character positions) will be odd and
greater than or equal to 1 (1 will come from the center character itself if nothing else
matches in left and right side of it)
• LPS length value at even positions (the positions between two characters, extreme left
and right positions) will be even and greater than or equal to 0 (0 will come when
there is no match in left and right side)
Position and index for the string are two different things here. For a given string
S of length N, indexes will be from 0 to N-1 (total N indexes) and positions will
be from 0 to 2*N (total 2*N+1 positions).
The LPS length value can be interpreted in two ways, one in terms of index and the other in
terms of position. LPS value d at position i (L[i] = d) tells that:
• Substring from position i-d to i+d is a palindrome of length d (in terms of position)
• Substring from index (i-d)/2 to [(i+d)/2 – 1] is a palindrome of length d (in terms of
index)
e.g. in string “abaaba”, L[3] = 3 means substring from position 0 (3-3) to 6 (3+3) is a
palindrome which is “aba” of length 3, it also means that substring from index 0 [(3-3)/2]
to 2 [(3+3)/2 – 1] is a palindrome which is “aba” of length 3.
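The position-to-index conversion above is easy to get wrong by one, so here is a small Python sketch of it. The function name `palindrome_at` is an assumption; it takes a position i and an LPS value d and returns the substring from index (i-d)/2 to (i+d)/2 – 1.

```python
def palindrome_at(s, i, d):
    """Palindrome of length d centered at position i, given as a substring of s."""
    start = (i - d) // 2        # first index
    end = (i + d) // 2 - 1      # last index (inclusive)
    return s[start:end + 1]


print(palindrome_at("abaaba", 3, 3))    # aba
print(palindrome_at("abaaba", 6, 6))    # abaaba
print(palindrome_at("abababa", 5, 5))   # ababa
```

Since odd positions have odd LPS values and even positions have even ones, i-d and i+d are always even and the divisions are exact.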
Now the main task is to compute LPS array efficiently. Once this array is computed, LPS
of string S will be centered at position with maximum LPS length value.
We will see it in Part 2.
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/manachers-algorithm-linear-time-longest-palindromic-substring-part-1/
Chapter 45
Chapter 45. Suffix Tree Application 5 – Longest Common Substring
• both strings X and Y (i.e. there is at least one leaf with suffix index in [0,4] and one
leaf with suffix index in [6,11])
• string X only (i.e. all leaf nodes have suffix indices in [0,4])
• string Y only (i.e. all leaf nodes have suffix indices in [6,11])
The following figure shows the internal nodes marked as “XY”, “X” or “Y” depending on which
string the leaves below them belong to.
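The marking idea can be tried out on a naive, uncompressed suffix trie. The sketch below is not the linear-time suffix tree construction used in this article: it inserts every suffix character by character (quadratic space), marks each node with the strings whose suffixes pass through it, and returns the path label of the deepest node marked by both. The function name and the node layout (a dict with `kids` and `marks`) are assumptions for illustration.

```python
def longest_common_substring(x, y):
    root = {"kids": {}, "marks": set()}
    # Insert all suffixes of X and of Y, marking every node on the way.
    for mark, s in (("X", x), ("Y", y)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "marks": set()})
                node["marks"].add(mark)
    # Depth-first search through "XY" nodes only, keeping the deepest one.
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["kids"].items():
            if child["marks"] == {"X", "Y"}:
                path = label + ch
                if len(path) > len(best):
                    best = path
                stack.append((child, path))
    return best


print(longest_common_substring("xabxac", "abcabxabcd"))
print(longest_common_substring("GeeksforGeeks", "GeeksQuiz"))
```

Because the two strings are inserted separately, this sketch needs no # or $ terminal symbols; the suffix tree version uses them to tell the suffixes of X and Y apart by suffix index.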
struct SuffixTreeNode {
struct SuffixTreeNode *children[MAX_CHAR];
{
Node *node =(Node*) malloc(sizeof(Node));
int i;
for (i = 0; i < MAX_CHAR; i++)
node->children[i] = NULL;
leafEnd = pos;
if (activeLength == 0)
activeEdge = pos; //APCFALZ
//APCFER3
activeLength++;
/*STOP all further processing in this phase
and move on to next phase*/
break;
}
//Print the suffix tree as well along with setting suffix index
//So tree will be printed in DFS manner
//Each edge along with it's suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
if (n == NULL) return;
{
//Print the label on edge from parent to current node
//Uncomment below line to print suffix tree
//print(n->start, *(n->end));
}
int leaf = 1;
int i;
for (i = 0; i < MAX_CHAR; i++)
{
if (n->children[i] != NULL)
{
//Uncomment below two lines to print suffix index
// if (leaf == 1 && n->start != -1)
// printf(" [%d]\n", n->suffixIndex);
}
}
if (n->suffixIndex == -1)
free(n->end);
free(n);
}
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
size = strlen(text);
int i;
rootEnd = (int*) malloc(sizeof(int));
*rootEnd = - 1;
if(n->suffixIndex == -1)
n->suffixIndex = ret;
else if((n->suffixIndex == -2 && ret == -3) ||
(n->suffixIndex == -3 && ret == -2) ||
n->suffixIndex == -4)
{
n->suffixIndex = -4;//Mark node as XY
//Keep track of deepest node
if(*maxHeight < labelHeight)
{
*maxHeight = labelHeight;
*substringStartIndex = *(n->end) -
labelHeight + 1;
}
}
}
}
}
else if(n->suffixIndex > -1 && n->suffixIndex < size1)//suffix of X
return -2;//Mark node as X
else if(n->suffixIndex >= size1)//suffix of Y
return -3;//Mark node as Y
return n->suffixIndex;
}
void getLongestCommonSubstring()
{
int maxHeight = 0;
int substringStartIndex = 0;
doTraversal(root, 0, &maxHeight, &substringStartIndex);
int k;
for (k=0; k<maxHeight; k++)
printf("%c", text[k + substringStartIndex]);
if(k == 0)
printf("No common substring");
else
printf(", of length: %d",maxHeight);
printf("\n");
}
freeSuffixTreeByPostOrder(root);
size1 = 10;
printf("Longest Common Substring in xabxaabxa and babxba is: ");
strcpy(text, "xabxaabxa#babxba$"); buildSuffixTree();
getLongestCommonSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 14;
printf("Longest Common Substring in GeeksforGeeks and GeeksQuiz is: ");
strcpy(text, "GeeksforGeeks#GeeksQuiz$"); buildSuffixTree();
getLongestCommonSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 26;
printf("Longest Common Substring in OldSite:GeeksforGeeks.org");
printf(" and NewSite:GeeksQuiz.com is: ");
strcpy(text, "OldSite:GeeksforGeeks.org#NewSite:GeeksQuiz.com$");
buildSuffixTree();
getLongestCommonSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 6;
printf("Longest Common Substring in abcde and fghie is: ");
strcpy(text, "abcde#fghie$"); buildSuffixTree();
getLongestCommonSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 6;
printf("Longest Common Substring in pqrst and uvwxyz is: ");
strcpy(text, "pqrst#uvwxyz$"); buildSuffixTree();
getLongestCommonSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
return 0;
}
Output:
If two strings are of size M and N, then Generalized Suffix Tree construction takes O(M+N)
and LCS finding is a DFS on tree which is again O(M+N).
So overall complexity is linear in time and space.
Followup:
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/suffix-tree-application-5-longest-common-substring-2/
Chapter 46
Chapter 46. Generalized Suffix Tree 1
Pictorial View:
We can use this tree to solve some of the problems, but we can refine it a bit by removing
unwanted substrings on a path label. A path label should have substring from only one input
string, so if there are path labels having substrings from multiple input strings, we can keep
only the initial portion corresponding to one string and remove all the later portion. For
example, for path labels #babxba$, a#babxba$ and bxa#babxba$, we can remove babxba$
(belongs to 2nd input string) and then new path labels will be #, a# and bxa# respectively.
With this change, above diagram will look like below:
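The trimming rule can be illustrated on the path labels themselves. This is only a sketch on plain strings, assuming # terminates the first input string; the helper name `trim_label` is hypothetical and the article's implementation works on edge indices inside the tree instead.

```python
def trim_label(label, terminal="#"):
    """Keep a path label only up to and including the first string's terminal."""
    cut = label.find(terminal)
    return label if cut == -1 else label[:cut + 1]


for label in ("#babxba$", "a#babxba$", "bxa#babxba$", "ba$"):
    print(label, "->", trim_label(label))
```

Labels that never cross into the second string, such as ba$, are left unchanged.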
struct SuffixTreeNode {
struct SuffixTreeNode *children[MAX_CHAR];
while(remainingSuffixCount > 0) {
if (activeLength == 0)
activeEdge = pos; //APCFALZ
lastNewNode = NULL;
}
//APCFER3
activeLength++;
/*STOP all further processing in this phase
and move on to next phase*/
break;
}
lastNewNode = split;
}
//Print the suffix tree as well along with setting suffix index
//So tree will be printed in DFS manner
//Each edge along with it's suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
if (n == NULL) return;
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
size = strlen(text);
int i;
rootEnd = (int*) malloc(sizeof(int));
*rootEnd = - 1;
Output: (You can see that below output corresponds to the 2nd Figure shown above)
# [5]
$ [12]
a [-1]
# [4]
$ [11]
bx [-1]
a# [1]
ba$ [7]
b [-1]
a [-1]
$ [10]
bxba$ [6]
x [-1]
a# [2]
ba$ [8]
x [-1]
a [-1]
# [3]
bxa# [0]
ba$ [9]
If two strings are of size M and N, this implementation will take O(M+N) time and space.
If input strings are not concatenated already, then it will take 2(M+N) space in total, M+N
space to store the generalized suffix tree and another M+N space to store concatenated
string.
Followup:
Extend above implementation for more than two strings (i.e. concatenate all strings using
unique terminal symbols and then build suffix tree for concatenated string)
One problem with this approach is the need for a unique terminal symbol for each input string.
This works for a few strings, but if there are too many input strings, we may not be able to
find that many unique terminal symbols.
We will soon discuss another approach to building a generalized suffix tree that needs
only one unique terminal symbol. That resolves the above problem and can be used
to build a generalized suffix tree for any number of input strings.
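The concatenation step of the followup, and the limitation just mentioned, can be sketched as below. The function name and the default pool of candidate terminal symbols are assumptions chosen for illustration; a real implementation would pick whatever symbols its alphabet leaves free.

```python
def concatenate_with_terminals(strings, candidates="#$%&@!"):
    """Join the strings, ending each one with its own unused terminal symbol."""
    used = set("".join(strings))
    free = [c for c in candidates if c not in used]
    if len(free) < len(strings):
        # This is the limitation described above: too many input strings.
        raise ValueError("not enough unused terminal symbols")
    return "".join(s + t for s, t in zip(strings, free))


print(concatenate_with_terminals(["xabxa", "babxba"]))
print(concatenate_with_terminals(["ab", "bc", "cd"]))
```

The first call reproduces the concatenated string xabxa#babxba$ whose suffix tree is printed in the output above.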
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/generalized-suffix-tree-1/
Chapter 47
0 6 3 1 7 4 2 8 9 5
Chapter 47. Suffix Tree Application 4 – Build Linear Time Suffix Array
If we do a DFS traversal, visiting edges in lexicographic order (we have been doing the same
traversal in other Suffix Tree Application articles as well) and print suffix indices on leaves,
we will get following:
10 0 6 3 1 7 4 2 8 9 5
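The lexicographic DFS can be tried out on a naive, uncompressed suffix trie. The sketch below is quadratic and is not the linear-time suffix tree of this article; it only shows that visiting children in sorted order and collecting suffix indices at the leaves yields the suffix array. The function name and the use of a "leaf" key to store the suffix index are assumptions for illustration.

```python
def suffix_array_via_trie(text):
    """Suffix array by lexicographic DFS over a naive suffix trie.

    Assumes text ends with a unique terminal symbol such as '$'.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start    # "leaf" cannot clash with one-character keys
    order = []

    def dfs(node):
        if "leaf" in node:
            order.append(node["leaf"])
        for ch in sorted(k for k in node if k != "leaf"):
            dfs(node[ch])

    dfs(root)
    return order


print(suffix_array_via_trie("abcabxabcd$"))
```

For abcabxabcd$ this prints the sequence shown above; dropping the leading 10 (the suffix that consists of $ alone) leaves the suffix array of abcabxabcd.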
struct SuffixTreeNode {
struct SuffixTreeNode *children[MAX_CHAR];
if (activeLength == 0)
activeEdge = pos; //APCFALZ
// with activeEdge
Node *next = activeNode->children[text[activeEdge]];
if (walkDown(next))//Do walkdown
{
//Start from next node (the new activeNode)
continue;
}
/*Extension Rule 3 (current character being processed
is already on the edge)*/
if (text[next->start + activeLength] == text[pos])
{
//If a newly created node is waiting for its
//suffix link to be set, then set the suffix link
//of that waiting node to the current active node
if(lastNewNode != NULL && activeNode != root)
{
lastNewNode->suffixLink = activeNode;
lastNewNode = NULL;
}
//APCFER3
activeLength++;
/*STOP all further processing in this phase
and move on to next phase*/
break;
}
//Print the suffix tree as well, along with setting the suffix index,
//so the tree will be printed in DFS manner.
//Each edge along with its suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
if (n == NULL) return;
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
size = strlen(text);
int i;
rootEnd = (int*) malloc(sizeof(int));
*rootEnd = - 1;
int i = 0;
for(i=0; i< size; i++)
suffixArray[i] = -1;
int idx = 0;
doTraversal(root, suffixArray, &idx);
printf("Suffix Array for String ");
for(i=0; i<size; i++)
printf("%c", text[i]);
printf(" is: ");
for(i=0; i<size; i++)
printf("%d ", suffixArray[i]);
printf("\n");
}
strcpy(text, "GEEKSFORGEEKS$");
buildSuffixTree();
size--;
suffixArray =(int*) malloc(sizeof(int) * size);
buildSuffixArray(suffixArray);
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
free(suffixArray);
strcpy(text, "AAAAAAAAAA$");
buildSuffixTree();
size--;
suffixArray =(int*) malloc(sizeof(int) * size);
buildSuffixArray(suffixArray);
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
free(suffixArray);
strcpy(text, "ABCDEFG$");
buildSuffixTree();
size--;
suffixArray =(int*) malloc(sizeof(int) * size);
buildSuffixArray(suffixArray);
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
free(suffixArray);
strcpy(text, "ABABABA$");
buildSuffixTree();
size--;
suffixArray =(int*) malloc(sizeof(int) * size);
buildSuffixArray(suffixArray);
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
free(suffixArray);
strcpy(text, "abcabxabcd$");
buildSuffixTree();
size--;
suffixArray =(int*) malloc(sizeof(int) * size);
buildSuffixArray(suffixArray);
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
free(suffixArray);
strcpy(text, "CCAAACCCGATTA$");
buildSuffixTree();
size--;
suffixArray =(int*) malloc(sizeof(int) * size);
buildSuffixArray(suffixArray);
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
free(suffixArray);
return 0;
}
Output:
Ukkonen’s Suffix Tree Construction takes O(N) time and space to build the suffix tree for a
string of length N, and after that, a traversal of the tree takes O(N) to build the suffix array.
So overall, it is linear in time and space.
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/suffix-tree-application-4-build-linear-time-suffix-array/
Chapter 48
This problem can be solved by different approaches with varying time and space complexities.
Here we will discuss the suffix tree approach (3rd Suffix Tree Application). Other approaches
will be discussed soon.
As a prerequisite, we must know how to build a suffix tree in one way or another.
Here we will build the suffix tree using Ukkonen’s Algorithm, already discussed in the following articles:
Ukkonen’s Suffix Tree Construction – Part 1
Ukkonen’s Suffix Tree Construction – Part 2
Ukkonen’s Suffix Tree Construction – Part 3
Ukkonen’s Suffix Tree Construction – Part 4
Ukkonen’s Suffix Tree Construction – Part 5
Ukkonen’s Suffix Tree Construction – Part 6
Let’s look at the following figure:
Chapter 48. Suffix Tree Application 3 – Longest Repeated Substring
• Path with Substring “A” has three internal nodes down the tree
• Path with Substring “AB” has two internal nodes down the tree
• Path with Substring “ABA” has two internal nodes down the tree
• Path with Substring “ABAB” has one internal node down the tree
• Path with Substring “ABABA” has one internal node down the tree
• Path with Substring “B” has two internal nodes down the tree
• Path with Substring “BA” has two internal nodes down the tree
• Path with Substring “BAB” has one internal node down the tree
• Path with Substring “BABA” has one internal node down the tree
is farthest from the root (i.e. deepest node in the tree), because length of substring is the
path label length from root to that internal node.
So finding the longest repeated substring boils down to finding the deepest internal node in the
suffix tree and then getting the path label from the root to that deepest internal node.
struct SuffixTreeNode {
struct SuffixTreeNode *children[MAX_CHAR];
if (activeLength == 0)
activeEdge = pos; //APCFALZ
{
lastNewNode->suffixLink = activeNode;
lastNewNode = NULL;
}
}
// There is an outgoing edge starting with activeEdge
// from activeNode
else
{
// Get the next node at the end of edge starting
// with activeEdge
Node *next = activeNode->children[text[activeEdge]];
if (walkDown(next))//Do walkdown
{
//Start from next node (the new activeNode)
continue;
}
/*Extension Rule 3 (current character being processed
is already on the edge)*/
if (text[next->start + activeLength] == text[pos])
{
//If a newly created node is waiting for its
//suffix link to be set, then set the suffix link
//of that waiting node to the current active node
if(lastNewNode != NULL && activeNode != root)
{
lastNewNode->suffixLink = activeNode;
lastNewNode = NULL;
}
//APCFER3
activeLength++;
/*STOP all further processing in this phase
and move on to next phase*/
break;
}
//Print the suffix tree as well, along with setting the suffix index,
//so the tree will be printed in DFS manner.
//Each edge along with its suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
if (n == NULL) return;
{
if (n->children[i] != NULL)
{
freeSuffixTreeByPostOrder(n->children[i]);
}
}
if (n->suffixIndex == -1)
free(n->end);
free(n);
}
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
size = strlen(text);
int i;
rootEnd = (int*) malloc(sizeof(int));
*rootEnd = - 1;
substringStartIndex);
}
}
}
else if(n->suffixIndex > -1 &&
(*maxHeight < labelHeight - edgeLength(n)))
{
*maxHeight = labelHeight - edgeLength(n);
*substringStartIndex = n->suffixIndex;
}
}
void getLongestRepeatedSubstring()
{
int maxHeight = 0;
int substringStartIndex = 0;
doTraversal(root, 0, &maxHeight, &substringStartIndex);
// printf("maxHeight %d, substringStartIndex %d\n", maxHeight,
// substringStartIndex);
printf("Longest Repeated Substring in %s is: ", text);
int k;
for (k=0; k<maxHeight; k++)
printf("%c", text[k + substringStartIndex]);
if(k == 0)
printf("No repeated substring");
printf("\n");
}
strcpy(text, "AAAAAAAAAA$");
buildSuffixTree();
getLongestRepeatedSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
strcpy(text, "ABCDEFG$");
buildSuffixTree();
getLongestRepeatedSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
strcpy(text, "ABABABA$");
buildSuffixTree();
getLongestRepeatedSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
strcpy(text, "ATCGATCGA$");
buildSuffixTree();
getLongestRepeatedSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
strcpy(text, "banana$");
buildSuffixTree();
getLongestRepeatedSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
strcpy(text, "abcpqrabpqpq$");
buildSuffixTree();
getLongestRepeatedSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
strcpy(text, "pqrpqpqabab$");
buildSuffixTree();
getLongestRepeatedSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
return 0;
}
Output:
In case of multiple LRS (as in the last two test cases), this implementation prints the
LRS that comes first lexicographically.
Ukkonen’s Suffix Tree Construction takes O(N) time and space to build the suffix tree for a
string of length N, and after that, finding the deepest internal node takes O(N).
So it is linear in time and space.
Followup questions:
All these problems can be solved in linear time with few changes in above implementation.
We have published following more articles on suffix tree applications:
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/
Chapter 49
Chapter 49. Suffix Tree Application 2 – Searching All Patterns
This is the suffix tree for the string “abcabxabcd$”, showing suffix indices and edge label indices
(start, end). The (sub)string values on edges are shown only for explanatory purposes. We
never store the path label string in the tree.
The suffix index of a path tells the index of a substring (starting from root) on that path.
Consider the path “bcd$” in the above tree with suffix index 7. It tells us that the substrings b, bc, bcd,
bcd$ are at index 7 in the string.
Similarly, the path “bxabcd$” with suffix index 4 tells us that the substrings b, bx, bxa, bxab, bxabc,
bxabcd, bxabcd$ are at index 4.
Similarly, the path “bcabxabcd$” with suffix index 1 tells us that the substrings b, bc, bca, bcab, bcabx,
bcabxa, bcabxab, bcabxabc, bcabxabcd, bcabxabcd$ are at index 1.
If we look at all three of the above paths together, we can see that:
Can you see how to find all the occurrences of a pattern in a string?
1. First, check whether the given pattern really exists in the string (as we did in Substring
Check). For this, traverse the suffix tree against the pattern.
2. If you find the pattern in the suffix tree (i.e. you don’t fall off the tree), then traverse the subtree
below that point and find all suffix indices on leaf nodes. All those suffix indices will
be pattern indices in the string.
#include <stdlib.h>
#define MAX_CHAR 256
struct SuffixTreeNode {
struct SuffixTreeNode *children[MAX_CHAR];
if (activeLength == 0)
activeEdge = pos; //APCFALZ
//APCFER3
activeLength++;
/*STOP all further processing in this phase
and move on to next phase*/
break;
}
//Print the suffix tree as well, along with setting the suffix index,
//so the tree will be printed in DFS manner.
//Each edge along with its suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
if (n == NULL) return;
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
size = strlen(text);
int i;
rootEnd = (int*) malloc(sizeof(int));
*rootEnd = - 1;
strcpy(text, "AABAACAADAABAAABAA$");
buildSuffixTree();
printf("\n\nText: AABAACAADAABAAABAA, Pattern to search: AABA");
checkForSubString("AABA");
printf("\n\nText: AABAACAADAABAAABAA, Pattern to search: AA");
checkForSubString("AA");
printf("\n\nText: AABAACAADAABAAABAA, Pattern to search: AAE");
checkForSubString("AAE");
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
strcpy(text, "AAAAAAAAA$");
buildSuffixTree();
printf("\n\nText: AAAAAAAAA, Pattern to search: AAAA");
checkForSubString("AAAA");
printf("\n\nText: AAAAAAAAA, Pattern to search: AA");
checkForSubString("AA");
printf("\n\nText: AAAAAAAAA, Pattern to search: A");
checkForSubString("A");
printf("\n\nText: AAAAAAAAA, Pattern to search: AB");
checkForSubString("AB");
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
return 0;
}
Output:
Ukkonen’s Suffix Tree Construction takes O(N) time and space to build the suffix tree for a
string of length N. After that, the traversal for the substring check takes O(M) for a pattern of
length M, and if there are Z occurrences of the pattern, it takes O(Z) to find the indices
of all those Z occurrences.
Overall pattern search complexity is linear: O(M + Z).
A bit more detailed analysis
How many internal nodes will there be in a suffix tree of a string of length N?
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/suffix-tree-application-2-searching-all-patterns/
Chapter 50
Chapter 50. Suffix Tree Application 1 – Substring Check
Here we will build the suffix tree using Ukkonen’s Algorithm, already discussed in the following articles:
Ukkonen’s Suffix Tree Construction – Part 1
Ukkonen’s Suffix Tree Construction – Part 2
Ukkonen’s Suffix Tree Construction – Part 3
Ukkonen’s Suffix Tree Construction – Part 4
Ukkonen’s Suffix Tree Construction – Part 5
Ukkonen’s Suffix Tree Construction – Part 6
The core traversal implementation for the substring check can be modified accordingly for suffix
trees built by other algorithms.
struct SuffixTreeNode {
struct SuffixTreeNode *children[MAX_CHAR];
if (activeLength == 0)
activeEdge = pos; //APCFALZ
//APCFER3
activeLength++;
/*STOP all further processing in this phase
and move on to next phase*/
break;
}
//Print the suffix tree as well, along with setting the suffix index,
//so the tree will be printed in DFS manner.
//Each edge along with its suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
if (n == NULL) return;
if (n == NULL)
return;
int i;
for (i = 0; i < MAX_CHAR; i++)
{
if (n->children[i] != NULL)
{
freeSuffixTreeByPostOrder(n->children[i]);
}
}
if (n->suffixIndex == -1)
free(n->end);
free(n);
}
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
size = strlen(text);
int i;
rootEnd = (int*) malloc(sizeof(int));
*rootEnd = - 1;
checkForSubString("TEST");
checkForSubString("A");
checkForSubString(" ");
checkForSubString("IS A");
checkForSubString(" IS A ");
checkForSubString("TEST1");
checkForSubString("THIS IS GOOD");
checkForSubString("TES");
checkForSubString("TESA");
checkForSubString("ISB");
return 0;
}
Output:
Ukkonen’s Suffix Tree Construction takes O(N) time and space to build the suffix tree for a
string of length N, and after that, the traversal for the substring check takes O(M) for a pattern of
length M.
With a slight modification to the traversal algorithm discussed here, we can answer the following:
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/suffix-tree-application-1-substring-check/
Chapter 51
Chapter 51. Ukkonen’s Suffix Tree Construction – Part 6
• children – This will be an array of alphabet size. It stores all the child
nodes of the current node, on different edges starting with different characters.
• suffixLink – This points to the other node to which the current node is connected via a suffix
link.
• start, end – These two store the edge label details from the parent node to the current
node. The (start, end) interval specifies the edge by which the node is connected to its
parent node. Each edge connects two nodes, one parent and one child, and the (start,
end) interval of a given edge is stored in the child node. Let’s say there are two
nodes A (parent) and B (child) connected by an edge with indices (5, 8); then these
indices (5, 8) will be stored in node B.
• suffixIndex – This will be non-negative for leaves and gives the index of the suffix for the
path from root to this leaf. For a non-leaf node, it will be -1.
This data structure answers the required queries quickly, as below:
• How to check if a node is root? Root is a special node with no parent, so its
start and end will be -1. For all other nodes, start and end indices will be non-negative.
• How to check if a node is an internal or a leaf node? suffixIndex will help here. It will
be -1 for an internal node and non-negative for leaf nodes.
• What is the length of the path label on some edge? Each edge has start and end
indices, and the length of the path label is end-start+1.
• What is the path label on some edge? If the string is S, then the path label is the
substring of S from start index to end index inclusive, [start, end].
• How to check if there is an outgoing edge for a given character c from a node A?
If A->children[c] is not NULL, there is a path; if NULL, there is no path.
• What is the character value on an edge at some given distance d from a node A?
The character at distance d from node A is S[A->start + d], where S is the string.
• Where does an internal node point via its suffix link? Node A points to
A->suffixLink.
• What is the suffix index on a path from root to leaf? If the leaf node on the path is A,
then the suffix index on that path is A->suffixIndex.
Following is C implementation of Ukkonen’s Suffix Tree Construction. The code may look
a bit lengthy, probably because of a good amount of comments.
struct SuffixTreeNode {
struct SuffixTreeNode *children[MAX_CHAR];
node->suffixLink = root;
node->start = start;
node->end = end;
if (activeLength == 0)
activeEdge = pos; //APCFALZ
lastNewNode->suffixLink = activeNode;
lastNewNode = NULL;
}
//APCFER3
activeLength++;
/*STOP all further processing in this phase
and move on to next phase*/
break;
}
//Print the suffix tree as well, along with setting the suffix index,
//so the tree will be printed in DFS manner.
//Each edge along with its suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
if (n == NULL) return;
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
size = strlen(text);
int i;
rootEnd = (int*) malloc(sizeof(int));
*rootEnd = - 1;
setSuffixIndexByDFS(root, labelHeight);
Output (Each edge of Tree, along with suffix index of child node on edge, is printed in DFS
order. To understand the output better, match it with the last figure no 43 in previous Part
5 article):
$ [10]
ab [-1]
c [-1]
abxabcd$ [0]
d$ [6]
xabcd$ [3]
b [-1]
c [-1]
abxabcd$ [1]
d$ [7]
xabcd$ [4]
c [-1]
abxabcd$ [2]
d$ [8]
d$ [9]
xabcd$ [5]
Now that we are able to build a suffix tree in linear time, we can solve many string problems
efficiently:
• Check if a given pattern P is a substring of text T (useful when the text is fixed and the pattern
changes; KMP otherwise)
The above basic problems can be solved by DFS traversal on suffix tree.
We will soon post articles on above problems and others like below:
And More.
Test your understanding:
1. Draw the suffix tree (with proper suffix links and suffix indices) for the string
“AABAACAADAABAAABAA$” on paper and see if it matches the code output.
2. Every extension must follow one of the three rules: Rule 1, Rule 2 and Rule 3.
Following are the rules applied on five consecutive extensions in some Phase i (i > 5).
Which ones are valid?
A) Rule 1, Rule 2, Rule 2, Rule 3, Rule 3
B) Rule 1, Rule 2, Rule 2, Rule 3, Rule 2
C) Rule 2, Rule 1, Rule 1, Rule 3, Rule 3
D) Rule 1, Rule 1, Rule 1, Rule 1, Rule 1
E) Rule 2, Rule 2, Rule 2, Rule 2, Rule 2
F) Rule 3, Rule 3, Rule 3, Rule 3, Rule 3
3. Which of the above sequences are valid for Phase 5?
4. Every internal node MUST have its suffix link set to another node (internal or root).
Can a newly created node point to an already existing internal node or not? Can it
happen that a new node created in extension j does not get its right suffix link in the
next extension j+1 and gets the right one in later extensions like j+2, j+3 etc.?
5. Try solving the basic problems discussed above.
References:
http://web.stanford.edu/~mjkay/gusfield.pdf
Ukkonen’s suffix tree algorithm in plain English
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/ukkonens-suffix-tree-construction-part-6/
Chapter 52
Chapter 52. Ukkonen’s Suffix Tree Construction – Part 5
• If activeLength is ZERO [activePoint in previous phase was (root, x, 0)], set activeEdge
to the current character (here activeEdge will be ‘a’). This is APCFALZ. Now
activePoint becomes (root, ‘a’, 0).
• Check if there is an edge going out from activeNode (which is root in this phase 7)
for the activeEdge. If not, create a leaf edge. If present, walk down. In our example,
edge ‘a’ is present going out of activeNode (i.e. root), here we increment activeLength
from zero to 1 (APCFER3) and stop any further processing.
• At this point, activePoint is (root, a, 1) and remainingSuffixCount remains set to 1
(no change there)
At the end of phase 7, remainingSuffixCount is 1 (One suffix ‘a’, the last one, is not added
explicitly in tree, but it is there in tree implicitly).
Above Figure 33 is the resulting tree after phase 7.
*********************Phase 8*********************************
In phase 8, we read 8th character (b) from string S
• Check if there is an edge going out from activeNode (which is root in this phase 8)
for the activeEdge. If not, create a leaf edge. If present, walk down. In our example,
edge ‘a’ is present going out of activeNode (i.e. root).
• Do a walk down (The trick 1 – skip/count) if necessary. In current phase 8, no
walk down needed as activeLength < edgeLength. Here activePoint is (root, a, 1) for
extension 7 (remainingSuffixCount = 2)
• Check if current character of string S (which is ‘b’) is already present after the activePoint.
If yes, no more processing (rule 3). This is the case in our example, so we
increment activeLength from 1 to 2 (APCFER3) and we stop here (Rule 3).
• At this point, activePoint is (root, a, 2) and remainingSuffixCount remains set to 2
(no change in remainingSuffixCount)
At the end of phase 8, remainingSuffixCount is 2 (Two suffixes, ‘ab’ and ‘b’, the last two,
are not added explicitly in the tree, but they are in the tree implicitly).
*********************Phase 9*********************************
In phase 9, we read 9th character (c) from string S
• Check if there is an edge going out from activeNode (which is root in this phase 9)
for the activeEdge. If not, create a leaf edge. If present, walk down. In our example,
edge ‘a’ is present going out of activeNode (i.e. root).
• Do a walk down (The trick 1 – skip/count) if necessary. In current phase 9, walk down is
needed as activeLength(2) >= edgeLength(2). During the walk down, activePoint changes
to (Node A, c, 0) based on APCFWD (This is the first time APCFWD is being applied
in our example).
• Check if current character of string S (which is ‘c’) is already present after the activePoint.
If yes, no more processing (rule 3). This is the case in our example, so we
increment activeLength from 0 to 1 (APCFER3) and we stop here (Rule 3).
• At this point, activePoint is (Node A, c, 1) and remainingSuffixCount remains set to
3 (no change in remainingSuffixCount)
At the end of phase 9, remainingSuffixCount is 3 (Three suffixes, ‘abc’, ‘bc’ and ‘c’, the last
three, are not added explicitly in the tree, but they are in the tree implicitly).
*********************Phase 10*********************************
In phase 10, we read 10th character (d) from string S
• Check if there is an edge going out from activeNode (Node A) for the activeEdge(c).
If not, create a leaf edge. If present, walk down. In our example, edge ‘c’ is present
going out of activeNode (Node A).
• Do a walk down (The trick 1 – skip/count) if necessary. In current Extension 7, no
walk down needed as activeLength < edgeLength.
• Check if current character of string S (which is ‘d’) is already present after the activePoint.
If not, rule 2 will apply. In our example, there is no path starting with ‘d’
going out of activePoint, so we create a leaf edge with label ‘d’. Since activePoint ends
in the middle of an edge, we will create a new internal node just after the activePoint
(Rule 2)
• Internal nodes connected via suffix links have exactly the same tree below them, e.g. in
above Figure 40, A and B have the same tree below them; similarly C, D and E have the same
tree below them.
• Due to the above fact, in any extension, when the current activeNode is derived via suffix
link from the previous extension’s activeNode, exactly the same extension logic applies in
the current extension as in the previous extension. (In Phase 10, the same extension logic is applied
in extensions 7, 8 and 9)
• If a new internal node gets created in extension j of any phase i, then this newly
created internal node will get its suffix link set by the end of the next extension j+1 of
the same phase i. E.g. node C got created in extension 7 of phase 10 (Figure 37) and it
got its suffix link set to node D in extension 8 of the same phase 10 (Figure 38). Similarly
node D got created in extension 8 of phase 10 (Figure 38) and it got its suffix link set
to node E in extension 9 of the same phase 10 (Figure 39). Similarly node E got created
in extension 9 of phase 10 (Figure 39) and it got its suffix link set to root in extension
10 of the same phase 10 (Figure 40).
• Based on above fact, every internal node will have a suffix link to some other internal
node or root. Root is not an internal node and it will not have suffix link.
*********************Phase 11*********************************
In phase 11, we read 11th character ($) from string S
• Set END to 11 (This will do extensions 1 to 10) – because we have 10 leaf edges so
far by the end of previous phase 10.
• Increment remainingSuffixCount by 1 (from 0 to 1), i.e. there is only one suffix ‘$’ to
be added in tree.
• Since activeLength is ZERO, activeEdge will change to current character ‘$’ of string
S being processed (APCFALZ).
• There is no edge going out from activeNode root, so a leaf edge with label ‘$’ will be
created (Rule 2).
Now we have added all suffixes of the string ‘abcabxabcd$’ to the suffix tree. There are 11 leaf
ends in this tree, and the labels on the path from root to each leaf end represent one suffix. The
only thing left is to assign a number (suffix index) to each leaf end; that number
is the suffix’s starting position in the string S. This can be done by a DFS traversal
of the tree. During the DFS traversal, keep track of the label length, and when a leaf end is found, set
the suffix index to “stringSize – labelSize + 1”. The indexed suffix tree will look like below:
In the above figure, suffix indices are shown as character positions starting from 1 (not zero
indexed). In the code implementation, the suffix index is zero indexed, i.e. where we see
suffix index j (1 to m for a string of length m) in the above figure, the code implementation
will use j-1 (0 to m-1).
And we are done !!!!
We may think of different data structures which can fulfil these requirements.
In the next Part 6, we will discuss the data structure we will use in our code implementation
and the code as well.
References:
http://web.stanford.edu/~mjkay/gusfield.pdf
Source
https://www.geeksforgeeks.org/ukkonens-suffix-tree-construction-part-5/
Chapter 53
*********************Phase 1*********************************
In Phase 1, we read 1st character (a) from string S
• Set END to 1
• Increment remainingSuffixCount by 1 (remainingSuffixCount will be 1 here, i.e. there
is 1 extension left to be performed)
• Run a loop remainingSuffixCount times (i.e. one time) as below:
– If activeLength is ZERO, set activeEdge to the current character (here activeEdge
will be ‘a’). This is APCFALZ.
Chapter 53. Ukkonen’s Suffix Tree Construction – Part 4
– Check if there is an edge going out from activeNode (which is root in this phase
1) for the activeEdge. If not, create a leaf edge. If present, walk down. In our
example, leaf edge gets created (Rule 2).
– Once extension is performed, decrement the remainingSuffixCount by 1
– At this point, activePoint is (root, a, 0)
At the end of phase 1, remainingSuffixCount is ZERO (All suffixes are added explicitly).
Figure 20 in Part 3 is the resulting tree after phase 1.
*********************Phase 2*********************************
In Phase 2, we read 2nd character (b) from string S
Set END to 2 (This will do extension 1)
Increment remainingSuffixCount by 1 (remainingSuffixCount will be 1 here, i.e. there is 1
extension left to be performed)
Run a loop remainingSuffixCount times (i.e. one time) as below:
• If activeLength is ZERO, set activeEdge to the current character (here activeEdge will
be ‘b’). This is APCFALZ.
• Check if there is an edge going out from activeNode (which is root in this phase 2) for
the activeEdge. If not, create a leaf edge. If present, walk down. In our example, leaf
edge gets created.
• Once extension is performed, decrement the remainingSuffixCount by 1
• At this point, activePoint is (root, b, 0)
At the end of phase 2, remainingSuffixCount is ZERO (All suffixes are added explicitly).
Figure 22 in Part 3 is the resulting tree after phase 2.
*********************Phase 3*********************************
In Phase 3, we read 3rd character (c) from string S
Set END to 3 (This will do extensions 1 and 2)
Increment remainingSuffixCount by 1 (remainingSuffixCount will be 1 here, i.e. there is 1
extension left to be performed)
Run a loop remainingSuffixCount times (i.e. one time) as below:
• If activeLength is ZERO, set activeEdge to the current character (here activeEdge will
be ‘c’). This is APCFALZ.
• Check if there is an edge going out from activeNode (which is root in this phase 3) for
the activeEdge. If not, create a leaf edge. If present, walk down. In our example, leaf
edge gets created.
• Once extension is performed, decrement the remainingSuffixCount by 1
• At this point, activePoint is (root, c, 0)
At the end of phase 3, remainingSuffixCount is ZERO (All suffixes are added explicitly).
Figure 25 in Part 3 is the resulting tree after phase 3.
*********************Phase 4*********************************
In Phase 4, we read 4th character (a) from string S
Set END to 4 (This will do extensions 1, 2 and 3)
Increment remainingSuffixCount by 1 (remainingSuffixCount will be 1 here, i.e. there is 1
extension left to be performed)
Run a loop remainingSuffixCount times (i.e. one time) as below:
• If activeLength is ZERO, set activeEdge to the current character (here activeEdge will
be ‘a’). This is APCFALZ.
• Check if there is an edge going out from activeNode (which is root in this phase 4)
for the activeEdge. If not, create a leaf edge. If present, walk down (The trick 1 –
skip/count). In our example, edge ‘a’ is present going out of activeNode (i.e. root).
No walk down needed as activeLength < edgeLength. We increment activeLength
from zero to 1 (APCFER3) and stop any further processing (Rule 3).
• At this point, activePoint is (root, a, 1) and remainingSuffixCount remains set to 1
(no change there)
At the end of phase 4, remainingSuffixCount is 1 (One suffix ‘a’, the last one, is not added
explicitly in tree, but it is there in tree implicitly).
Figure 28 in Part 3 is the resulting tree after phase 4.
Now that the first four phases have been revisited, we will continue building the tree and
see how it goes.
*********************Phase 5*********************************
In phase 5, we read 5th character (b) from string S
Set END to 5 (This will do extensions 1, 2 and 3). See Figure 29 shown below.
Increment remainingSuffixCount by 1 (remainingSuffixCount will be 2 here, i.e. there are
2 extensions left to be performed, which are extensions 4 and 5. Extension 4 is supposed to
add suffix “ab” and extension 5 is supposed to add suffix “b” in tree)
Run a loop remainingSuffixCount times (i.e. two times) as below:
• Check if there is an edge going out from activeNode (which is root in this phase 5)
for the activeEdge. If not, create a leaf edge. If present, walk down. In our example,
edge ‘a’ is present going out of activeNode (i.e. root).
• Do a walk down (The trick 1 – skip/count) if necessary. In current phase 5, no
walk down needed as activeLength < edgeLength. Here activePoint is (root, a, 1) for
extension 4 (remainingSuffixCount = 2)
• Check if current character of string S (which is ‘b’) is already present after the
activePoint. If yes, no more processing (Rule 3). That is the case in our example, so we
increment activeLength from 1 to 2 (APCFER3) and we stop here (Rule 3).
• At this point, activePoint is (root, a, 2) and remainingSuffixCount remains set to 2
(no change in remainingSuffixCount)
At the end of phase 5, remainingSuffixCount is 2 (Two suffixes, ‘ab’ and ‘b’, the last two,
are not added explicitly in tree, but they are in tree implicitly).
*********************Phase 6*********************************
In phase 6, we read 6th character (x) from string S
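Set END to 6 (This will do extensions 1, 2 and 3)
Increment remainingSuffixCount by 1 (remainingSuffixCount will be 3 here, i.e. there are
3 extensions left to be performed, which are extensions 4, 5 and 6 for suffixes “abx”, “bx”
and “x” respectively)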
• In extension 4, the activePoint is (root, a, 2), which points to ‘b’ on the edge starting
with ‘a’.
• In extension 4, current character ‘x’ from string S doesn’t match the next character
on the edge after activePoint, so this is the case of extension rule 2. So a leaf
edge is created here with edge label x. Also, the traversal ends here in the middle of an edge,
so a new internal node also gets created at the end of activePoint.
• Decrement the remainingSuffixCount by 1 (from 3 to 2) as suffix “abx” is added in tree.
Now activePoint will change after applying rule 2. Three other cases, (APCFER3,
APCFWD and APCFALZ) where activePoint changes, are already discussed in Part 3.
activePoint change for extension rule 2 (APCFER2):
Case 1 (APCFER2C1): If activeNode is root and activeLength is greater than ZERO,
then decrement the activeLength by 1 and set activeEdge to “S[i – remainingSuffixCount
+ 1]”, where i is the current phase number. Why this change in activePoint? Look again
at the extension we just discussed for phase 6 (i=6), where we added suffix
“abx”. There activeLength is 2 and activeEdge is ‘a’. In the next extension, we need to add
suffix “bx” in the tree, i.e. the path label in the next extension should start with ‘b’. So ‘b’ (the
5th character in string S) should be the active edge for the next extension, and the index of b is
“i – remainingSuffixCount + 1” (6 – 2 + 1 = 5). activeLength is decremented by 1 because
activePoint gets closer to root by length 1 after every extension.
What happens if activeNode is root and activeLength is ZERO? This case is already
taken care of by APCFALZ.
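To make the APCFER2C1 arithmetic concrete, here is a minimal Python sketch. The function name apcfer2c1 and its return layout are made up for this illustration; positions are 1-based as in the text, and remainingSuffixCount is assumed to have already been decremented for the extension just completed.

```python
def apcfer2c1(s, phase, remaining_suffix_count, active_length):
    """Active point update after Rule 2 when activeNode is root.

    Hypothetical helper for illustration only. `phase` is 1-based and
    `remaining_suffix_count` is the value after the decrement.
    Returns (activeEdge character, new activeLength).
    """
    assert active_length > 0, "activeLength ZERO is handled by APCFALZ"
    edge_pos = phase - remaining_suffix_count + 1   # 1-based index into S
    return s[edge_pos - 1], active_length - 1


# Phase 6 of "abcabx": suffix "abx" was just added, two suffixes remain.
print(apcfer2c1("abcabx", 6, 2, 2))   # ('b', 1)
```

Calling it again with remainingSuffixCount = 1 and activeLength = 1 gives the edge ‘x’ with activeLength 0, which is the starting point of the last extension of phase 6.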
Case 2 (APCFER2C2): If activeNode is not root, then follow the suffix link from current
activeNode. The new node (which can be root node or another internal node) pointed to
by the suffix link will be the activeNode for next extension. No change in activeLength and
activeEdge. Why this change in activePoint? If two nodes
are connected by a suffix link, then the labels on all paths going down from those two nodes,
starting with the same character, are exactly the same, so for two corresponding
points on those paths, activeEdge and activeLength are the same and the two nodes are
the respective activeNodes. Look at Figure 18 in Part 2. Let’s say in phase i and extension j, suffix
‘xAabcdefg’ was added in tree. At that point, let’s say activePoint was (Node-V, a, 7), i.e.
point ‘g’. So for next extension j+1, we would add suffix ‘Aabcdefg’ and for that we need to
traverse the 2nd path shown in Figure 18. This can be done by following the suffix link from current
activeNode v. The suffix link takes us to a node in the middle of the path to be traversed [Node
s(v)], below which the path is exactly the same as it was below the previous activeNode
v. As said earlier, “activePoint gets closer to root by length 1 after every extension”; this
reduction in length happens above the node s(v), while below s(v) nothing changes. So
when activeNode is not root in current extension, then for next extension, only activeNode
changes (No change in activeEdge and activeLength).
extension’s internal node goes to root (as no new internal node created in current
extension 6).
• Decrement the remainingSuffixCount by 1 (from 1 to 0) as suffix “x” added in tree
• A newly created internal node in extension i, points to another internal node or root
(if activeNode is root in extension i+1) by the end of extension i+1 via suffix link
(Every internal node MUST have a suffix link pointing to another internal node or
root)
• A suffix link provides a shortcut to the end of the path label of the next suffix
• With proper tracking of activePoints between extensions/phases, unnecessary walk-down
from root can be avoided.
We will go through rest of the phases (7 to 11) in Part 5 and build the tree completely and
after that, we will see the code for the algorithm in Part 6.
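Before moving on, the bookkeeping walked through above can be put together in a compact Python sketch. This is not the Part 6 implementation; it is an illustrative assumption of how the pieces fit. The names Node and trace_phases are made up for the sketch, indices are zero-based, and one-element lists stand in for the shared END variable. It records remainingSuffixCount and the activePoint at the end of every phase.

```python
class Node:
    def __init__(self, start, end, link=None):
        self.start = start        # index in S of the first character on the incoming edge
        self.end = end            # one-element list; all leaves share the global END cell
        self.children = {}        # first character of outgoing edge -> child Node
        self.suffix_link = link


def trace_phases(s):
    """Build the implicit suffix tree of s, recording the state after each phase."""
    root = Node(-1, [-1])
    leaf_end = [-1]                       # global END shared by every leaf edge
    node, edge, length = root, 0, 0       # activeNode, activeEdge (index into s), activeLength
    remaining = 0                         # remainingSuffixCount
    trace = []
    for pos, ch in enumerate(s):          # this iteration is phase pos + 1
        leaf_end[0] = pos                 # trick 3: Rule 1 on all leaves at once
        remaining += 1
        last_new = None                   # internal node still waiting for its suffix link
        while remaining > 0:
            if length == 0:
                edge = pos                # APCFALZ
            first = s[edge]
            child = node.children.get(first)
            if child is None:             # Rule 2: new leaf edge from activeNode
                node.children[first] = Node(pos, leaf_end)
                if last_new is not None:
                    last_new.suffix_link = node
                    last_new = None
            else:
                edge_len = child.end[0] - child.start + 1
                if length >= edge_len:    # APCFWD: skip/count walk down
                    edge += edge_len
                    length -= edge_len
                    node = child
                    continue
                if s[child.start + length] == ch:   # Rule 3: stop this phase
                    if last_new is not None and node is not root:
                        last_new.suffix_link = node
                    length += 1           # APCFER3
                    break
                # Rule 2 in the middle of an edge: new internal node plus new leaf
                split = Node(child.start, [child.start + length - 1], root)
                node.children[first] = split
                split.children[ch] = Node(pos, leaf_end)
                child.start += length
                split.children[s[child.start]] = child
                if last_new is not None:
                    last_new.suffix_link = split
                last_new = split
            remaining -= 1
            if node is root and length > 0:   # APCFER2C1
                length -= 1
                edge = pos - remaining + 1
            elif node is not root:            # APCFER2C2
                node = node.suffix_link
        trace.append((pos + 1, remaining, node is root, s[edge], length))
    return trace


for row in trace_phases("abcabx"):
    print(row)   # (phase, remainingSuffixCount, activeNode is root, activeEdge, activeLength)
```

Each printed row can be compared against the activePoint and remainingSuffixCount stated at the end of the corresponding phase above.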
References:
http://web.stanford.edu/~mjkay/gusfield.pdf
Ukkonen’s suffix tree algorithm in plain English
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/ukkonens-suffix-tree-construction-part-4/
Chapter 54
Chapter 54. Ukkonen’s Suffix Tree Construction – Part 3
• Rule 3 ends the current phase (when current character is found in current edge being
traversed)
• Phase 1 will read first character from the string, will go through 1 extension.
(In figures, we are showing characters on edge labels just for explanation,
while writing code, we will only use start and end indices – The Edge-label
compression discussed in Part 2)
Extension 1 will add suffix “a” in tree. We start from root and traverse path with
label ‘a’. There is no path from root, going out with label ‘a’, so create a leaf edge
(Rule 2).
Phase 1 completes with the completion of extension 1 (As a phase i has at most i
extensions)
For any string, Phase 1 will have only one extension and it will always follow Rule 2.
• Phase 2 will read second character, will go through at least 1 and at most 2 extensions.
In our example, phase 2 will read second character ‘b’. Suffixes to be added are “ab”
and “b”.
Extension 1 adds suffix “ab” in tree.
Path for label ‘a’ ends at leaf edge, so add ‘b’ at the end of this edge.
Extension 1 just increments the end index by 1 (from 1 to 2) on first edge (Rule 1).
Extension 2 adds suffix “b” in tree. There is no path from root, going out with label
‘b’, so creates a leaf edge (Rule 2).
Extension 3 adds suffix “c” in tree. There is no path from root, going out with label
‘c’, so creates a leaf edge (Rule 2).
1. At the end of any phase i, there are at most i leaf edges (if ith character is not seen so
far, there will be i leaf edges, else there will be less than i leaf edges).
e.g. After phases 1, 2 and 3 in our example, there are 1, 2 and 3 leaf edges respectively,
but after phase 4, there are 3 leaf edges only (not 4).
2. After completing phase i, “end” indices of all leaf edges are i. How do we implement
this in code? Do we need to iterate through all those extensions, find leaf edges by
traversing from root to leaf and increment the “end” index? Answer is “NO”.
For this, we will maintain a global variable (say “END”) and we will just increment this
global variable “END” and all leaf edge end indices will point to this global variable.
So this way, if we have j leaf edges after phase i, then in phase i+1, first j extensions
(1 to j) will be done by just incrementing variable “END” by 1 (END will be i+1 at
the point).
Here we just implemented the trick 3 – Once a leaf, always a leaf. This trick
processes all the j leaf edges (i.e. extension 1 to j) using rule 1 in a constant time in
any phase. Rule 1 will not apply to subsequent extensions in the same phase. This
can be verified in the four phases we discussed above. If at all Rule 1 applies in any
phase, it only applies in the initial few extensions continuously (say 1 to j). Rule 1 never
applies later in a given phase once Rule 2 or Rule 3 is applied in that phase.
3. In the example explained so far, in each extension (where trick 3 is not applied) of
any phase to add a suffix in tree, we are traversing from root by matching path labels
against the suffix being added. If there are j leaf edges after phase i, then in phase
i+1, first j extensions will follow Rule 1 and will be done in constant time using trick
3. There are i+1-j extensions yet to be performed. For these extensions, which node
(root or some other internal node) to start from and which path to go? Answer to this
depends on how previous phase i is completed.
If previous phase i went through all the i extensions (when ith character is unique so
far), then in next phase i+1, trick 3 will take care of first i suffixes (the i leaf edges)
and then extension i+1 will start from root node and it will insert just one character
[(i+1)th ] suffix in tree by creating a leaf edge using Rule 2.
If previous phase i completes early (and this will happen if and only if rule 3 applies
– when ith character is already seen before), say at jth extension (i.e. rule 3 is applied
at jth extension), then there are j-1 leaf edges so far.
We will state few more facts (which may be a repeat, but we want to make sure it’s
clear to you at this point) here based on discussion so far:
• Phase 1 starts with Rule 2, all other phases start with Rule 1
• Any phase ends with either Rule 2 or Rule 3
• Any phase i may go through a series of j extensions (1 <= j <= i). In these j
extensions, first p (0 <= p < i) extensions will follow Rule 1, next q (0 <= q
<= i-p) extensions will follow Rule 2 and next r (0<= r <= 1) extensions will
follow Rule 3. The order in which Rule 1, Rule 2 and Rule 3 apply, is never
intermixed in a phase. They apply in order of their number (if at all applied), i.e.
in a phase, Rule 1 applies 1st, then Rule 2 and then Rule 3
• In a phase i, p + q + r <= i
• At the end of any phase i, there will be p+q leaf edges and next phase i+1 will
go through Rule 1 for first p+q extensions
In the next phase i+1, trick 3 (Rule 1) will take care of first j-1 suffixes (the j-1 leaf
edges), then extension j will start where we will add jth suffix in tree. For this, we
need to find the best possible matching edge and then add new character at the end
of that edge. How to find the end of best matching edge? Do we need to traverse
from root node and match tree edges against the jth suffix being added character by
character? This will take time and overall algorithm will not be linear. activePoint
comes to the rescue here.
In previous phase i, during the jth extension, path traversal ended at a point (which could
be an internal node or some point in the middle of an edge) where the ith character
being added was already found in the tree and Rule 3 applied. The jth extension of phase i+1
will start exactly from that point, and we start matching the path against the (i+1)th
character. activePoint helps to avoid unnecessary path traversal from root in any
extension based on the knowledge gained in traversals done in previous extension.
There is no traversal needed in 1st p extensions where Rule 1 is applied. Traversal
is done where Rule 2 or Rule 3 gets applied and that’s where activePoint tells the
starting point for traversal where we match the path against the current character
being added in tree. Implementation is done in such a way that, in any extension where
we need a traversal, activePoint is already set to the right location (with one exception case
APCFALZ discussed below) and at the end of current extension, we reset activePoint
as appropriate so that in the next extension (of same phase or next phase) where a traversal
is required, activePoint already points to the right place.
activePoint: This could be root node, any internal node or any point in the
middle of an edge. This is the point where traversal starts in any extension. For
the 1st extension of phase 1, activePoint is set to root. Other extension will get
activePoint set correctly by previous extension (with one exception case APCFALZ
discussed below) and it is the responsibility of current extension to reset activePoint
appropriately at the end, to be used in next extension where Rule 2 or Rule 3 is
applied (of same or next phase).
To accomplish this, we need a way to store activePoint. We will store this using three
variables: activeNode, activeEdge, activeLength.
activeNode: This could be root node or an internal node.
activeEdge: When we are on root node or internal node and we need to walk down,
we need to know which edge to choose. activeEdge will store that information. In
case, activeNode itself is the point from where traversal starts, then activeEdge will
be set to next character being processed in next phase.
activeLength: This tells how many characters we need to walk down (on the path
represented by activeEdge) from activeNode to reach the activePoint where traversal
starts. In case, activeNode itself is the point from where traversal starts, then
activeLength will be ZERO.
After phase i, if there are j leaf edges then in phase i+1, first j extensions will be
done by trick 3. activePoint will be needed for the extensions from j+1 to i+1 and
activePoint may or may not change between two extensions depending on the point
where previous extension ends.
activePoint change for extension rule 3 (APCFER3): When rule 3 applies in
any phase i, then before we move on to next phase i+1, we increment activeLength
by 1. There is no change in activeNode and activeEdge. Why? Because in case of
rule 3, the current character from string S is matched on the same path represented
by current activePoint, so for next activePoint, activeNode and activeEdge remain
the same, only activeLength is increased by 1 (because of matched character in current
phase). This new activePoint (same node, same edge and incremented length) will be
used in phase i+1.
activePoint change for walk down (APCFWD): activePoint may change at the
end of an extension based on extension rule applied. activePoint may also change
during the extension when we do walk down. Let’s consider an activePoint (A, s,
11) in the above activePoint example figure. If this is the activePoint at the start of
some extension, then while walking down from activeNode A, other internal nodes will
be seen. Whenever we encounter an internal node during walk down, that node
becomes the activeNode (activeEdge and activeLength change as appropriate so that the
new activePoint represents the same point as earlier). In this walk down, below is the
sequence of changes in activePoint:
(A, s, 11) -> (B, w, 7) -> (C, a, 3)
All above three activePoints refer to same point ‘c’
Let’s take another example.
If activePoint is (D, a, 11) at the start of an extension, then while walk down, below
is the sequence of changes in activePoint:
(D, a, 10) -> (E, d, 7) -> (F, f, 5) -> (G, j, 1)
All above activePoints refer to same point ‘k’.
If activePoints are (A, s, 3), (A, t, 5), (B, w, 1), (D, a, 2) etc when no internal node
comes in the way while walk down, then there will be no change in activePoint for
APCFWD.
The idea is that, at any time, the closest internal node to the point we want
to reach should be the activeNode. Why? This will minimize the length of traversal
in the next extension.
activePoint change for Active Length ZERO (APCFALZ): Let’s consider an
activePoint (A, s, 0) in the above activePoint example figure. And let’s say current
character being processed from string S is ‘x’ (or any other character). At the start
of extension, when activeLength is ZERO, activeEdge is set to the current character
being processed, i.e. ‘x’, because there is no walk down needed here (as activeLength
is ZERO) and so next character we look for is current character being processed.
4. In the code implementation, we will loop through all the characters of string S one by
one. The loop iteration for the ith character does the processing for phase i. Within a phase,
an inner loop runs one or more times depending on how many extensions are left to be
performed (Please note that in a phase i+1, we don’t really have to perform all i+1 extensions
explicitly, as trick 3 will take care of j extensions for all j leaf edges coming from previous phase i).
We will use a variable remainingSuffixCount, to track how many extensions are yet
to be performed explicitly in any phase (after trick 3 is performed). Also, at the end
of any phase, if remainingSuffixCount is ZERO, this tells that all suffixes supposed
to be added in tree, are added explicitly and present in tree. If remainingSuffixCount
is non-zero at the end of any phase, that tells that suffixes of that many count are
not added in tree explicitly (because of rule 3, we stopped early), but they are in tree
implicitly though (Such trees are called implicit suffix tree). These implicit suffixes
will be added explicitly in subsequent phases when a unique character comes in the
way.
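The meaning of remainingSuffixCount at the end of a phase can be cross-checked by brute force. The sketch below assumes the reading that a suffix of S[1..i] stays implicit exactly when the same string already starts at an earlier position of S[1..i]; the function name implicit_suffix_count is made up for this illustration, and this is a check on the idea, not part of Ukkonen’s algorithm.

```python
def implicit_suffix_count(s, i):
    """Length of the longest suffix of S[1..i] that also starts earlier in S[1..i].

    Under the assumption stated above, this equals remainingSuffixCount
    at the end of phase i (i is 1-based).
    """
    prefix = s[:i]
    for length in range(i - 1, 0, -1):
        start = i - length                        # 0-based start of this suffix
        if prefix.find(prefix[start:]) < start:   # same string starts earlier
            return length
    return 0


s = "abcabxabcd"
counts = [implicit_suffix_count(s, i) for i in range(1, len(s) + 1)]
print(counts)   # [0, 0, 0, 1, 2, 0, 1, 2, 3, 0]
```

The non-zero entries are the phases that stop early because of Rule 3; the count drops back to zero as soon as a character that has not been seen in that context (‘x’, then ‘d’) arrives.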
We will continue our discussion in Part 4 and Part 5. Code implementation will be discussed
in Part 6.
References:
http://web.stanford.edu/~mjkay/gusfield.pdf
Ukkonen’s suffix tree algorithm in plain English
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/ukkonens-suffix-tree-construction-part-3/
Chapter 55
Chapter 55. Ukkonen’s Suffix Tree Construction – Part 2
In extension j of some phase i, if a new internal node v with path-label xA is added, then
in extension j+1 in the same phase i:
• Either the path labelled A already ends at an internal node (or root node if A is
empty)
• OR a new internal node at the end of string A will be created
In extension j+1 of same phase i, we will create a suffix link from the internal node created
in jth extension to the node with path labelled A.
So in a given phase, any newly created internal node (with path-label xA) will have a suffix
link from it (pointing to another node with path-label A) by the end of the next extension.
In any implicit suffix tree Ti after phase i, if internal node v has path-label xA, then there
is a node s(v) in Ti with path-label A and node v will point to node s(v) using suffix link.
At any time, all internal nodes in the changing tree will have suffix links from them to
another internal node (or root) except for the most recently added internal node, which will
receive its suffix link by the end of the next extension.
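The existence of s(v) can be checked by brute force on a small string. The sketch below assumes a string that ends with a unique terminal symbol, so that every internal node is explicit, and uses the fact that a non-empty string is the path-label of an internal node exactly when its occurrences are followed by at least two different characters. The function name internal_path_labels is made up for this illustration.

```python
def internal_path_labels(text):
    """Path-labels of internal nodes (root excluded) of the suffix tree of text.

    Assumes text ends with a unique terminal symbol such as '$'.
    """
    followers = {}
    for i in range(len(text)):
        for j in range(i + 1, len(text)):
            # substring text[i:j] is followed by character text[j]
            followers.setdefault(text[i:j], set()).add(text[j])
    return {w for w, chars in followers.items() if len(chars) >= 2}


labels = internal_path_labels("xabxa$")
print(sorted(labels))   # ['a', 'xa']

# Every internal node with path-label xA has a target with path-label A
# (the root when A is empty).
print(all(w[1:] == "" or w[1:] in labels for w in labels))   # True
```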
How suffix links are used to speed up the implementation?
In extension j of phase i+1, we need to find the end of the path from the root labelled
S[j..i] in the current tree. One way is start from root and traverse the edges matching S[j..i]
string. Suffix links provide a short cut to find end of the path.
So we can see that, to find end of path S[j..i], we need not traverse from root. We can start
from the end of path S[j-1..i], walk up one edge to node v (i.e. go to parent node), follow
the suffix link to s(v), then walk down the path y (which is abcd here in Figure 17).
This shows that using suffix links is an improvement over traversing from the root.
Note: In the next part 3, we will introduce activePoint which will help to avoid “walk up”.
We can directly go to node s(v) from node v.
When there is a suffix link from node v to node s(v), then if there is a path labelled with
string y from node v to a leaf, then there must be a path labelled with string y from node
s(v) to a leaf. In Figure 17, there is a path labelled “abcd” from node v to a leaf, so there
is a path with the same label “abcd” from node s(v) to a leaf.
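This fact can also be checked directly on the string, without building a tree: whatever follows an occurrence of xA up to the end of the string also follows an occurrence of A. A small sketch, where the helper name paths_below is made up for this illustration:

```python
def paths_below(text, label):
    """All strings that run from an occurrence of `label` to the end of text."""
    return {text[k + len(label):]
            for k in range(len(text)) if text.startswith(label, k)}


text = "xabxa$"
below_v = paths_below(text, "xa")    # node v with path-label xA = "xa"
below_sv = paths_below(text, "a")    # node s(v) with path-label A = "a"
print(sorted(below_v), sorted(below_sv))
print(below_v <= below_sv)           # True
```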
This fact can be used to improve the walk from s(v) to leaf along the path y. This is called
“skip/count” trick.
Skip/Count Trick
When walking down from node s(v) to leaf, instead of matching path character by character
as we travel, we can directly skip to the next node if number of characters on the edge
is less than the number of characters we need to travel. If number of characters on the
edge is more than the number of characters we need to travel, we directly skip to the last
character on that edge.
If the implementation is such that the number of characters on any edge and the character at a
given position in string S can be obtained in constant time, then the skip/count trick will
do the walk down in time proportional to the number of nodes on the path rather than the
number of characters on it.
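A minimal sketch of the skip/count walk down is shown below. The Node class, the skip_count function and the tiny hand-built chain of edges are all assumptions made for this illustration; edge labels are stored as index pairs into the text, so an edge length is available in constant time.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.edges = {}   # first character -> (start, end, child); label is text[start:end]


def skip_count(text, node, lo, hi):
    """Walk down from `node` along text[lo:hi], which is assumed to be present.

    Only the first character of each edge is looked at; the rest is skipped
    by length. Returns (node, first_char, offset, hops): the walk ends
    `offset` characters into the edge of `node` that starts with `first_char`
    (first_char is None and offset is 0 when it ends exactly on a node).
    """
    hops = 0
    while lo < hi:
        start, end, child = node.edges[text[lo]]
        edge_len = end - start
        if hi - lo < edge_len:        # the walk ends inside this edge
            return node, text[lo], hi - lo, hops
        lo += edge_len                # skip the whole edge in one step
        node = child
        hops += 1
    return node, None, 0, hops


# Hypothetical path: root -"ab"-> n1 -"cde"-> n2 -"fgh"-> leaf
text = "abcdefgh"
root, n1, n2, leaf = Node("root"), Node("n1"), Node("n2"), Node("leaf")
root.edges["a"] = (0, 2, n1)
n1.edges["c"] = (2, 5, n2)
n2.edges["f"] = (5, 8, leaf)

node, ch, offset, hops = skip_count(text, root, 0, 6)
print(node.name, ch, offset, hops)   # n2 f 1 2
```

Walking six characters down costs two hops plus one final edge lookup here, instead of six character comparisons.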
Using suffix links along with the skip/count trick, suffix tree can be built in O(m^2) as there are
m phases and each phase takes O(m) time.
There are two observations about the way extension rules interact in successive extensions
and phases. These two observations lead to two more implementation tricks (first trick
“skip/count” is seen already while walk down).
Observation 1: Rule 3 is show stopper
In a phase i, there are i extensions (1 to i) to be done.
When rule 3 applies in any extension j of phase i+1 (i.e. path labelled S[j..i] continues
with character S[i+1]), then it will also apply in all further extensions of same phase (i.e.
extensions j+1 to i+1 in phase i+1). That’s because if path labelled S[j..i] continues with
character S[i+1], then path labelled S[j+1..i], S[j+2..i], S[j+3..i],…, S[i..i] will also continue
with character S[i+1].
Consider Figure 11, Figure 12 and Figure 13 in Part 1 where Rule 3 is applied.
In Figure 11, “xab” is added in tree and in Figure 12 (Phase 4), we add next character “x”.
In this, 3 extensions are done (which adds 3 suffixes). Last suffix “x” is already present in
tree.
In Figure 13, we add character “a” in tree (Phase 5). First 3 suffixes are added in tree and
last two suffixes “xa” and “a” are already present in tree. This shows that if suffix S[j..i]
present in tree, then ALL the remaining suffixes S[j+1..i], S[j+2..i], S[j+3..i],…, S[i..i] will
also be there in tree and no work needed to add those remaining suffixes.
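Observation 1 can be checked by brute force. In the sketch below (the function name rule3_applies is made up for this illustration, and positions are 1-based as in the text), Rule 3 applies in extension j of phase i+1 when S[j..i] followed by S[i+1] already occurs inside S[1..i].

```python
def rule3_applies(s, j, i):
    """True if path S[j..i] continues with S[i+1] in the tree for S[1..i]."""
    return s[j - 1:i + 1] in s[:i]


s = "xabxa"
for i in range(1, len(s)):              # phase i + 1 reads character S[i+1]
    flags = [rule3_applies(s, j, i) for j in range(1, i + 2)]
    print("phase", i + 1, flags)
```

In every phase the flags are a run of False followed by a run of True: once Rule 3 applies in some extension, it applies in all later extensions of that phase.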
So no more work needed to be done in any phase as soon as rule 3 applies in any extension
in that phase. If a new internal node v gets created in extension j and rule 3 applies in next
extension j+1, then we need to add suffix link from node v to current node (if we are on
internal node) or root node. ActiveNode, which will be discussed in part 3, will help while
setting suffix links.
Trick 2
Stop the processing of any phase as soon as rule 3 applies. All further extensions are already
present in tree implicitly.
Observation 2: Once a leaf, always a leaf
Once a leaf is created and labelled j (for suffix starting at position j in string S), then this
leaf will always be a leaf in successive phases and extensions. Once a leaf is labelled as j,
extension rule 1 will always apply to extension j in all successive phases.
Consider Figure 9 to Figure 14 in Part 1.
In Figure 10 (Phase 2), Rule 1 is applied on leaf labelled 1. After this, in all successive
phases, rule 1 is always applied on this leaf.
In Figure 11 (Phase 3), Rule 1 is applied on leaf labelled 2. After this, in all successive
phases, rule 1 is always applied on this leaf.
In Figure 12 (Phase 4), Rule 1 is applied on leaf labelled 3. After this, in all successive
phases, rule 1 is always applied on this leaf.
In any phase i, there is an initial sequence of consecutive extensions where rule 1 or rule 2
are applied and then as soon as rule 3 is applied, phase i ends.
Also rule 2 creates a new leaf always (and internal node sometimes).
If J_i represents the last extension in phase i when rule 1 or 2 was applied (i.e. after ith phase,
there will be J_i leaves labelled 1, 2, 3, …, J_i), then J_i <= J_(i+1).
J_i will be equal to J_(i+1) when no new leaf is created in phase i+1 (i.e. rule 3 is applied
in extension J_i + 1).
In Figure 11 (Phase 3), Rule 1 is applied in 1st two extensions and Rule 2 is applied in 3rd
extension, so here J_3 = 3.
In Figure 12 (Phase 4), no new leaf created (Rule 1 is applied in 1st 3 extensions and then
rule 3 is applied in 4th extension which ends the phase). Here J_4 = 3 = J_3.
In Figure 13 (Phase 5), no new leaf created (Rule 1 is applied in 1st 3 extensions and then
rule 3 is applied in 4th extension which ends the phase). Here J_5 = 3 = J_4.
J_i will be less than J_(i+1) when a few new leaves are created in phase i+1.
In Figure 14 (Phase 6), new leaves are created (Rule 1 is applied in 1st 3 extensions and then rule
2 is applied in last 3 extensions, which ends the phase). Here J_6 = 6 > J_5.
So we can see that in phase i+1, only rule 1 will apply in extensions 1 to J_i (which really
doesn’t need much work, can be done in constant time and that’s the trick 3); from extension
J_i + 1 onwards, rule 2 may apply to zero or more extensions and then finally rule 3, which
ends the phase.
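The J values quoted above can be reproduced with a brute-force sketch. It assumes that extension j of phase i falls under Rule 3 exactly when S[j..i] already starts at an earlier position of S[1..i]; the function name last_leaf_extension is made up for this illustration.

```python
def last_leaf_extension(s, i):
    """J_i: the last extension of phase i (1-based) that used Rule 1 or Rule 2."""
    prefix = s[:i]
    for j in range(1, i + 1):
        # The suffix starting at position j already occurs earlier: Rule 3 at extension j.
        if prefix.find(prefix[j - 1:]) < j - 1:
            return j - 1
    return i                         # no Rule 3 in this phase


s = "xabxac"
J = [last_leaf_extension(s, i) for i in range(1, len(s) + 1)]
print(J)   # [1, 2, 3, 3, 3, 6]
```

The sequence never decreases, and it stays flat in phases 4 and 5, where Rule 3 ends the phase without creating a new leaf.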
Now edge labels are represented using two indices (start, end), for any leaf edge, end will
always be equal to phase number i.e. for phase i, end = i for leaf edges, for phase i+1, end
= i+1 for leaf edges.
Trick 3
In any phase i, leaf edges may look like (p, i), (q, i), (r, i), …, where p, q, r are the starting
positions of different edges and i is the end position of all. Then in phase i+1, these leaf edges
will look like (p, i+1), (q, i+1), (r, i+1), …. This way, in each phase, the end position has to
be incremented in all leaf edges. Done naively, we would need to traverse all leaf edges and
increment the end position for each. To do the same thing in constant time, maintain a global
index e, which is equal to the phase number. Now leaf edges look like (p, e), (q, e),
(r, e), …. In any phase, just increment e and the extension on all leaf edges is done. Figure
19 shows this.
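One way to get this effect in code is to let every leaf edge hold a reference to the same mutable object. The sketch below is only an illustration of that idea; the class names End and LeafEdge are assumptions, and positions are 1-based as in the text.

```python
class End:
    """Mutable cell standing in for the global index e."""
    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, start, end):
        self.start = start     # 1-based start position of the edge label
        self.end = end         # the shared End object, not a copy of its value

    def label(self, s):
        return s[self.start - 1:self.end.value]


s = "abc"
e = End()
leaves = []
for phase in range(1, len(s) + 1):
    e.value = phase                      # one assignment extends every existing leaf edge
    leaves.append(LeafEdge(phase, e))    # "abc" has no repeated character, so each phase adds a leaf

labels = [leaf.label(s) for leaf in leaves]
print(labels)   # ['abc', 'bc', 'c']
```

No leaf edge is ever visited to update its end position; reading the label simply looks up the current value of e.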
So using suffix links and tricks 1, 2 and 3, a suffix tree can be built in linear time.
Tree Tm could be implicit tree if a suffix is prefix of another. So we can add a $ terminal
symbol first and then run algorithm to get a true suffix tree (A true suffix tree contains all
suffixes explicitly). To label each leaf with corresponding suffix starting position (all leaves
are labelled as global index e), a linear time traversal can be done on tree.
At this point, we have gone through most of the things we needed to know to create suffix
tree using Ukkonen’s algorithm. In next Part 3, we will take string S = “abcabxabcd” as an
example and go through all the things step by step and create the tree. While building the
tree, we will discuss few more implementation issues which will be addressed by ActivePoints.
We will continue to discuss the algorithm in Part 4 and Part 5. Code implementation will
be discussed in Part 6.
References:
http://web.stanford.edu/~mjkay/gusfield.pdf
This article is contributed by Anurag Singh.
Source
https://www.geeksforgeeks.org/ukkonens-suffix-tree-construction-part-2/
Chapter 56
Concatenation of the edge-labels on the path from the root to leaf i gives the suffix of S that
starts at position i, i.e. S[i…m].
Note: Position starts with 1 (it’s not zero indexed, but later, in the code implementation,
we will use zero-indexed positions)
Chapter 56. Ukkonen’s Suffix Tree Construction – Part 1
For string S = xabxac with m = 6, suffix tree will look like following:
It has one root node and two internal nodes and 6 leaf nodes.
String Depth of red path is 1 and it represents suffix c starting at position 6
String Depth of blue path is 4 and it represents suffix bxac starting at position 3
String Depth of green path is 2 and it represents suffix ac starting at position 5
String Depth of orange path is 6 and it represents suffix xabxac starting at position 1
Edges with labels a (green) and xa (orange) are non-leaf edge (which ends at an internal
node). All other edges are leaf edge (ends at a leaf)
If one suffix of S matches a prefix of another suffix of S (when the last character is not unique
in the string), then the path for the first suffix would not end at a leaf.
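The condition above can be checked directly on a small string. The following is a minimal Python sketch, not part of the original article; the helper name suffixes_ending_inside is hypothetical. It lists the suffixes that are a prefix of some other suffix, i.e. the suffixes whose path would not end at a leaf.

```python
def suffixes_ending_inside(s):
    """Return the suffixes of s that are a proper prefix of another suffix.

    Such suffixes do not get a leaf of their own in the (implicit) suffix tree.
    This brute-force check is quadratic and meant only for illustration.
    """
    sufs = [s[i:] for i in range(len(s))]
    return [a for a in sufs
            if any(b != a and b.startswith(a) for b in sufs)]


if __name__ == "__main__":
    print(suffixes_ending_inside("xabxa"))   # suffixes hidden inside other suffixes
    print(suffixes_ending_inside("xabxac"))  # last character is unique
    print(suffixes_ending_inside("xabxa$"))  # terminal symbol appended
```

Appending a terminal symbol that occurs nowhere else in the string makes the returned list empty, which is why the terminal symbol $ is added before building a true suffix tree.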
For String S = xabxa, with m = 5, following is the suffix tree:
This takes O(m^2) time to build the suffix tree for the string S of length m.
Following are a few steps to build the suffix tree for the string “xabxa$” based on the above
algorithm:
• Remove all terminal symbol $ from the edge labels of the tree,
• Remove any edge that has no label
• Remove any node that has only one edge going out of it and merge the edges.
In the next parts (Part 2, Part 3, Part 4 and Part 5), we will discuss suffix links, active points,
a few tricks and finally the code implementation (Part 6).
References:
http://web.stanford.edu/~mjkay/gusfield.pdf
This article is contributed by Anurag Singh. Please write comments if you find anything
incorrect, or you want to share more information about the topic discussed above
Source
https://www.geeksforgeeks.org/ukkonens-suffix-tree-construction-part-1/
Chapter 57
Chapter 57. Pattern Searching using a Trie of all Suffixes
Let us consider an example text “banana\0” where ‘\0’ is string termination character.
Following are all suffixes of “banana\0”
banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0
If we consider all of the above suffixes as individual words and build a Trie, we get following.
C++
#include<list>
#define MAX_CHAR 256
using namespace std;
/* Prints all occurrences of pat in the Suffix Trie S (built for text)*/
void SuffixTrie::search(string pat)
{
return 0;
}
Java
import java.util.LinkedList;
import java.util.List;
class SuffixTrieNode {
List<Integer> indexes;
SuffixTrieNode() // Constructor
{
// Create an empty linked list for indexes of
// suffixes starting from this node
indexes = new LinkedList<Integer>();
Output:
Time Complexity of the above search function is O(m+k) where m is length of the pattern
and k is the number of occurrences of pattern in text.
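The idea can be sketched compactly in Python. This is a minimal illustration under our own assumptions, not the article's implementation: children are kept in a dict instead of a fixed array of 256 pointers, and every node stores the start index of each suffix that passes through it, so search() can return the stored list directly.

```python
class SuffixTrieNode:
    def __init__(self):
        self.children = {}   # character -> SuffixTrieNode
        self.indexes = []    # start indexes of suffixes passing through this node


def build_suffix_trie(txt):
    """Insert every suffix of txt into a trie (quadratic time and space)."""
    root = SuffixTrieNode()
    for start in range(len(txt)):
        node = root
        for ch in txt[start:]:
            node = node.children.setdefault(ch, SuffixTrieNode())
            node.indexes.append(start)
    return root


def search(root, pat):
    """Return the start indexes of all occurrences of pat (empty list if none)."""
    node = root
    for ch in pat:
        node = node.children.get(ch)
        if node is None:
            return []
    return list(node.indexes)


if __name__ == "__main__":
    trie = build_suffix_trie("geeksforgeeks.org")
    for pattern in ("ee", "geek", "quiz"):
        print(pattern, search(trie, pattern))
```

Walking the pattern takes O(m) steps and copying the stored indexes takes O(k), which matches the O(m+k) bound stated above. The price is the quadratic size of the trie.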
This article is contributed by Ashish Anand. Please write comments if you find anything
incorrect, or you want to share more information about the topic discussed above.
Improved By : smodi2007
Source
https://www.geeksforgeeks.org/pattern-searching-using-trie-suffixes/
Chapter 58
This problem is slightly different from the standard pattern searching problem: here we need to
search for anagrams as well. Therefore, we cannot directly apply standard pattern searching
algorithms like KMP, Rabin Karp, Boyer Moore, etc.
A simple idea is to modify the Rabin Karp algorithm. For example, we can keep the hash value
as the sum of ASCII values of all characters modulo a big prime number. For every
character of text, we can add the current character to the hash value and subtract the first
character of the previous window. This solution looks good, but like standard Rabin Karp, the
worst case time complexity of this solution is O(mn). The worst case occurs when all hash
values match and we compare all characters one by one.
Chapter 58. Anagram Substring Search (Or Search for all permutations)
We can achieve O(n) time complexity under the assumption that alphabet size is fixed which
is typically true as we have maximum 256 possible characters in ASCII. The idea is to use
two count arrays:
1) The first count array stores frequencies of characters in pattern.
2) The second count array stores frequencies of characters in current window of text.
The important thing to note is that the time complexity to compare two count arrays is O(1), as
the number of elements in them is fixed (independent of pattern and text sizes). Following are
the steps of this algorithm.
1) Store counts of frequencies of pattern in first count array countP[]. Also store counts of
frequencies of characters in first window of text in array countTW[].
2) Now run a loop from i = M to N-1. Do following in loop.
…..a) If the two count arrays are identical, we found an occurrence.
…..b) Increment count of current character of text in countTW[]
…..c) Decrement count of first character in previous window in countTW[]
3) The last window is not checked by above loop, so explicitly check it.
Following is the implementation of above algorithm.
C++
(countP[pat[i]])++;
(countTW[txt[i]])++;
}
Java
return false;
return true;
}
Python3
MAX=256
M = len(pat)
N = len(txt)
for i in range(MAX):
countP.append(0)
countTW = []
for i in range(MAX):
countTW.append(0)
for i in range(M):
(countP[ ord(pat[i]) ]) += 1
(countTW[ ord(txt[i]) ]) += 1
Output:
Found at Index 0
Found at Index 5
Found at Index 6
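For reference, the steps of the count-array algorithm can be put together in one short Python function. This is a sketch under assumptions: the function name anagram_search and the test strings are ours (the driver strings are not visible in the listings above), and characters are assumed to have codes below 256.

```python
MAX = 256  # assumed alphabet size (extended ASCII)


def anagram_search(pat, txt):
    """Return start indexes of all windows of txt that are anagrams of pat."""
    M, N = len(pat), len(txt)
    if M == 0 or M > N:
        return []

    count_p = [0] * MAX    # frequencies of characters in the pattern
    count_tw = [0] * MAX   # frequencies of characters in the current text window
    for i in range(M):
        count_p[ord(pat[i])] += 1
        count_tw[ord(txt[i])] += 1

    found = []
    for i in range(M, N):
        # Compare the counts of the window that ends just before position i.
        if count_p == count_tw:
            found.append(i - M)
        count_tw[ord(txt[i])] += 1       # add the character entering the window
        count_tw[ord(txt[i - M])] -= 1   # drop the character leaving the window

    # The last window is not checked inside the loop.
    if count_p == count_tw:
        found.append(N - M)
    return found


if __name__ == "__main__":
    for index in anagram_search("ABCD", "BACDGABCDA"):
        print("Found at Index", index)
```

Comparing the two lists costs MAX steps regardless of the pattern and text sizes, so the whole search is linear in the length of the text for a fixed alphabet.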
This article is contributed by Piyush Gupta. Please write comments if you find anything
incorrect, or you want to share more information about the topic discussed above
Source
https://www.geeksforgeeks.org/anagram-substring-search-search-permutations/
Chapter 59
0 banana 5 a
1 anana Sort the Suffixes 3 ana
2 nana ----------------> 1 anana
3 ana alphabetically 0 banana
4 na 4 na
5 a 2 nana
We have discussed Naive algorithm for construction of suffix array. The Naive algorithm
is to consider all suffixes, sort them using a O(nLogn) sorting algorithm and while sorting,
maintain original indexes. Time complexity of the Naive algorithm is O(n^2 Logn) where n
is the number of characters in the input string.
In this post, a O(nLogn) algorithm for suffix array construction is discussed. Let us first
discuss a O(n * Logn * Logn) algorithm for simplicity. The idea is to use the fact that
strings that are to be sorted are suffixes of a single string.
We first sort all suffixes according to first character, then according to first 2 characters,
then first 4 characters and so on while the number of characters to be considered is smaller
than 2n. The important point is, if we have sorted suffixes according to first 2^i characters,
then we can sort suffixes according to first 2^(i+1) characters in O(nLogn) time using a nLogn
sorting algorithm like Merge Sort. This is possible as two suffixes can be compared in O(1)
Chapter 59. Suffix Array | Set 2 (nLogn Algorithm)
time (we need to compare only two values, see the below example and code).
The sort function is called O(Logn) times (Note that we increase number of characters to
be considered in powers of 2). Therefore overall time complexity becomes O(nLognLogn).
See http://www.stanford.edu/class/cs97si/suffix-array.pdf for more details.
Let us build the suffix array for the example string “banana” using the above algorithm.
Sort according to first two characters
Assign a rank to all suffixes using ASCII value of first character. A simple way to assign
rank is to do “str[i] – ‘a’” for the ith suffix of str[].
For every character, we also store the rank of the next adjacent character, i.e., the rank of the
character at str[i + 1] (this is needed to sort the suffixes according to first 2 characters). If a
character is the last character, we store the next rank as -1.
Sort all suffixes according to rank and adjacent rank. Rank is considered as the first digit or
MSD, and adjacent rank is considered as the second digit.
consider rank pair of suffix just before the current suffix. If previous rank pair of a suffix is
same as previous rank pair of suffix just before it, then assign it same rank. Otherwise assign
rank of previous suffix plus one.
For every suffix str[i], also store rank of next suffix at str[i + 2]. If there is no next suffix at
i + 2, we store next rank as -1
{
// If first rank and next ranks are same as that of previous
// suffix in array, assign the same new rank to this suffix
if (suffixes[i].rank[0] == prev_rank &&
suffixes[i].rank[1] == suffixes[i-1].rank[1])
{
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = rank;
}
else // Otherwise increment rank and assign
{
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = ++rank;
}
ind[suffixes[i].index] = i;
}
{
char txt[] = "banana";
int n = strlen(txt);
int *suffixArr = buildSuffixArray(txt, n);
cout << "Following is suffix array for " << txt << endl;
printArr(suffixArr, n);
return 0;
}
Output:
Note that the above algorithm uses the standard sort function and therefore the time complexity is
O(nLognLogn). We can use Radix Sort here to reduce the time complexity to O(nLogn).
Please note that suffix arrays can be constructed in O(n) time also. We will soon be discussing
O(n) algorithms.
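The rank-doubling idea can also be written compactly in Python. The following is a minimal sketch of the O(nLognLogn) variant, not a translation of the C++ listing above. It keeps ranks in a plain list rather than in a struct, and the function name is our own.

```python
def build_suffix_array(txt):
    """Suffix array by rank doubling, using the built-in comparison sort."""
    n = len(txt)
    sa = list(range(n))
    rank = [ord(c) for c in txt]   # initial rank: code of the first character
    k = 1                          # suffixes are currently ranked by their first k characters
    while True:
        # Rank pair: (rank of first k characters, rank of the next k characters or -1).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Assign new ranks: equal pairs share a rank, otherwise previous rank plus one.
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(sa[j]) != key(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + bump
        rank = new_rank

        # Stop once all ranks are distinct; otherwise double the prefix length.
        if n == 0 or rank[sa[-1]] == n - 1:
            return sa
        k *= 2


if __name__ == "__main__":
    print(build_suffix_array("banana"))
```

Each pass compares two suffixes through a pair of integers, which is the O(1) comparison mentioned above. The loop runs O(Logn) times because k doubles on every pass.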
References:
http://www.stanford.edu/class/cs97si/suffix-array.pdf
http://www.cbcb.umd.edu/confcour/Fall2012/lec14b.pdf
Improved By : Akash Kumar 31
Source
https://www.geeksforgeeks.org/suffix-array-set-2-a-nlognlogn-algorithm/
Chapter 60
0 banana 5 a
1 anana Sort the Suffixes 3 ana
2 nana ----------------> 1 anana
3 ana alphabetically 0 banana
4 na 4 na
5 a 2 nana
Chapter 60. Suffix Array | Set 1 (Introduction)
Output:
The time complexity of above method to build suffix array is O(n^2 Logn) if we consider a
O(nLogn) algorithm used for sorting. The sorting step itself takes O(n^2 Logn) time as every
comparison is a comparison of two strings and the comparison takes O(n) time.
There are many efficient algorithms to build suffix array. We will soon be covering them as
separate posts.
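As a point of reference, the naive method fits in a few lines of Python. This is only a sketch of the idea described above (sort the suffix start indexes by comparing the suffixes themselves); the function name is our own.

```python
def build_suffix_array_naive(txt):
    """Sort suffix start indexes by the suffix text (O(n^2 Logn) in the worst case)."""
    return sorted(range(len(txt)), key=lambda i: txt[i:])


if __name__ == "__main__":
    txt = "banana"
    for index in build_suffix_array_naive(txt):
        print(index, txt[index:])
```

Each key is a full copy of a suffix, so this version also uses O(n^2) memory. It is suitable for small inputs only.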
Search a pattern using the built Suffix Array
To search a pattern in a text, we preprocess the text and build a suffix array of the text.
Since we have a sorted array of all suffixes, Binary Search can be used to search. Following
is the search function. Note that the function doesn’t report all occurrences of pattern, it
only reports one of them.
// This code only contains search() and main. To make it a complete running
// program, add the above code or see https://ide.geeksforgeeks.org/oY7OkD
return 0;
}
Output:
The time complexity of the above search function is O(mLogn). There are more efficient
algorithms to search pattern once the suffix array is built. In fact there is a O(m) suffix
array based algorithm to search a pattern. We will soon be discussing efficient algorithm
for search.
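The binary search can be sketched in Python as follows. This is an illustrative version under our own naming, not the article's C++ code. It builds the suffix array with the naive method and compares the pattern with the first m characters of the middle suffix, where m is the pattern length.

```python
def build_suffix_array_naive(txt):
    # Naive construction, good enough for a small demonstration.
    return sorted(range(len(txt)), key=lambda i: txt[i:])


def search(pat, txt, suffix_arr):
    """Return the start index of one occurrence of pat in txt, or -1 if absent."""
    m = len(pat)
    lo, hi = 0, len(txt) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = suffix_arr[mid]
        window = txt[start:start + m]   # first m characters of the middle suffix
        if window == pat:
            return start
        if pat < window:
            hi = mid - 1
        else:
            lo = mid + 1
    return -1


if __name__ == "__main__":
    txt = "banana"
    sa = build_suffix_array_naive(txt)
    print(search("nan", txt, sa))
    print(search("xyz", txt, sa))
```

There are O(Logn) iterations and each one compares at most m characters, which gives the O(mLogn) bound stated above.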
Applications of Suffix Array
Suffix array is an extremely useful data structure, it can be used for a wide range of problems.
Following are some famous problems where Suffix array can be used.
1) Pattern Searching
2) Finding the longest repeated substring
3) Finding the longest common substring
4) Finding the longest palindrome in a string
See this for more problems where Suffix arrays can be used.
This post is a simple introduction. There is a lot to cover in Suffix arrays. We have discussed
a O(nLogn) algorithm for Suffix Array construction here. We will soon be discussing more
efficient suffix array algorithms.
References:
http://www.stanford.edu/class/cs97si/suffix-array.pdf
http://en.wikipedia.org/wiki/Suffix_array
Source
https://www.geeksforgeeks.org/suffix-array-set-1-introduction/
Chapter 61
For example, “g*ks” matches with “geeks”. And string “ge?ks*” matches with “geeksforgeeks”
(note ‘*’ at the end of first string). But “g*k” doesn’t match with “gee” as character
‘k’ is not present in second string.
C++
Chapter 61. String matching where one string contains wildcard characters
Python
return False
# Driver program
test("g*ks", "geeks") # Yes
test("ge?ks*", "geeksforgeeks") # Yes
test("g*k", "gee") # No because 'k' is not in second
test("*pqrs", "pqrst") # No because 't' is not in first
test("abc*bcd", "abcdhghgbcd") # Yes
test("abc*c?d", "abcd") # No because second must have 2 instances of 'c'
test("*c*d", "abcd") # Yes
test("*?c*d", "abcd") # Yes
Output:
Yes
Yes
No
No
Yes
No
Yes
Yes
Exercise
1) In the above solution, all non-wild characters of first string must be there in the second string
and all characters of second string must match with either a normal character or wildcard
character of first string. Extend the above solution to work like other pattern searching
solutions where the first string is pattern and second string is text and we should print all
occurrences of first string in second.
2) Write a pattern searching function where the meaning of ‘?’ is same, but ‘*’ means 0 or
more occurrences of the character just before ‘*’. For example, if first string is ‘a*b’, then
it matches with ‘aaab’, but doesn’t match with ‘abb’.
This article is compiled by Vishal Chaudhary and reviewed by GeeksforGeeks team. Please
write comments if you find anything incorrect, or you want to share more information about
the topic discussed above.
Source
https://www.geeksforgeeks.org/wildcard-character-matching/
Chapter 62
Chapter 62. Pattern Searching using Suffix Tree
Following is the compressed trie. A compressed trie is obtained from a standard trie by joining
chains of single nodes. The nodes of a compressed trie can be stored by storing index
ranges at the nodes.
banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0
If we consider all of the above suffixes as individual words and build a trie, we get following.
If we join chains of single nodes, we get the following compressed trie, which is the Suffix
Please note that above steps are just to manually create a Suffix Tree. We will be discussing
actual algorithm and implementation in a separate post.
How to search a pattern in the built suffix tree?
We have discussed above how to build a Suffix Tree which is needed as a preprocessing step
in pattern searching. Following are abstract steps to search a pattern in the built Suffix
Tree.
1) Starting from the first character of the pattern and root of Suffix Tree, do following for
every character.
…..a) For the current character of pattern, if there is an edge from the current node of suffix
tree, follow the edge.
…..b) If there is no edge, print “pattern doesn’t exist in text” and return.
2) If all characters of pattern have been processed, i.e., there is a path from root for
characters of the given pattern, then print “Pattern found”.
Let us consider the example pattern as “nan” to see the searching process. Following diagram
shows the path followed for searching “nan” or “nana”.
Source
https://www.geeksforgeeks.org/pattern-searching-using-suffix-tree/
Chapter 63
In this post, we will discuss Boyer Moore pattern searching algorithm. Like KMP and Finite
Automata algorithms, Boyer Moore algorithm also preprocesses the pattern.
Boyer Moore is a combination of following two approaches.
1) Bad Character Heuristic
2) Good Suffix Heuristic
Both of the above heuristics can also be used independently to search a pattern in a text.
Let us first understand how two independent approaches work together in the Boyer Moore
Chapter 63. Boyer Moore Algorithm for Pattern Searching
algorithm. If we take a look at the Naive algorithm, it slides the pattern over the text one
by one. KMP algorithm does preprocessing over the pattern so that the pattern can be
shifted by more than one. The Boyer Moore algorithm does preprocessing for the same
reason. It preprocesses the pattern and creates different arrays for both heuristics. At
every step, it slides the pattern by max of the slides suggested by the two heuristics. So it
uses best of the two heuristics at every step.
Unlike the previous pattern searching algorithms, Boyer Moore algorithm starts
matching from the last character of the pattern.
In this post, we will discuss bad character heuristic, and discuss Good Suffix heuristic in
the next post.
Bad Character Heuristic
The idea of bad character heuristic is simple. The character of the text which doesn’t match
with the current character of pattern is called the Bad Character. Upon mismatch we
shift the pattern until –
1) The mismatch becomes a match
2) Pattern P moves past the mismatched character.
Case 1 – Mismatch becomes a match
We will look up the position of the last occurrence of the mismatching character in the pattern,
and if the mismatching character exists in the pattern then we’ll shift the pattern such that it
gets aligned to the mismatching character in text T.
case 1
Explanation: In the above example, we got a mismatch at position 3. Here our mismatching
character is “A”. Now we will search for the last occurrence of “A” in pattern. We got “A”
at position 1 in pattern (displayed in Blue) and this is the last occurrence of it. Now we will
shift pattern 2 times so that “A” in pattern gets aligned with “A” in text.
Case 2 – Pattern moves past the mismatched character
We’ll look up the position of the last occurrence of the mismatching character in the pattern,
and if the character does not exist we will shift the pattern past the mismatching character.
case2
Explanation: Here we have a mismatch at position 7. The mismatching character “C”
does not exist in pattern before position 7, so we’ll shift the pattern past position 7, and
eventually in the above example we get a perfect match of the pattern (displayed in Green).
We do this because “C” does not exist in the pattern, so at every shift before position 7 we
will get a mismatch and our search will be fruitless.
In following implementation, we preprocess the pattern and store the last occurrence of
every possible character in an array of size equal to alphabet size. If the character is not
present at all, then it may result in a shift by m (length of pattern). Therefore, the bad
int badchar[NO_OF_CHARS];
else
/* Shift the pattern so that the bad character
in text aligns with the last occurrence of
it in pattern. The max function is used to
make sure that we get a positive shift.
We may get a negative shift if the last
occurrence of bad character in pattern
is on the right side of the current
character. */
s += max(1, j - badchar[txt[s+j]]);
}
}
Python
NO_OF_CHARS = 256
badChar = [-1]*NO_OF_CHARS
'''
Shift the pattern so that the next character in text
aligns with the last occurrence of it in pattern.
The condition s+m < n is necessary for the case when
pattern occurs at the end of text
'''
s += (m-badChar[ord(txt[s+m])] if s+m<n else 1)
else:
'''
Shift the pattern so that the bad character in text
if __name__ == '__main__':
main()
Output:
The Bad Character Heuristic may take O(mn) time in the worst case. The worst
case occurs when all characters of the text and pattern are the same. For example, txt[] =
“AAAAAAAAAAAAAAAAAA” and pat[] = “AAAAA”.
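For reference, here is a compact Python sketch of the bad character heuristic, pieced together from the two shift rules shown in the listings above. The function names are our own, the result is returned as a list of shifts instead of being printed, and characters are assumed to have codes below 256.

```python
NO_OF_CHARS = 256  # assumed alphabet size


def bad_char_heuristic(pat):
    """Last index of every character in pat, or -1 if the character is absent."""
    bad_char = [-1] * NO_OF_CHARS
    for i, ch in enumerate(pat):
        bad_char[ord(ch)] = i
    return bad_char


def search(txt, pat):
    """Return all shifts at which pat occurs in txt (bad character rule only)."""
    m, n = len(pat), len(txt)
    if m == 0 or m > n:
        return []
    bad_char = bad_char_heuristic(pat)
    shifts = []
    s = 0                                  # current shift of the pattern
    while s <= n - m:
        j = m - 1
        while j >= 0 and pat[j] == txt[s + j]:
            j -= 1                         # compare from the last character backwards
        if j < 0:
            shifts.append(s)
            # Align the character after the window with its last occurrence in pat.
            s += (m - bad_char[ord(txt[s + m])]) if s + m < n else 1
        else:
            # max() guards against a negative shift when the last occurrence
            # of the bad character lies to the right of position j.
            s += max(1, j - bad_char[ord(txt[s + j])])
    return shifts


if __name__ == "__main__":
    print(search("ABAAABCD", "ABC"))
    print(len(search("A" * 18, "AAAAA")))
```

On the all-‘A’ example every shift is a full match followed by a shift of one, so all m characters are compared at each of the n-m+1 shifts.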
Boyer Moore Algorithm | Good Suffix heuristic
This article is co-authored by Atul Kumar. Please write comments if you find anything
incorrect, or you want to share more information about the topic discussed above.
Source
https://www.geeksforgeeks.org/boyer-moore-algorithm-for-pattern-searching/
Chapter 64
Chapter 64. Pattern Searching | Set 6 (Efficient Construction of Finite Automata)
The above diagrams represent graphical and tabular representations of pattern ACACAGA.
Algorithm:
1) Fill the first row. All entries in first row are always 0 except the entry for pat[0] character.
For pat[0] character, we always need to go to state 1.
2) Initialize lps as 0. lps for the first index is always 0.
3) Do following for rows at index i = 1 to M. (M is the length of the pattern)
…..a) Copy the entries from the row at index equal to lps.
…..b) Update the entry for pat[i] character to i+1.
…..c) Update lps “lps = TF[lps][pat[i]]” where TF is the 2D array which is being constructed.
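The three steps above translate almost line by line into Python. The sketch below is illustrative and uses our own function names. Unlike the C version, it does not touch pat[M] (C strings end with a terminating character there), so the update in steps b) and c) is skipped for the last row.

```python
NO_OF_CHARS = 256  # assumed alphabet size


def compute_trans_fun(pat):
    """Build the transition table TF in O(M * NO_OF_CHARS) time."""
    M = len(pat)
    TF = [[0] * NO_OF_CHARS for _ in range(M + 1)]
    if M == 0:
        return TF
    TF[0][ord(pat[0])] = 1          # step 1: first row
    lps = 0                         # step 2
    for i in range(1, M + 1):       # step 3
        TF[i] = TF[lps][:]          # a) copy the row at index lps
        if i < M:
            TF[i][ord(pat[i])] = i + 1      # b) matching character advances the state
            lps = TF[lps][ord(pat[i])]      # c) update lps for the next row
    return TF


def search(pat, txt):
    """Return start indexes of all occurrences of pat in txt."""
    M = len(pat)
    if M == 0:
        return []
    TF = compute_trans_fun(pat)
    found, state = [], 0
    for i, ch in enumerate(txt):
        state = TF[state][ord(ch)]
        if state == M:
            found.append(i - M + 1)
    return found


if __name__ == "__main__":
    print(compute_trans_fun("ACACAGA")[5][ord("C")])
    print(search("ACACAGA", "ACACACAGA"))
```

Row lps is always an earlier, already finished row, which is why a plain copy followed by a single update is enough for each state.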
Implementation
Following is C implementation for the above algorithm.
#include<stdio.h>
#include<string.h>
#define NO_OF_CHARS 256
/* This function builds the TF table which represents Finite Automata for a
given pattern */
void computeTransFun(char *pat, int M, int TF[][NO_OF_CHARS])
{
int i, lps = 0, x;
TF[i][pat[i]] = i + 1;
int TF[M+1][NO_OF_CHARS];
computeTransFun(pat, M, TF);
Output:
Source
https://www.geeksforgeeks.org/pattern-searching-set-5-efficient-constructtion-of-finite-automata/
Chapter 65
Chapter 65. Finite Automata algorithm for Pattern Searching
Automata. Construction of the FA is the main tricky part of this algorithm. Once the FA
is built, the searching is simple. In search, we simply need to start from the first state of
the automata and the first character of the text. At every step, we consider next character
of text, look for the next state in the built FA and move to a new state. If we reach the
final state, then the pattern is found in the text. The time complexity of the search process
is O(n).
Before we discuss FA construction, let us take a look at the following FA for pattern
ACACAGA.
The above diagrams represent graphical and tabular representations of pattern ACACAGA.
Number of states in FA will be M+1 where M is length of the pattern. The main thing
to construct FA is to get the next state from the current state for every possible character.
Given a character x and a state k, we can get the next state by considering the string
“pat[0..k-1]x” which is basically concatenation of pattern characters pat[0], pat[1] … pat[k-1]
and the character x. The idea is to get length of the longest prefix of the given pattern such
that the prefix is also suffix of “pat[0..k-1]x”. The value of length gives us the next state.
For example, let us see how to get the next state from current state 5 and character ‘C’ in
the above diagram. We need to consider the string, “pat[0..4]C” which is “ACACAC”. The
length of the longest prefix of the pattern such that the prefix is suffix of “ACACAC” is 4
(“ACAC”). So the next state (from state 5) is 4 for character ‘C’.
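The definition of the next state can be written down directly in Python. The following is a small sketch that follows the definition literally rather than the article's listing: build the string “pat[0..k-1]x” and try prefixes of the pattern from the longest to the shortest. The function name next_state is our own.

```python
def next_state(pat, state, x):
    """Length of the longest prefix of pat that is a suffix of pat[:state] + x."""
    M = len(pat)
    # Shortcut: if x continues the match, the whole string is a prefix of pat.
    if state < M and pat[state] == x:
        return state + 1
    s = pat[:state] + x
    # Try the remaining candidate lengths, longest first.
    for ns in range(state, 0, -1):
        if s.endswith(pat[:ns]):
            return ns
    return 0


if __name__ == "__main__":
    print(next_state("ACACAGA", 5, "C"))
    print(next_state("ACACAGA", 5, "G"))
```

Calling this for every state and every character fills the whole table, at the cost of repeatedly re-comparing prefixes.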
In the following code, computeTF() constructs the FA. The time complexity of the
computeTF() is O(m^3*NO_OF_CHARS) where m is length of the pattern and
NO_OF_CHARS is size of alphabet (total number of possible characters in pattern and
text). The implementation tries all possible prefixes starting from the longest possible
return 0;
}
int TF[M+1][NO_OF_CHARS];
computeTF(pat, M, TF);
Java
// increment state
if(state < M && x == pat[state])
return state + 1;
return 0;
}
computeTF(pat, M, TF);
// Driver code
public static void main(String[] args)
{
char[] txt = "AABAACAADAABAAABAA".toCharArray();
char[] pat = "AABA".toCharArray();
search(pat,txt);
}
}
Python
NO_OF_CHARS = 256
i=0
# ns stores the result which is next state
for ns in range(state,0,-1):
if ord(pat[ns-1]) == x:
while(i<ns-1):
if pat[i] != pat[state-ns+1+i]:
break
i+=1
if i == ns-1:
return ns
return 0
return TF
search(pat, txt)
if __name__ == '__main__':
main()
Output:
References:
Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest,
Clifford Stein
Improved By : debjitdbb
Source
https://www.geeksforgeeks.org/finite-automata-algorithm-for-pattern-searching/
Chapter 66
Chapter 66. Optimized Naive Algorithm for Pattern Searching
while (i <= N - M)
{
int j;
Python
PHP
<?php
// PHP program for A modified Naive
// Pattern Searching algorithm that
// is optimized for the cases when all
// characters of pattern are different
// if pat[0...M-1] =
// txt[i, i+1, ...i+M-1]
if ($j == $M)
{
echo("Pattern found at index $i"."\n" );
$i = $i + $M;
}
else if ($j == 0)
$i = $i + 1;
else
// Driver Code
$txt = "ABCEABCDABCEABCD";
$pat = "ABCD";
search($pat, $txt);
Output:
Source
https://www.geeksforgeeks.org/optimized-naive-algorithm-for-pattern-searching/
Chapter 67
The Naive String Matching algorithm slides the pattern one by one. After each slide, it
one by one checks characters at the current shift and if all characters match then prints the
match.
Like the Naive Algorithm, Rabin-Karp algorithm also slides the pattern one by one. But
unlike the Naive algorithm, Rabin Karp algorithm matches the hash value of the pattern
with the hash value of current substring of text, and if the hash values match then only
it starts matching individual characters. So Rabin Karp algorithm needs to calculate hash
values for following strings.
1) Pattern itself.
2) All the substrings of text of length m.
Chapter 67. Rabin-Karp Algorithm for Pattern Searching
Since we need to efficiently calculate hash values for all the substrings of size m of text, we
must have a hash function which has following property.
Hash at the next shift must be efficiently computable from the current hash value and next
character in text or we can say hash(txt[s+1 .. s+m]) must be efficiently computable from
hash(txt[s .. s+m-1]) and txt[s+m] i.e., hash(txt[s+1 .. s+m]) = rehash(txt[s+m], hash(txt[s
.. s+m-1])) and rehash must be O(1) operation.
The hash function suggested by Rabin and Karp calculates an integer value. The integer
value for a string is numeric value of a string. For example, if all possible characters are from
1 to 10, the numeric value of “122” will be 122. The number of possible characters is higher
than 10 (256 in general) and pattern length can be large. So the numeric values cannot be
practically stored as an integer. Therefore, the numeric value is calculated using modular
arithmetic to make sure that the hash values can be stored in an integer variable (can fit in
memory words). To do rehashing, we need to take off the most significant digit and add the
new least significant digit in the hash value. Rehashing is done using the following formula.
hash( txt[s+1 .. s+m] ) = ( d ( hash( txt[s .. s+m-1]) – txt[s]*h ) + txt[s + m] ) mod q
hash( txt[s .. s+m-1] ) : Hash value at shift s.
hash( txt[s+1 .. s+m] ) : Hash value at next shift (or shift s+1)
d: Number of characters in the alphabet
q: A prime number
h: d^(m-1)
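The rehash formula can be tried out with a short Python sketch. This is an illustration under assumptions rather than the article's listing: d = 256 and q = 101 are example values, the function name is our own, and matches are returned as a list. Python's % operator already returns a non-negative result, so no extra correction for negative values is needed.

```python
def rabin_karp(pat, txt, q=101, d=256):
    """Return start indexes of all occurrences of pat in txt."""
    M, N = len(pat), len(txt)
    if M == 0 or M > N:
        return []

    h = pow(d, M - 1, q)        # h = d^(m-1) mod q
    p = t = 0                   # hash of the pattern and of the current window
    for i in range(M):
        p = (d * p + ord(pat[i])) % q
        t = (d * t + ord(txt[i])) % q

    found = []
    for s in range(N - M + 1):
        # Compare characters only when the hash values agree.
        if p == t and txt[s:s + M] == pat:
            found.append(s)
        if s < N - M:
            # Remove the leading digit txt[s], shift, and add the trailing digit txt[s+M].
            t = (d * (t - ord(txt[s]) * h) + ord(txt[s + M])) % q
    return found


if __name__ == "__main__":
    for index in rabin_karp("GEEK", "GEEKS FOR GEEKS"):
        print("Pattern found at index", index)
```

A small q such as 101 keeps the numbers easy to follow but causes many spurious hash matches; a larger prime reduces them.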
C/C++
Java
if ( p == t )
{
/* Check for characters one by one */
for (j = 0; j < M; j++)
{
if (txt.charAt(i+j) != pat.charAt(j))
break;
}
Python
j+=1
# if p == t and pat[0...M-1] = txt[i, i+1, ...i+M-1]
if j==M:
print "Pattern found at index " + str(i)
PHP
<?php
// Following program is a PHP
// implementation of Rabin Karp
// Algorithm given in the CLRS book
// if p == t and pat[0...M-1] =
// txt[i, i+1, ...i+M-1]
if ($j == $M)
echo "Pattern found at index ",
$i, "\n";
}
// Driver Code
$txt = "GEEKS FOR GEEKS";
$pat = "GEEK";
$q = 101; // A prime number
search($pat, $txt, $q);
Output:
The average and best case running time of the Rabin-Karp algorithm is O(n+m), but its
worst-case time is O(nm). Worst case of Rabin-Karp algorithm occurs when all characters
of pattern and text are same as the hash values of all the substrings of txt[] match with
hash value of pat[]. For example pat[] = “AAA” and txt[] = “AAAAAAA”.
References:
http://net.pku.edu.cn/~course/cs101/2007/resource/Intro2Algorithm/book6/chap34.htm
http://www.cs.princeton.edu/courses/archive/fall04/cos226/lectures/string.4up.pdf
http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
Related Posts:
Searching for Patterns | Set 1 (Naive Pattern Searching)
Searching for Patterns | Set 2 (KMP Algorithm)
Improved By : Hao Lee, jit_t
Source
https://www.geeksforgeeks.org/rabin-karp-algorithm-for-pattern-searching/
Chapter 68
Chapter 68. KMP Algorithm for Pattern Searching
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
The KMP matching algorithm uses degenerating property (pattern having same sub-patterns
appearing more than once in the pattern) of the pattern and improves the worst case
complexity to O(n). The basic idea behind KMP’s algorithm is: whenever we detect a mismatch
(after some matches), we already know some of the characters in the text of the next window.
We take advantage of this information to avoid matching the characters that we know will
anyway match. Let us consider below example to understand this.
Matching Overview
txt = "AAAAABAAABA"
pat = "AAAA"
Need of Preprocessing?
An important question arises from the above explanation,
how to know how many characters to be skipped. To know this,
we pre-process pattern and prepare an integer array
lps[] that tells us the count of characters to be skipped.
Preprocessing Overview:
• KMP algorithm preprocesses pat[] and constructs an auxiliary lps[] of size m (same
as size of pattern) which is used to skip characters while matching.
• The name lps indicates the longest proper prefix which is also a suffix. A proper prefix
is a prefix with the whole string not allowed. For example, prefixes of “ABC” are “”, “A”,
“AB” and “ABC”. Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”,
“C”, “BC” and “ABC”.
• For each sub-pattern pat[0..i] where i = 0 to m-1, lps[i] stores length of the maximum
matching proper prefix which is also a suffix of the sub-pattern pat[0..i].
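The definition of lps[] can be made concrete with a short Python sketch. The function name compute_lps is our own, and the code follows the usual linear-time construction, reusing earlier lps values instead of restarting the comparison.

```python
def compute_lps(pat):
    """lps[i] = length of the longest proper prefix of pat[0..i] that is also its suffix."""
    lps = [0] * len(pat)
    length = 0            # length of the previous longest prefix-suffix
    i = 1                 # lps[0] is always 0
    while i < len(pat):
        if pat[i] == pat[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            # Fall back to the next shorter candidate; do not advance i here.
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps


if __name__ == "__main__":
    for pattern in ("AAAA", "ABCDE", "AABAACAABAA"):
        print(pattern, compute_lps(pattern))
```

For “AAAA” every proper prefix is also a suffix, so the values grow by one at each position, while for a pattern with all distinct characters every value is 0.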
Searching Algorithm:
Unlike Naive algorithm, where we slide the pattern by one and compare all characters at
each shift, we use a value from lps[] to decide the next characters to be matched. The idea
is to not match a character that we know will anyway match.
How to use lps[] to decide next positions (or to know a number of characters to be skipped)?
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
lps[] = {0, 1, 2, 3}
i = 0, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 1, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 2, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 3, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
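i = 4, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3
i = 4, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3
i = 5, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2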
i = 5, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1
i = 5, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0
i = 5, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.
i = 6, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
i = 7, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
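The steps traced above can be sketched as a short function. This is a minimal illustration, not the chapter's original listing: the name kmp_search and the choice to return a list of indices (rather than print them) are assumptions, and lps[] is passed in ready-made so the sketch can use the values {0, 1, 2, 3} from the trace.

```python
def kmp_search(pat, txt, lps):
    """Return every index in txt where pat starts, given a precomputed lps[]."""
    m, n = len(pat), len(txt)
    found = []
    if m == 0:
        return found  # assumption: an empty pattern reports no matches
    i = 0  # index into txt
    j = 0  # index into pat
    while i < n:
        if txt[i] == pat[j]:
            # Characters match: advance both indices.
            i += 1
            j += 1
            if j == m:
                # Whole pattern matched, ending at txt[i-1].
                found.append(i - j)
                j = lps[j - 1]
        elif j > 0:
            # Mismatch after some matches: change only j.
            j = lps[j - 1]
        else:
            # Mismatch at the first pattern character: move on in the text.
            i += 1
    return found


if __name__ == "__main__":
    # Same inputs as the trace; matches start at indices 0 and 1.
    print(kmp_search("AAAA", "AAAAABAAABA", [0, 1, 2, 3]))  # [0, 1]
```

Note that i never decreases, which is why the text is scanned only once.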
C++
computeLPSArray(pat, M, lps);
if (j == M) {
printf("Found pattern at index %d ", i - j);
j = lps[j - 1];
}
if (len != 0) {
len = lps[len - 1];
Java
class KMP_String_Matching {
void KMPSearch(String pat, String txt)
{
int M = pat.length();
int N = txt.length();
i++;
}
if (j == M) {
System.out.println("Found pattern "
+ "at index " + (i - j));
j = lps[j - 1];
}
i++;
}
}
}
}
Python
if j == M:
    print("Found pattern at index " + str(i - j))
    j = lps[j - 1]
i += 1
txt = "ABABDABACDABABCABAB"
pat = "ABABCABAB"
KMPSearch(pat, txt)
C#
class GFG {
// to search step.
if (len != 0) {
len = lps[len - 1];
Output:
Found pattern at index 10
Preprocessing Algorithm:
In the preprocessing step, we compute the values in lps[]. To do that, we keep track of the
length of the longest prefix that is also a suffix for the previous index (the variable len
holds this value). We initialize lps[0] and len to 0. If pat[len] and pat[i] match, we
increment len by 1 and assign the incremented value to lps[i]. If pat[i] and pat[len] do not
match and len is not 0, we update len to lps[len-1] and compare again without advancing i.
If they do not match and len is 0, we set lps[i] to 0 and move to the next i.
See computeLPSArray() in the code for details.
Illustration of preprocessing (or construction of lps[])
pat[] = "AAACAAAA"
len = 0, i = 0.
lps[0] is always 0, we move
to i = 1
len = 0, i = 1.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2
len = 1, i = 2.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3
len = 2, i = 3.
Since pat[len] and pat[i] do not match, and len > 0,
set len = lps[len-1] = lps[1] = 1
len = 1, i = 3.
Since pat[len] and pat[i] do not match and len > 0,
len = lps[len-1] = lps[0] = 0
len = 0, i = 3.
Since pat[len] and pat[i] do not match and len = 0,
Set lps[3] = 0 and i = 4.
len = 0, i = 4.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5
len = 1, i = 5.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6
len = 2, i = 6.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7
len = 3, i = 7.
Since pat[len] and pat[i] do not match and len > 0,
set len = lps[len-1] = lps[2] = 2
len = 2, i = 7.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8
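The illustration above can be reproduced with a short sketch of the preprocessing loop. The name compute_lps and the variable name length (used in place of len, which is a Python built-in) are choices made for this example; the three branches follow the three cases described in the text.

```python
def compute_lps(pat):
    """Build lps[] for pat in O(m) time."""
    m = len(pat)
    lps = [0] * m   # lps[0] is always 0
    length = 0      # length of the previous longest prefix that is also a suffix
    i = 1
    while i < m:
        if pat[i] == pat[length]:
            # Match: extend the current prefix-suffix and record it.
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            # Mismatch with length > 0: fall back, do not advance i.
            length = lps[length - 1]
        else:
            # Mismatch with length == 0: no prefix-suffix ends at i.
            lps[i] = 0
            i += 1
    return lps


if __name__ == "__main__":
    print(compute_lps("AAACAAAA"))  # [0, 1, 2, 0, 1, 2, 3, 3]
```

The printed array agrees with the values lps[0] to lps[7] derived step by step in the illustration.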
Source
https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
Chapter 69
Chapter 69. Naive algorithm for Pattern Searching
#include <string.h>
Java
int j;
C#
class GFG
{
// Driver code
public static void Main()
{
String txt = "AABAACAADAABAAABAA";
String pat = "AABA";
search(txt, pat);
}
}
// This code is Contributed by Sam007
PHP
<?php
// PHP program for Naive Pattern
// Searching algorithm
// if pat[0...M-1] =
// txt[i, i+1, ...i+M-1]
if ($j == $M)
echo "Pattern found at index ", $i."\n";
}
}
// Driver Code
$txt = "AABAACAADAABAAABAA";
$pat = "AABA";
search($pat, $txt);
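Output:
Pattern found at index 0
Pattern found at index 9
Pattern found at index 13
The best case occurs when the first character of the pattern is not present in the text at all.
txt[] = "AABCCAADDEE";
pat[] = "FAA";
The worst case occurs in the following scenarios.
1) When all characters of the text and the pattern are the same.
txt[] = "AAAAAAAAAAAAAAAAAA";
pat[] = "AAAAA";
2) Worst case also occurs when only the last character is different.
txt[] = "AAAAAAAAAAAAAAAAAB";
pat[] = "AAAAB";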
The number of comparisons in the worst case is O(m*(n-m+1)). Although strings with many
repeated characters are unlikely to appear in English text, they may well occur in
other applications (for example, in binary texts). The KMP matching algorithm improves
the worst case to O(n); it is covered in Chapter 68.
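The best-case and worst-case counts can be observed with a small sketch of the Naive search that also counts character comparisons. The name naive_search and the returned (matches, comparisons) pair are assumptions made for this illustration; the chapter's own listings print the matches instead.

```python
def naive_search(pat, txt):
    """Return (match indices, number of character comparisons)."""
    m, n = len(pat), len(txt)
    matches = []
    comparisons = 0
    # Slide the pattern over the text one position at a time.
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if txt[i + j] != pat[j]:
                break  # mismatch: abandon this window
            j += 1
        if j == m:
            matches.append(i)  # pat[0..m-1] == txt[i..i+m-1]
    return matches, comparisons


if __name__ == "__main__":
    # Best case: one comparison per window.
    print(naive_search("FAA", "AABCCAADDEE"))       # ([], 9)
    # Worst case: m comparisons in each of the n-m+1 windows.
    print(naive_search("AAAAB", "A" * 17 + "B"))    # ([13], 70)
```

With n = 18 and m = 5, the worst-case run performs 5 * (18 - 5 + 1) = 70 comparisons, matching m*(n-m+1).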
Improved By : Sam007, Brij Raj Kishore
Source
https://www.geeksforgeeks.org/naive-algorithm-for-pattern-searching/