Shortest Common Superstring


PROBLEM DEFINITION :

Given a set of strings S1,...,Sn, find the shortest string S that contains each Si as a substring.


PROBLEM DESCRIPTION :
The shortest common superstring problem takes as input several strings, possibly of
different lengths, and finds the shortest string that contains every input string as a
substring. This is helpful in genome projects, since it allows researchers to
reconstruct entire coding regions from a collection of fragmented sections. The
problem also arises in a variety of other applications, including sparse matrix
compression. Suppose we have an (n x m) matrix in which most of the elements are zero.
We can partition each row into (m / k) runs of k elements each and construct the shortest
common superstring S' of these runs. We have now reduced the problem to storing the
superstring, plus an (n x (m / k)) array of pointers into the superstring denoting where
each of the runs starts. Accessing a particular element M[i,j] still takes constant time, but
there is a space saving when |S'| << mn.
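
The constant-time access described above can be sketched in C as follows. This is only an illustration under assumed names: the structure compressed_matrix, its fields and the function cm_get are hypothetical and are not part of the program given later in this report.

```c
#include <stddef.h>

/* Hypothetical layout for the compressed matrix described above. */
struct compressed_matrix {
    const int *super;      /* superstring S' containing every run of k elements */
    const size_t *start;   /* start[i * runs_per_row + r]: offset in super of run r of row i */
    size_t runs_per_row;   /* m / k */
    size_t k;              /* run length */
};

/* Returns M[i][j] in constant time: find the run that holds column j,
   then step to the element's position inside that run. */
int cm_get(const struct compressed_matrix *cm, size_t i, size_t j)
{
    size_t run = j / cm->k;
    size_t pos = j % cm->k;
    return cm->super[cm->start[i * cm->runs_per_row + run] + pos];
}
```

For instance, a 2 x 4 matrix with rows (0,0,5,0) and (5,0,0,0) and k = 2 has only the runs (0,0) and (5,0), and both occur inside the three-element superstring (5,0,0).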

INPUT DESCRIPTION :
Given a set of n strings, S = {S1,...,Sn}, we want to find the shortest string s that contains
every Si as a substring.

OUTPUT DESCRIPTION :
The output is the shortest common superstring of the given set of strings. The program
also prints the input strings one per line, each shifted to the right so that it lines up
with the position where it matches the superstring.

ASSUMPTIONS :
We assume that no string Si in S is a substring of another string Sj in S. This problem is
NP-hard: no polynomial-time algorithm is known for it, and the running time of exact
methods grows exponentially with the number of strings, so large instances cannot be
solved exactly in a practical amount of time.

TECHNIQUES THAT CAN BE APPLIED:

1. GREEDY HEURISTIC METHOD :


The Greedy Heuristic method provides the standard approach to approximating
Shortest Common Superstring.

ALGORITHM:

Step 1 : Input the set of strings S = {S1, S2, ..., Sn}.

Step 2 : Compute the overlap of every pair of strings, using a brute-force comparison or
the Knuth-Morris-Pratt algorithm, and identify the pair with the maximum overlap.

Step 3 : Replace the pair of strings with maximum overlap by their merged string. Repeat
Steps 2 and 3 until only one string remains.

Step 4 : Output the superstring on one line, followed by the input strings, each
appropriately shifted to the right to show where it lies in the superstring.
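
Steps 1 to 3 can be sketched in C as follows. This is a minimal illustration and not the program given later in this report: the names overlap, greedy_scs, MAX_STRINGS and MAX_LEN are assumptions, the overlap is found by brute force rather than by Knuth-Morris-Pratt, and the caller is assumed to supply rows large enough to hold the merged result.

```c
#include <string.h>

#define MAX_STRINGS 16   /* assumed limit on the number of strings */
#define MAX_LEN 256      /* assumed row size; must hold the final superstring */

/* Length of the longest suffix of a that is also a prefix of b (brute force). */
static size_t overlap(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t l = la < lb ? la : lb;
    for (; l > 0; l--)
        if (strncmp(a + la - l, b, l) == 0)
            return l;
    return 0;
}

/* Repeatedly merges the pair with the largest overlap. The array strs is
   overwritten; the result is left in strs[0] and its length is returned. */
size_t greedy_scs(char strs[][MAX_LEN], int n)
{
    while (n > 1) {
        int bi = 0, bj = 1, found = 0;
        size_t best = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j)
                    continue;
                size_t o = overlap(strs[i], strs[j]);
                if (!found || o > best) {
                    best = o;
                    bi = i;
                    bj = j;
                    found = 1;
                }
            }
        }
        /* Merge: append the part of strs[bj] that lies beyond the overlap. */
        strcat(strs[bi], strs[bj] + best);
        /* Delete strs[bj] by moving the last string into its slot. */
        if (bj != n - 1)
            strcpy(strs[bj], strs[n - 1]);
        n--;
    }
    return strlen(strs[0]);
}
```

Called on the five strings ABRAC, ACADA, ADABR, DABRA and RACAD, this sketch returns a superstring of length 10. Ties between equal overlaps are broken by scan order, so a different tie rule may give a different superstring of the same length.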

2. USING TRAVELLING SALESMAN PROBLEM APPROACH :


This is one of the best-known hard problems in computer science. A salesperson must
visit n cities, passing through each city exactly once, beginning from one city that is
considered the base or starting city and returning to it. The cost of transportation
between each pair of cities is given. The problem is to find the minimum-cost route, that is,
the order of visiting the cities for which the total cost is minimum.

To solve the superstring problem using TSP we perform the following operations:

1. Create an overlap graph G where vertex vi represents string Si.
2. Assign edge (vi,vj) a weight equal to the length of Si minus the overlap between the
end of Si and the start of Sj. Thus W(vi,vj) = 1 for Si = abc and Sj = bcd.
3. The minimum weight path visiting all the vertices defines the SCS. These edge
weights are not symmetric.
4. For example, W(v1,v2) = 3 for the two strings S1 = ABRAC and S2 = ACADA, whose
overlap is AC.
5. Now the TSP is applied.
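
The edge weight of operation 2 can be sketched in C as follows. The function names overlap and edge_weight are assumptions made for this illustration.

```c
#include <string.h>

/* Length of the longest suffix of a that is also a prefix of b. */
static size_t overlap(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t l = la < lb ? la : lb;
    for (; l > 0; l--)
        if (strncmp(a + la - l, b, l) == 0)
            return l;
    return 0;
}

/* W(vi,vj) = length of Si minus the overlap of the end of Si with the start of Sj. */
size_t edge_weight(const char *si, const char *sj)
{
    return strlen(si) - overlap(si, sj);
}
```

With these definitions edge_weight("ABRAC", "ACADA") is 3 while edge_weight("ACADA", "ABRAC") is 4, which shows that the weights are not symmetric.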

ALGORITHM TSP:

Step 1 : First, enumerate all (n-1)! possible tours, where n is the number of input strings.

Step 2 : Next, compute the cost of every one of these (n-1)! tours.

Step 3 : Finally, keep the one with the minimum cost.
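
The exhaustive search can be sketched in C as follows. This is an illustration under stated assumptions: the names search_state, permute and brute_force_scs_length are hypothetical, at most MAX_N strings are handled, and no input string is a substring of another. Because a superstring corresponds to a path rather than a closed tour, the sketch does not fix the starting string and therefore tries all n! orderings instead of (n-1)! tours.

```c
#include <string.h>

#define MAX_N 8   /* assumed limit; n! orderings grow very quickly */

/* Length of the longest suffix of a that is also a prefix of b. */
static size_t overlap(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t l = la < lb ? la : lb;
    for (; l > 0; l--)
        if (strncmp(a + la - l, b, l) == 0)
            return l;
    return 0;
}

struct search_state {
    const char **strs;
    int n;
    int used[MAX_N];
    int order[MAX_N];
    int best_order[MAX_N];
    size_t best_len;
};

/* Extends the current ordering by every unused string in turn. */
static void permute(struct search_state *st, int depth, size_t len)
{
    if (depth == st->n) {
        if (len < st->best_len) {
            st->best_len = len;
            memcpy(st->best_order, st->order, sizeof st->order);
        }
        return;
    }
    for (int i = 0; i < st->n; i++) {
        if (st->used[i])
            continue;
        /* Appending string i costs its length minus its overlap with the previous string. */
        size_t add = strlen(st->strs[i]);
        if (depth > 0)
            add -= overlap(st->strs[st->order[depth - 1]], st->strs[i]);
        st->used[i] = 1;
        st->order[depth] = i;
        permute(st, depth + 1, len + add);
        st->used[i] = 0;
    }
}

/* Returns the minimum superstring length over all orderings and writes the
   best ordering to best_order; returns (size_t)-1 if n is out of range. */
size_t brute_force_scs_length(const char **strs, int n, int *best_order)
{
    struct search_state st;
    if (n < 1 || n > MAX_N)
        return (size_t)-1;
    memset(&st, 0, sizeof st);
    st.strs = strs;
    st.n = n;
    st.best_len = (size_t)-1;
    permute(&st, 0, 0);
    memcpy(best_order, st.best_order, (size_t)n * sizeof(int));
    return st.best_len;
}
```

For the strings ABCDE, BCDEF, DEFGH and CDEFG the sketch reports length 8 with the ordering ABCDE, BCDEF, CDEFG, DEFGH.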

3. THE SET COVER ALGORITHM APPROACH :
Using the set cover method, we obtain a 2Hn factor approximation algorithm, where
Hn = 1 + 1/2 + ... + 1/n is the n-th harmonic number.
Given the input S = {S1,...,Sn}, we construct a string rijk for every pair Si, Sj in S by
overlapping the last k characters of Si with the first k characters of Sj (where k is the
maximum overlap between the two). Let R be the set of all such strings r. For each string
v in S U R, let sub(v) = {s in S | s is a substring of v}. The set cover instance C has S as
its universe, and its candidate subsets are the sets sub(v) for all v in S U R, where the
cost of sub(v) is the length of v.

ALGORITHM (SET COVER):

Step 1 : Use the greedy set cover algorithm to find a cover for the instance C.
Step 2 : From the sets selected by the algorithm, recover the strings v1, ..., vk such that
sub(v1) U ... U sub(vk) is the cover for C.
Step 3 : Concatenating the strings v1, ..., vk gives a superstring of S, which is the
approximate shortest superstring obtained via set cover.
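
Step 1 can be sketched in C as follows. This is an illustration only: the function name greedy_set_cover and the bitmask encoding are assumptions, with bit e of mask[t] set when candidate set t contains input string e, and cost[t] equal to the length of the string v that produced the set. At most 32 input strings are supported by this encoding, and the array chosen must have room for one entry per candidate set.

```c
/* Greedy set cover: repeatedly pick the set with the lowest cost per newly
   covered element. Returns the number of sets chosen, or -1 if the universe
   cannot be covered. */
int greedy_set_cover(const unsigned *mask, const double *cost, int num_sets,
                     unsigned universe, int *chosen)
{
    unsigned covered = 0;
    int count = 0;
    while ((covered & universe) != universe) {
        int best = -1;
        double best_ratio = 0.0;
        for (int t = 0; t < num_sets; t++) {
            unsigned fresh = mask[t] & universe & ~covered;
            int gain = 0;
            /* Count the elements this set would newly cover. */
            for (unsigned b = fresh; b != 0; b &= b - 1)
                gain++;
            if (gain == 0)
                continue;
            double ratio = cost[t] / gain;
            if (best < 0 || ratio < best_ratio) {
                best = t;
                best_ratio = ratio;
            }
        }
        if (best < 0)
            return -1;
        covered |= mask[best];
        chosen[count++] = best;
    }
    return count;
}
```

As a hypothetical instance, take the strings abc, bcd and cde as elements 0, 1 and 2, with candidate sets sub(abc), sub(bcd), sub(cde) of cost 3 each and sub(abcd) = {abc, bcd}, sub(bcde) = {bcd, cde} of cost 4 each. The sketch first picks sub(abcd) and then sub(cde).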

4. KRUSKAL’S MAXIMUM SPANNING TREE ALGORITHM :


We can also solve the problem by building a graph G from the given set of strings,
with the overlaps as edge weights, and finding its maximum spanning tree T using
Kruskal's algorithm.

ALGORITHM :

One method for computing the maximum weight spanning tree of a network G,
due to Kruskal, can be summarized as follows.

Step 1 : Sort the edges of G into decreasing order by weight. Let T be the set of edges
comprising the maximum weight spanning tree. Set T to the empty set.

Step 2 : Add the first edge to T.

Step 3 : Add the next edge to T if and only if it does not form a cycle in T. If there are no
remaining edges, exit and report G to be disconnected.

Step 4 : If T has n-1 edges (where n is the number of vertices in G), stop and output T.
Otherwise go to Step 3.
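
The four steps can be sketched in C as follows. This is a generic illustration on an undirected graph and not the routine used in the program below: the names edge, cmp_desc, find_root and max_spanning_tree are assumptions, and cycles are detected with a union-find structure.

```c
#include <stdlib.h>

struct edge { int u, v, weight; };

/* Comparison for qsort: heavier edges first. */
static int cmp_desc(const void *a, const void *b)
{
    const struct edge *x = a, *y = b;
    return (y->weight > x->weight) - (y->weight < x->weight);
}

/* Union-find lookup with path halving. */
static int find_root(int *parent, int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Sorts edges in place and writes the chosen edges to tree.
   Returns the number of edges chosen (n-1 if G is connected, fewer if it is
   disconnected), or -1 if memory cannot be allocated. */
int max_spanning_tree(struct edge *edges, int m, int n, struct edge *tree)
{
    int count = 0;
    int *parent = malloc((size_t)n * sizeof *parent);
    if (parent == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        parent[i] = i;
    qsort(edges, (size_t)m, sizeof *edges, cmp_desc);          /* Step 1 */
    for (int i = 0; i < m && count < n - 1; i++) {             /* Steps 2 to 4 */
        int ru = find_root(parent, edges[i].u);
        int rv = find_root(parent, edges[i].v);
        if (ru != rv) {            /* the edge joins two components: no cycle */
            parent[ru] = rv;
            tree[count++] = edges[i];
        }
    }
    free(parent);
    return count;
}
```

A return value smaller than n-1 corresponds to the "G is disconnected" exit of Step 3.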

OUR LOGICAL APPROACH :
We begin by taking n substrings from the user and storing them in a 2D array. The user
may enter a maximum of 10 substrings, which is the boundary condition of the program.
The program is implemented using several structures. First, once the inputs have been
read, we use the structure ‘matrix’, which records the number of overlapping characters
between each ordered pair of substrings. We also use a structure ‘edgelist’, which
represents each substring as a vertex and the number of common characters between a
pair of substrings as the weight of the edge joining them. In this way, the whole input is
represented in the form of a graph. We then use Kruskal’s algorithm to form the maximal
spanning tree, sorting the edges in non-increasing order of weight and storing the
selected edges in the structure ‘sequence’. Finally, a function is invoked which rearranges
the selected edges into a chain so that the shortest common superstring can be formed.

DATA STRUCTURES USED :


We solve this problem using the array data structure. We use a 2D array ‘IS’ to store
the input substrings and a 1D array ‘OS’ to store the shortest common superstring,
which is the required output. We chose the array as the primary data structure because
strings are most naturally represented as character arrays. String manipulation also
becomes easier, since traversing an array by index adds little overhead.

PROGRAM IMPLEMENTATION USING C-CODE :

/*Inclusion of Header Files*/

#include<stdio.h>
#include<string.h>

/*declaration of global variables*/


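char IS[11][12];/*input strings, stored at indices 1..10; each row holds up to 10 characters plus a newline and the terminating null*/
char OS[101];/*output string; large enough for 10 strings of 10 characters even with no overlap*/
int total;/*total no of substrings*/
int length;/*length of a substring*/
int edge_count=0;/*no. of matches found*/
int sequence_count=0;/*no. of matches actually considered in formation of the output string*/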

/*declaration of global structures*/


struct string_matrix/*structure which keeps the record of the overlap between each pair of substrings*/
{
    int value[11];/*indices 1..10 are used*/
}matrix[11];

struct edgelist/*structure which stores the non-zero entries of the matrix*/
{
    int u,v,weight;
}edgelist[91];/*indices 1..90: at most 10*9 ordered pairs of distinct substrings*/

struct sequence/*structure which holds the maximal spanning tree*/
{
    int u,v,weight;
}sequence[10];

struct dummy_sequence/*scratch copy used while the sequence is rearranged*/
{
    int u,v,weight;
}dseq[10];

/*declaration of global function*/


void display_sub_strings(void);
void create_matrix(void);
void display_matrix(void);
void create_edgelist(void);
void display_edgelist(void);
void arrange_edgelist(void);
void create_sequence(void);
int check_cycle(int);
void arrange_sequence(void);
void display_sequence(void);
void create_super_string(void);
void display_super_string(void);

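int main(void)
{
    int i,c;
    int max_strings=(int)(sizeof IS/sizeof IS[0])-1;/*index 0 is unused*/

    printf("ENTER THE TOTAL NO. OF SUBSTRINGS : ");
    if(scanf("%d",&total)!=1 || total<2 || total>max_strings)
    {
        printf("INVALID NO. OF SUBSTRINGS\n");
        return 1;
    }
    while((c=getchar())!='\n' && c!=EOF)/*discard the rest of the input line*/
        ;
    printf("ENTER %d SUBSTRINGS (each terminated by an enter)\n"
           "EACH SUBSTRING MUST BE OF SAME LENGTH :\n",total);
    for(i=1;i<=total;i++)
    {
        if(fgets(IS[i],sizeof IS[i],stdin)==NULL)/*fgets replaces the unsafe gets*/
            return 1;
        IS[i][strcspn(IS[i],"\n")]='\0';/*strip the trailing newline*/
    }
    length=strlen(IS[1]);

    /*initialization of string_matrix*/
    memset(matrix,0,sizeof matrix);

    display_sub_strings();
    create_matrix();
    display_matrix();
    create_edgelist();
    arrange_edgelist();
    display_edgelist();
    create_sequence();
    arrange_sequence();
    display_sequence();
    create_super_string();
    display_super_string();
    return 0;
}/*end of main*/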

/*definition of global functions*/


/*function to display each substring entered*/
void display_sub_strings(void)
{
    int i;
    printf("\nENTERED SUBSTRINGS :\n");
    printf("-------------------------------------\n");
    for(i=1;i<=total;i++)
    {
        printf("IS[%d] = ",i);
        puts(IS[i]);
    }
}/*end of function*/

/*function to create the string_matrix*/


void create_matrix(void)
{
    int i,j,l;
    length=strlen(IS[1]);
    for(i=1;i<=total;i++)
    {
        for(j=1;j<=total;j++)
        {
            if(i==j)
                continue;/*MATRIX[i][i] stays 0*/
            /*try the longest overlap first, so that the first hit is the maximum*/
            for(l=length-1;l>=1;l--)
            {
                /*match found between the last 'l' characters of i-th string and the first 'l' characters of j-th string*/
                if(strncmp(IS[i]+length-l,IS[j],l)==0)
                {
                    matrix[i].value[j]=l;
                    break;
                }
            }
        }
    }
}/*end of function*/

/*function to display the string_matrix*/

void display_matrix(void)
{
    int i,j;
    printf("\nMATRIX :\n");
    printf("--------------\n");
    printf("Here MATRIX[i][j]) = max. matching characters between two strings and\n"
           "MATRIX[i][j]) = 0 if i=j\n\n");
    printf("\t");
    for(i=1;i<=total;i++)
        printf("IS[%d]\t",i);
    printf("\n");
    for(i=1;i<=total;i++)
    {
        printf("IS[%d]\t",i);
        for(j=1;j<=total;j++)
            printf("%d\t",matrix[i].value[j]);
        printf("\n");
    }
}/*end of function*/

/*function to create the edge_list*/


void create_edgelist(void)
{
    int i,j;
    for(i=1;i<=total;i++)
    {
        for(j=1;j<=total;j++)
        {
            if(matrix[i].value[j])
            {
                edge_count++;
                edgelist[edge_count].u=i;
                edgelist[edge_count].v=j;
                edgelist[edge_count].weight=matrix[i].value[j];
            }
        }
    }
}/*end of function*/

/*function to arrange the edge_list in non-increasing order, implemented bubble sort*/


void arrange_edgelist(void)
{
    int i,j,flag=1;
    struct edgelist temp;
    for(i=1;i<=edge_count && flag;i++)
    {
        flag=0;
        for(j=edge_count;j>i;j--)
        {
            if(edgelist[j].weight > edgelist[j-1].weight)
            {
                /*swap the two neighbouring edges*/
                temp=edgelist[j];
                edgelist[j]=edgelist[j-1];
                edgelist[j-1]=temp;
                flag=1;
            }
        }
    }
}/*end of function*/

/*function to display the edge_list*/


void display_edgelist(void)
{
    int i;
    printf("\nEDGELIST :\n");
    printf("-----------------\n");
    printf("Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist\n\n");
    printf("VERTEX 1\tVERTEX 2\tEDGE\n");
    for(i=1;i<=edge_count;i++)
        printf("IS[%d]\t\tIS[%d]\t\t%d\n",edgelist[i].u,edgelist[i].v,edgelist[i].weight);
}/*end of function*/

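/*function to select the edges of the sequence in Kruskal's greedy manner, heaviest edge first*/

void create_sequence(void)
{
    int i=1,k,flag,end;
    while((sequence_count < total-1) && (i<=edge_count))
    {
        /*reject the edge if u already has an outgoing edge*/
        flag=check_cycle(edgelist[i].u);
        /*reject the edge if v already has an incoming edge*/
        for(k=1;k<=sequence_count && !flag;k++)
            if(sequence[k].v==edgelist[i].v)
                flag=1;
        /*reject the edge if the chain starting at v already ends at u*/
        end=edgelist[i].v;
        for(k=1;k<=sequence_count && !flag;k++)
            if(sequence[k].u==end)
            {
                end=sequence[k].v;
                k=0;/*restart the scan from the new end of the chain*/
            }
        if(!flag && end==edgelist[i].u)
            flag=1;
        if(!flag)
        {
            sequence_count++;
            sequence[sequence_count].u=edgelist[i].u;
            sequence[sequence_count].v=edgelist[i].v;
            sequence[sequence_count].weight=edgelist[i].weight;
        }
        i++;
    }
}/*end of function*/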

/*function used as the cycle test: reports whether vertex u is already the source of an edge in the sequence*/

int check_cycle(int u)
{
    int i;
    for(i=1;i<=sequence_count;i++)
        if(sequence[i].u==u)
            return(1);
    return(0);
}/*end of function*/

/*function to form the final sequence of substrings*/


void arrange_sequence(void)
{
    int i,j,k,head=0,is_head;
    /*the chain starts at the edge whose source vertex is no edge's destination*/
    for(i=1;i<=sequence_count && !head;i++)
    {
        is_head=1;
        for(j=1;j<=sequence_count;j++)
            if(sequence[j].v==sequence[i].u)
                is_head=0;
        if(is_head)
            head=i;
    }
    if(!head)
        return;
    dseq[1].u=sequence[head].u;
    dseq[1].v=sequence[head].v;
    dseq[1].weight=sequence[head].weight;
    /*repeatedly append the edge that leaves the vertex where the chain currently ends*/
    for(k=1;k<sequence_count;k++)
    {
        for(j=1;j<=sequence_count && sequence[j].u!=dseq[k].v;j++)
            ;
        if(j>sequence_count)
            return;/*the edges do not form a single chain; leave them unchanged*/
        dseq[k+1].u=sequence[j].u;
        dseq[k+1].v=sequence[j].v;
        dseq[k+1].weight=sequence[j].weight;
    }
    /*copy into sequence*/
    for(i=1;i<=sequence_count;i++)
    {
        sequence[i].u=dseq[i].u;
        sequence[i].v=dseq[i].v;
        sequence[i].weight=dseq[i].weight;
    }
}/*end of function*/

/*function to display the final sequence of substrings*/


void display_sequence(void)
{
    int i;
    printf("\nSEQUENCE :\n");
    printf("--------\n");
    printf("Here we Represent The Maximal Spanning Tree in Form of a List :\n\n");
    printf("VERTEX 1\tVERTEX 2\tEDGE\n");
    for(i=1;i<=sequence_count;i++)
        printf("IS[%d]\t\tIS[%d]\t\t%d\n",sequence[i].u,sequence[i].v,sequence[i].weight);
}/*end of function*/

/*function to form the shortest common string*/


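void create_super_string(void)
{
    int i,j,k;
    /*start with the whole of the first string in the sequence*/
    for(i=0,j=0;i<length;i++,j++)
        OS[j]=IS[sequence[1].u][i];
    /*append the non-overlapping tail of each following string*/
    for(i=1;i<=sequence_count;i++)
    {
        for(k=sequence[i].weight;k<length;k++,j++)
            OS[j]=IS[sequence[i].v][k];
    }
    OS[j]='\0';/*terminate the output string*/
}/*end of function*/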

/*function to display the shortest common string*/


void display_super_string(void)
{
    int i,j,count_blank=0;
    printf("\nSHORTEST COMMON SUPERSTRING :\n\n");
    puts(OS);
    printf("------------\n");
    /*printing a formatted output*/
    puts(IS[sequence[1].u]);
    for(i=1;i<=sequence_count;i++)
    {
        /*each string is shifted right by the number of characters added to the superstring so far*/
        count_blank+=length-sequence[i].weight;
        for(j=1;j<=count_blank;j++)
            printf(" ");
        puts(IS[sequence[i].v]);
    }
}/*end of function*/

/*definition of global functions finished*/

OUTPUT INSTANCE 1 :

ENTER THE TOTAL NO. OF SUBSTRINGS : 5

ENTER 5 SUBSTRINGS (each terminated by an enter)


EACH SUBSTRING MUST BE OF SAME LENGTH :
ABRAC
ACADA
ADABR
DABRA
RACAD

ENTERED SUBSTRINGS :
------------------------------------
IS[1] = ABRAC
IS[2] = ACADA
IS[3] = ADABR
IS[4] = DABRA
IS[5] = RACAD

MATRIX :
-------------
Here MATRIX[i][j]) = max. matching characters between two strings and
MATRIX[i][j]) = 0 if i=j

IS[1] IS[2] IS[3] IS[4] IS[5]


IS[1] 0 2 0 0 3
IS[2] 1 0 3 2 0
IS[3] 3 0 0 4 1
IS[4] 4 1 1 0 2
IS[5] 0 4 2 1 0

EDGELIST :
-----------------
Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist

VERTEX 1 VERTEX 2 EDGE


IS[3] IS[4] 4
IS[4] IS[1] 4
IS[5] IS[2] 4
IS[1] IS[5] 3
IS[2] IS[3] 3
IS[3] IS[1] 3
IS[1] IS[2] 2
IS[2] IS[4] 2
IS[4] IS[5] 2
IS[5] IS[3] 2
IS[2] IS[1] 1
IS[3] IS[5] 1
IS[4] IS[2] 1
IS[4] IS[3] 1

IS[5] IS[4] 1

SEQUENCE :
-------------------
Here we Represent The Maximal Spanning Tree in Form of a List :

VERTEX 1 VERTEX 2 EDGE


IS[3] IS[4] 4
IS[4] IS[1] 4
IS[1] IS[5] 3
IS[5] IS[2] 4

SHORTEST COMMON SUPERSTRING :

ADABRACADA
---------------------
ADABR
 DABRA
  ABRAC
    RACAD
     ACADA

OUTPUT INSTANCE 2 :

ENTER THE TOTAL NO. OF SUBSTRINGS : 4

ENTER 4 SUBSTRINGS (each terminated by an enter)


EACH SUBSTRING MUST BE OF SAME LENGTH :
ABCDE
BCDEF
DEFGH
CDEFG

ENTERED SUBSTRINGS :
------------------------------------
IS[1] = ABCDE
IS[2] = BCDEF
IS[3] = DEFGH
IS[4] = CDEFG

MATRIX :
-------------
Here MATRIX[i][j]) = max. matching characters between two strings and
MATRIX[i][j]) = 0 if i=j

IS[1] IS[2] IS[3] IS[4]


IS[1] 0 4 2 3
IS[2] 0 0 3 4
IS[3] 0 0 0 0
IS[4] 0 0 4 0

EDGELIST :
----------------
Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist

VERTEX 1 VERTEX 2 EDGE


IS[1] IS[2] 4
IS[2] IS[4] 4
IS[4] IS[3] 4
IS[1] IS[4] 3
IS[2] IS[3] 3
IS[1] IS[3] 2

SEQUENCE :
------------------
Here we Represent The Maximal Spanning Tree in Form of a List :

VERTEX 1 VERTEX 2 EDGE


IS[1] IS[2] 4
IS[2] IS[4] 4
IS[4] IS[3] 4

SHORTEST COMMON SUPERSTRING :

ABCDEFGH
-----------------
ABCDE
 BCDEF
  CDEFG
   DEFGH

DISCUSSION :

1. The code is implemented under certain basic assumptions:

i. Each substring entered must be of equal length.
ii. Every substring entered must have some characters in common (a non-zero
overlap) with at least one other substring.
iii. No Si is a substring of Sj, where both Si and Sj belong to the input set S.

2. Certain boundary conditions must also be maintained:

i. Each substring entered must be at most 10 characters long.
ii. A maximum of 10 substrings may be entered.
iii. The output string is stored in the fixed-size 1D array OS, whose declared size
limits the length of the superstring.

3. The output is displayed in a formatted way so that it is easier for the user to
understand the formation of the shortest common superstring.

4. Kruskal’s algorithm is generally used to compute the minimal spanning tree,
but here it is used to find the maximal spanning tree. This is possible because
the edges are processed in non-increasing order of weight: the structure ‘edgelist’
is sorted that way before the structure ‘sequence’ is built from it.
Kruskal’s algorithm starts by sorting all edges of a graph. The time
complexity of this sorting operation is O(E log E) if there are E edges in
the graph. The ‘for’ loop in the algorithm makes E iterations in the
worst case. In each iteration, the major task is to find whether the current edge
introduces a cycle. The complexity of detecting a cycle is O(log n) in the worst
case if the graph contains n vertices. Thus the overall time complexity of the
algorithm is O(E log E) + O(E log n). The program above sorts the edges with
bubble sort, so its own sorting step takes O(E^2) time.

5. This program can be further improved by using suffix trees. This can be done
by building a tree containing all suffixes of all strings of S. String Si overlaps with
Sj iff a suffix of Si matches a prefix of Sj; traversing these vertices in order of
distance from the root defines the approximate merging order.

APPLICATION OF THIS PROBLEM:


The shortest common superstring problem (SCS) has been extensively studied
for its applications in string compression and DNA sequence assembly. Although the
problem is known to be Max-SNP hard, the simple greedy algorithm performs extremely
well in practice. To explain the good performance, previous researchers proved that the
greedy algorithm is asymptotically optimal on random instances. Unfortunately, the
practical instances in DNA sequence assembly are very different from the random
instances.


