Shortest Common Superstring
PROBLEM DESCRIPTION :
The shortest common superstring problem takes as input several strings, possibly of
different lengths, and finds the shortest string that contains every input string as a
substring. This is helpful in genome sequencing, since it allows researchers to
reconstruct entire coding regions from a collection of overlapping fragments. The
shortest common superstring also arises in other applications, including sparse matrix
compression. Suppose we have an (n x m) matrix in which most of the elements are zero.
We can partition each row into (m / k) runs of k elements each and construct the shortest
common superstring S' of these runs. The problem is thereby reduced to storing the
superstring, plus an (n x m / k) array of pointers into the superstring denoting where
each of the runs starts. Accessing a particular element M[i,j] still takes constant time, but
there is a space saving when |S'| << mn.
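The constant-time access described above can be sketched in C as follows. This is only an illustration under stated assumptions: k divides m, the superstring is stored as a flat array of int, and the pointer array is stored row by row. The names sparse_get, superstring and run_start are hypothetical and do not come from the program in this report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: each row of the matrix (cols entries) is cut into
   cols / k runs of k elements. run_start[i * (cols / k) + r] is the offset
   in superstring at which run r of row i begins. */
int sparse_get(const int *superstring, const size_t *run_start,
               size_t cols, size_t k, size_t i, size_t j)
{
    assert(k > 0 && cols % k == 0 && j < cols);
    size_t runs_per_row = cols / k;
    size_t offset = run_start[i * runs_per_row + j / k]; /* where the run starts */
    return superstring[offset + j % k];                  /* element inside the run */
}
```

Each lookup costs one read of the pointer array and one read of the superstring, so the access time does not depend on n, m or the length of S'.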
INPUT DESCRIPTION :
Given a set of n strings, S = {S1,...,Sn}, we want to find the shortest string s that contains
every Si as a substring.
OUTPUT DESCRIPTION :
The output of this problem is the shortest common superstring of the given set of
substrings, followed by the substrings themselves, each printed on its own line and
shifted to the right so that it lines up with the place where it matches the superstring.
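The shifted display can be sketched as below. This is a minimal sketch, not the program's own display routine: it assumes the superstring is already known and places each substring at its first occurrence found by strstr, whereas the program may derive the shifts from the merge order. The names shift_of and print_shifted are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the position of piece inside super, or -1 if it does not occur. */
int shift_of(const char *super, const char *piece)
{
    assert(super != NULL && piece != NULL);
    const char *hit = strstr(super, piece);
    return hit ? (int)(hit - super) : -1;
}

/* Prints the superstring, then each piece indented to the column of its
   first occurrence. Returns the number of pieces that were found. */
int print_shifted(const char *super, const char *const pieces[], int n)
{
    int found = 0;
    printf("%s\n", super);
    for (int i = 0; i < n; i++) {
        int shift = shift_of(super, pieces[i]);
        if (shift < 0)
            continue;                    /* not a substring: nothing to align */
        printf("%*s%s\n", shift, "", pieces[i]);  /* shift blanks, then the piece */
        found++;
    }
    return found;
}
```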
ASSUMPTIONS :
We assume that no Si in S is a substring of another Sj in S. This problem is
NP-hard: no polynomial-time algorithm for it is known, and the running time of exact
methods grows exponentially with the input size, so large instances cannot be solved
exactly in a reasonable amount of time.
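If the input does not already satisfy this assumption, it can be enforced by a preprocessing pass. The sketch below is an assumption about how that could be done and is not part of the program in this report: the hypothetical function drop_contained copies to out every string that is not contained in another one, keeping only the first copy of duplicates.

```c
#include <assert.h>
#include <string.h>

/* Copies to out the strings of in that are not substrings of another input
   string. Of several equal strings only the first is kept. Returns the
   number of strings written to out (out needs room for n pointers). */
int drop_contained(const char *const in[], int n, const char *out[])
{
    int kept = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int redundant = 0;
        for (int j = 0; j < n && !redundant; j++) {
            if (i == j || strstr(in[j], in[i]) == NULL)
                continue;
            /* in[i] occurs in in[j]: drop it unless the two are equal and
               in[i] is the earlier copy */
            redundant = strcmp(in[i], in[j]) != 0 || j < i;
        }
        if (!redundant)
            out[kept++] = in[i];
    }
    return kept;
}
```

Dropping such strings does not change the answer, because any superstring of the longer string already contains the shorter one.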
TECHNIQUES THAT CAN BE APPLIED:
ALGORITHM:
Step 2 : For every pair of strings, compute the overlap using the brute-force algorithm or
the Knuth-Morris-Pratt algorithm, and identify the pair with maximum overlap.
Step 3 : Replace the pair of strings with maximum overlap by their merged string, and
repeat until only one string remains.
Step 4 : Output the superstring on one line, followed by the substrings, each shifted
appropriately to the right so that it lines up with its match.
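The merging steps above can be sketched as follows, using the brute-force overlap test. This is an illustrative sketch, not the listing from this report: the limits MAX_STRINGS and MAX_LEN and the function names overlap and greedy_superstring are assumptions, and ties between equally good pairs are broken by scan order.

```c
#include <assert.h>
#include <string.h>

#define MAX_STRINGS 10   /* assumed limit on the number of inputs */
#define MAX_LEN 256      /* assumed limit on the total input length */

/* Length of the longest suffix of a that is also a prefix of b. */
int overlap(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    for (int k = la < lb ? la : lb; k > 0; k--)
        if (strncmp(a + la - k, b, (size_t)k) == 0)
            return k;
    return 0;
}

/* Repeatedly replaces the pair with maximum overlap by its merge.
   Writes the result to out (at least MAX_LEN bytes), returns its length. */
int greedy_superstring(const char *const in[], int n, char *out)
{
    char s[MAX_STRINGS][MAX_LEN];
    size_t total = 0;

    assert(n >= 1 && n <= MAX_STRINGS);
    for (int i = 0; i < n; i++)
        total += strlen(in[i]);
    assert(total < MAX_LEN);               /* every merge fits in one buffer */
    for (int i = 0; i < n; i++)
        strcpy(s[i], in[i]);

    while (n > 1) {
        int bi = 0, bj = 1, best = -1;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (i == j)
                    continue;
                int k = overlap(s[i], s[j]);
                if (k > best) { best = k; bi = i; bj = j; }
            }
        strcat(s[bi], s[bj] + best);       /* append the non-overlapping tail */
        if (bj != n - 1)
            strcpy(s[bj], s[n - 1]);       /* reuse the freed slot */
        n--;
    }
    strcpy(out, s[0]);
    return (int)strlen(out);
}
```

For the five strings of Output Instance 1 (ABRAC, ACADA, ADABR, DABRA, RACAD) this sketch returns a superstring of length 10. Because of tie-breaking, the string itself need not be the one printed by the program.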
To solve the above problem using the Travelling Salesman Problem (TSP), we have to do the
following operations:
ALGORITHM TSP:
Step 1 : First, enumerate all (n -1)! possible solutions, where n is the number of input strings.
Step 2 : Next, compute the cost of every one of these (n -1)! solutions and select the
solution with minimum cost.
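The exhaustive search can be sketched as below. This is a sketch under assumptions: it treats a solution as an ordering of the strings (an open path, so it enumerates all n! orderings rather than (n -1)! closed tours), it takes the cost of an ordering to be the length of the resulting superstring, and it relies on the assumption that no input string is a substring of another. The names ov, best_overlap and shortest_length are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_STRINGS 10   /* assumed limit, keeps the n! search small */

static int ov[MAX_STRINGS][MAX_STRINGS]; /* ov[i][j]: overlap of s[i] then s[j] */

/* Length of the longest suffix of a that is also a prefix of b. */
static int overlap(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    for (int k = la < lb ? la : lb; k > 0; k--)
        if (strncmp(a + la - k, b, (size_t)k) == 0)
            return k;
    return 0;
}

/* Extends the partial ordering ending in last by every unused string and
   returns the largest total overlap that can still be collected. */
static int best_overlap(int n, int last, unsigned used, int count)
{
    int best = 0;
    if (count == n)
        return 0;
    for (int next = 0; next < n; next++) {
        if (used & (1u << next))
            continue;
        int gain = (last >= 0 ? ov[last][next] : 0)
                 + best_overlap(n, next, used | (1u << next), count + 1);
        if (gain > best)
            best = gain;
    }
    return best;
}

/* Length of the shortest common superstring: total length of the inputs
   minus the largest total overlap over all orderings. */
int shortest_length(const char *const s[], int n)
{
    int total = 0;
    assert(n >= 1 && n <= MAX_STRINGS);
    for (int i = 0; i < n; i++) {
        total += (int)strlen(s[i]);
        for (int j = 0; j < n; j++)
            ov[i][j] = (i == j) ? 0 : overlap(s[i], s[j]);
    }
    return total - best_overlap(n, -1, 0u, 0);
}
```

Minimising the superstring length is the same as maximising the total overlap between consecutive strings, which is why the search works on the overlap table only.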
3. THE SET COVER ALGORITHM APPROACH :
Using the set cover method, we obtain a 2Hn factor approximation algorithm, where Hn is
the n-th harmonic number.
Given input S = {S1,...,Sn}, we construct a string rijk for every pair Si and
Sj in S, where k is the maximum overlap between the two: rijk is Si followed by Sj with
the k overlapping characters written only once. Let R be the set of all such strings.
For a string v, let sub(v) = {s in S | s is a substring of v}. The sets of the set cover
instance C are sub(v) for all v in S U R, and the cost of sub(v) is the length of v.
Step 1 : Use the greedy set cover algorithm to find a cover for the instance C.
Step 2 : Recover the strings v1,...,vk that correspond to the sets selected by the
algorithm, so that sub(v1) U ... U sub(vk) is the cover for C.
Step 3 : Concatenating the strings v1,...,vk gives a superstring whose length is within
a factor 2Hn of the shortest superstring.
ALGORITHM :
One method for computing the maximum weight spanning tree of a network G,
due to Kruskal, can be summarized as follows.
Step 1 : Sort the edges of G into decreasing order by weight. Let T be the set of edges
comprising the maximum weight spanning tree. Initially T is empty.
Step 3 : Add the next edge to T if and only if it does not form a cycle in T. If there are no
remaining edges, exit and report G to be disconnected.
Step 4 : If T has n−1 edges (where n is the number of vertices in G), stop and output T.
Otherwise go to step 3.
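The steps above can be sketched in C as follows. This is a minimal sketch rather than the program's own routine: it assumes vertices are numbered from 0 to n-1, uses qsort for the sorting step and a simple disjoint-set structure for the cycle test, and the names Edge, MAX_VERTICES, find_root and max_spanning_tree are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_VERTICES 64  /* assumed limit on the number of vertices */

typedef struct { int u, v, weight; } Edge;

static int parent[MAX_VERTICES];

/* Root of the component containing x (with path halving). */
static int find_root(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* qsort comparator: heavier edges first. */
static int by_weight_desc(const void *a, const void *b)
{
    const Edge *x = a, *y = b;
    return (y->weight > x->weight) - (y->weight < x->weight);
}

/* Sorts edges in place, writes the chosen edges to tree and returns how
   many were chosen: n - 1 if G is connected, fewer if it is disconnected. */
int max_spanning_tree(Edge *edges, int m, int n, Edge *tree)
{
    int count = 0;
    assert(n >= 1 && n <= MAX_VERTICES);
    for (int i = 0; i < n; i++)
        parent[i] = i;                       /* every vertex starts alone */
    qsort(edges, (size_t)m, sizeof *edges, by_weight_desc);
    for (int i = 0; i < m && count < n - 1; i++) {
        int ru = find_root(edges[i].u);
        int rv = find_root(edges[i].v);
        if (ru == rv)
            continue;                        /* edge would close a cycle */
        parent[ru] = rv;                     /* join the two components */
        tree[count++] = edges[i];
    }
    return count;
}
```

A return value smaller than n - 1 corresponds to the case in Step 3 where the edges run out and G is reported to be disconnected.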
OUR LOGICAL APPROACH :
We begin by reading ‘n’ substrings from the user and storing them in
a 2D array. The user may enter a maximum of 10 substrings, which is the upper limit
of the program. The program is built around several structures.
First, once the inputs have been read, the structure ‘matrix’ records the number of
common characters between each pair of substrings. The structure ‘edgelist’ then
represents each substring as a vertex and the number of common characters
between a pair of substrings as the weight of the edge joining them. In this way, the whole
input is represented as a weighted graph. Next, we use Kruskal’s algorithm to form the
maximal spanning tree, with the help of the structure ‘sequence’, which stores the edges
in non-increasing order of weight.
Finally, a function is invoked which rearranges the vertices so that the
shortest common superstring can be formed.
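Filling the overlap table can be sketched as below, here with the Knuth-Morris-Pratt failure function instead of a brute-force comparison. This is a sketch under assumptions and not the program's create_matrix routine: indices start at 0, an entry is read as "suffix of row string matches prefix of column string", and the names kmp_overlap, fill_overlap_matrix, MAX_STRINGS and MAX_LEN are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_STRINGS 10   /* matches the program's limit of 10 substrings */
#define MAX_LEN 64       /* assumed limit on the length of one substring */

/* Longest suffix of a that is also a prefix of b, found by running a
   through the KMP automaton of b. */
int kmp_overlap(const char *a, const char *b)
{
    int fail[MAX_LEN];
    int lb = (int)strlen(b);
    int k = 0;

    assert(lb > 0 && lb < MAX_LEN);
    fail[0] = 0;
    for (int i = 1; i < lb; i++) {           /* failure function of b */
        while (k > 0 && b[i] != b[k])
            k = fail[k - 1];
        if (b[i] == b[k])
            k++;
        fail[i] = k;
    }
    k = 0;
    for (const char *p = a; *p; p++) {       /* scan a, tracking the match */
        if (k == lb)
            k = fail[k - 1];                 /* whole of b matched: fall back */
        while (k > 0 && *p != b[k])
            k = fail[k - 1];
        if (*p == b[k])
            k++;
    }
    return k;                                /* matched prefix at the end of a */
}

/* m[i][j] = overlap of s[i] followed by s[j], and 0 on the diagonal. */
void fill_overlap_matrix(const char *const s[], int n,
                         int m[MAX_STRINGS][MAX_STRINGS])
{
    assert(n >= 0 && n <= MAX_STRINGS);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            m[i][j] = (i == j) ? 0 : kmp_overlap(s[i], s[j]);
}
```

Each entry is computed in time linear in the two string lengths, where the brute-force comparison can need quadratic time.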
struct sequence /*structure which holds the maximal spanning tree*/
{
    int u,v,weight;
}sequence[10];
struct dummy_sequence
{
    int u,v,weight;
}dseq[10];
int main(void)
{
    int i,j;
    /*initialization of string_matrix; the indices run from 1 to 10, so this
      assumes matrix[] and value[] are declared with at least 11 entries*/
    for(i=1;i<=10;i++)
        for(j=1;j<=10;j++)
            matrix[i].value[j]=0;
    display_sub_strings();
    create_matrix();
    display_matrix();
    create_edgelist();
    arrange_edgelist();
    display_edgelist();
    create_sequence();
    arrange_sequence();
    display_sequence();
    create_super_string();
    display_super_string();
    return 0;
}/*end of main*/
void display_matrix(void)
{
    int i,j;
    printf("MATRIX : \n");
    printf("--------------\n");
    /*adjacent string literals are joined, so the message can span two lines*/
    printf("Here MATRIX[i][j]) = max. matching characters between two strings and\n"
           "MATRIX[i][j]) = 0 if i=j\n");
    /*column headings*/
    for(i=1;i<=total;i++)
        printf("IS[%d] ",i);
    printf("\n");
    for(i=1;i<=total;i++)
    {
        printf("IS[%d] ",i); /*row heading*/
        for(j=1;j<=total;j++)
            printf("%d ",matrix[i].value[j]);
        printf("\n");
    }
}/*end of function*/
                edgelist[j-1].weight=temp;
                temp=edgelist[j].u;
                edgelist[j].u=edgelist[j-1].u;
                edgelist[j-1].u=temp;
                temp=edgelist[j].v;
                edgelist[j].v=edgelist[j-1].v;
                edgelist[j-1].v=temp;
                flag=1;
            }
            j--;
        }
    }
}/*end of function*/
return(1);
return(0);
}/*end of function*/
    printf("--------\n");
    printf("Here we Represent The Maximal Spanning Tree in Form of a List : \n");
    printf("VERTEX 1 VERTEX 2 EDGE\n");
    for(i=1;i<=sequence_count;i++)
        printf("IS[%d] IS[%d] %d\n",sequence[i].u,sequence[i].v,sequence[i].weight);
}/*end of function*/
OUTPUT INSTANCE 1 :
ENTERED SUBSTRINGS :
------------------------------------
IS[1] = ABRAC
IS[2] = ACADA
IS[3] = ADABR
IS[4] = DABRA
IS[5] = RACAD
MATRIX :
-------------
Here MATRIX[i][j]) = max. matching characters between two strings and
MATRIX[i][j]) = 0 if i=j
EDGELIST :
-----------------
Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist
IS[5] IS[4] 1
SEQUENCE :
-------------------
Here we Represent The Maximal Spanning Tree in Form of a List :
ADABRACADA
---------------------
ADABR
 DABRA
  ABRAC
    RACAD
     ACADA
OUTPUT INSTANCE 2 :
ENTERED SUBSTRINGS :
------------------------------------
IS[1] = abcde
IS[2] = bcdef
IS[3] = defgh
IS[4] = cdefg
MATRIX :
-------------
Here MATRIX[i][j]) = max. matching characters between two strings and
MATRIX[i][j]) = 0 if i=j
EDGELIST :
----------------
Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist
SEQUENCE :
------------------
Here we Represent The Maximal Spanning Tree in Form of a List :
ABCDEFGH
-----------------
ABCDE
 BCDEF
  CDEFG
   DEFGH
DISCUSSION :
3. The output is displayed in a formatted way so that it is easier for the user to
understand the formation of the shortest common superstring.
4. Kruskal’s algorithm is generally used to compute the minimal spanning tree,
but here it is used to find the maximal spanning tree. This is possible because
the structure ‘sequence’ used here stores the edges in non-increasing order of weight.
Kruskal’s algorithm starts by sorting all edges of the graph. The time
complexity of this sorting operation is O(ElogE) if there are ‘E’ edges in
the graph. The ‘for’ loop in the algorithm makes ‘E’ iterations in the
worst case. In each iteration, the major task is to find whether the current edge
introduces a cycle. The complexity of detecting a cycle is O(log n) in the worst
case if the graph contains ‘n’ vertices, for example with a disjoint-set structure
that uses union by rank. Thus the overall time complexity of the
algorithm is O(ElogE) + O(Elogn), which is O(ElogE) for a connected graph, since
then n <= E + 1.
5. This program could be improved further by using suffix trees: build a tree
containing all suffixes of all strings of S. String Si overlaps with
Sj iff a suffix of Si matches a prefix of Sj; traversing these vertices in order of
distance from the root defines the appropriate merging order.
BIBLIOGRAPHY :