8. PMS Algorithms
Based on their fundamental approach, PMS algorithms fall into two categories: profile-based and pattern-based. A profile-based approach determines the starting positions of motif occurrences in the input biological sequences, whereas a pattern-based approach determines the motif sequence itself. Because the motif pattern itself is usually the object of interest, the survey in this paper is confined to pattern-based algorithms. The pattern-based approach initiates a local search from certain random candidate sequences ([14,15,16,17]). A pattern-based or profile-based approach can be either exact or approximate. Approximate approaches are generally heuristic in nature, and the guarantee of obtaining the optimal or correct solution is minimal. Exact algorithms perform an exhaustive search to guarantee the correct solution. For the most part, approximate PMS algorithms are faster and more popular than exact algorithms; however, they are restricted to scenarios where approximate motif finding is sufficient. Following the classifications used by different researchers, planted (l, d)-motif discovery algorithms fall into the following categories: local search-based algorithms, probabilistic algorithms, machine learning-based approaches (including genetic algorithms), and exact algorithms. Other approaches are also available in the literature; this paper attempts to analyze and review a few of the frequently encountered motif search algorithms under these categories.
Figure 1 depicts the architectural view of the complete classification of the search algorithms according to the type of motif search as well as the search mechanism incorporated.
- A. Probabilistic Algorithms
Probabilistic algorithms for motif search mostly apply the statistical techniques of EM (Expectation Maximization) or Gibbs sampling and their extensions ([18,19,20]).
Expectation Maximization (EM) [20] is carried out by an E-step followed by an M-step. The E-step evaluates the expected likelihood of the observed sequence data given the current parameter settings, and the M-step then updates the parameters to maximize this expected likelihood. As a local optimization approach, EM monotonically improves the expected likelihood; however, its sensitivity to the initial position gives no guarantee of convergence to a global optimum. Because of this drawback, EM-based motif search algorithms usually start from multiple initialization points to increase the chance of reaching the global optimum. These heterogeneous initializations also improve the chance of finding biologically meaningful motifs that might not coincide with the global optimum. The EM method was initially proposed for protein sequences but later became applicable to DNA sequences as well. Assuming that every sequence contains at least one common site, the model does not require any prior alignment of the sites. The EM algorithm handles the ambiguity of site positions by applying the principle of missing information, which allows both the recognition of the sites and the characterization of the binding motifs.
Gibbs Sampling [18] is an MCMC counterpart of EM and, hence, is applicable to the motif search problem when only incomplete information is available. The distinctive feature of the Gibbs sampling search procedure is that it is undirected and global: it draws random samples of the hidden variables from a parameterized distribution and then iterates by re-estimating the parameters from the randomly generated samples. However, this global search carries a notable computational cost, as many iterations are needed to converge on the likelihood surface. The assumption of a single occurrence of the motif per sequence gives the method its name of “site sampler”. As in EM, each step’s result in the Gibbs sampler depends on the previous step, defining a “Markov chain”, and the next step is chosen by random sampling rather than deterministically, defining the “Monte Carlo” part ([21,22,23,24]). The procedure extracts a length-l substring from each of the t input sequences so as to maximize their “similarity”.
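To make the iterative structure concrete, the sketch below is a minimal Gibbs “site sampler” over DNA sequences, not the original implementation: the toy sequences, pseudocount, iteration count, and scoring are illustrative assumptions.

```python
import random
from collections import Counter

def gibbs_site_sampler(seqs, l, iterations=1000, pseudocount=1.0, seed=0):
    """Minimal Gibbs 'site sampler' sketch: assumes one motif occurrence per sequence."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from random motif positions, one per sequence.
    pos = [rng.randrange(len(s) - l + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))              # sequence whose site is re-sampled
        others = [s[p:p + l] for j, (s, p) in enumerate(zip(seqs, pos)) if j != i]
        # Build a position weight matrix (profile) from the other sequences' sites.
        profile = []
        for c in range(l):
            counts = Counter(site[c] for site in others)
            total = sum(counts[a] + pseudocount for a in alphabet)
            profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
        # Score every candidate start position in sequence i under the profile,
        # then draw the new position at random, weighted by those scores.
        seq = seqs[i]
        weights = []
        for start in range(len(seq) - l + 1):
            w = 1.0
            for c in range(l):
                w *= profile[c][seq[start + c]]
            weights.append(w)
        pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return [s[p:p + l] for s, p in zip(seqs, pos)]

# Toy usage: a weak "ACGTAC"-like signal planted in random background.
seqs = ["TTGACGTACTT", "GGACGAACGGA", "CCACGTACCCA", "ATACGTTCATA"]
print(gibbs_site_sampler(seqs, l=6))
```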
MEME [25] extends the EM algorithm to search for motifs in a pool of unaligned biological sequences when little prior knowledge about the motif is available. MEME seeds the EM algorithm with subsequences of the biopolymer sequences, increasing the likelihood of reaching globally optimal motifs. It drops the assumption of a single mutated motif occurrence per sequence. To discover several distinct motifs in the same set of sequences, it incorporates a process of probabilistically eliminating the occurrences of motifs already found [19]. The algorithm receives a group of unaligned biological sequences and the motif length, and for each motif, it produces a motif model with an associated threshold value. This model is then used as a classifier for finding occurrences of each motif in other datasets and returns an alignment of the motif occurrences. Within a single database, this process can discover distinct motifs together with their numbers of occurrences.
CONSENSUS [26] finds functional relationships through the alignment of a group of related biopolymer sequences. For an already aligned set of sequences, the degree of relatedness is measured with a specific scoring scheme; when the alignment is unknown, the algorithm searches for the alignment that optimizes this scoring scheme. The work comprises four components. First, it reviews the information content, a log-likelihood scoring scheme. Second, it describes two methods for approximating the significance of an information content score by a procedure that merges numerical calculation with large-deviation statistics. Third, it describes how to determine the number of possible alignments from the overall amount of sequence data. Fourth, it describes a greedy algorithm that identifies the alignment of functionally related sequences. Finally, it verifies the correctness of the calculations with an example.
Motif Sampler, or Gibbs Sampler, [27] enhances the robustness and performance of Gibbs sampling on noisy datasets. Noise in the dataset arises from sequences that do not contain the motif or from input sequences that are large compared with the small size of the motif. The behavior of the algorithm in the presence of increasing amounts of noisy data has been verified extensively. The algorithm first estimates the number of motif instances within a probabilistic framework. The authors then modify the Gibbs sampling algorithm by introducing a higher-order background model (a transition matrix) based on an order-m Markov process. All these modifications are incorporated into the iterative procedure of Gibbs sampling.
BioProspector [28] builds upon the Expectation-Maximization (EM) method and addresses the problem of motif discovery for heterogeneous data. It combines two major characteristics of a motif's significance into one probabilistic score. The first is over-representation, which depends on the number of occurrences of the motif in each input; the second is the cross-species conservation of individual motif occurrences over the input sequences. The approach extends the Expectation-Maximization algorithm to discover motifs in given regulatory regions of co-regulated genes. It relates the species through any user-specified phylogenetic tree and evaluates a motif by relating its orthologous occurrences under a probabilistic model of evolution. The algorithm can also manage cases of incomplete heterogeneous data, which makes it suitable for applications with incomplete orthology information.
MCEMDA [29] (Monte Carlo EM Motif Discovery Algorithm) is a Monte Carlo variant of the Expectation-Maximization de novo motif search process that uses a position weight matrix (PWM). It overcomes the drawback of traditional EM of being confined to a local optimum. By carrying out stochastic sampling in the local alignment space, it makes the popular Expectation-Maximization (EM) algorithm more efficient. MCEMDA begins from one elementary model and successively applies Monte Carlo simulation, updating the parameters until convergence. To manage phase shifts and multi-modality issues, it introduces a log-likelihood profiling method. A novel GMA algorithm (Grouping Motif Alignment) clusters the population by locally aligning the candidates so as to select strong motifs successfully. MCEMDA shows excellent performance when compared with multiple sequence alignment methods.
Table 1 summarizes the probabilistic algorithms.
- B. Local Search Based Algorithms
Algorithms that incorporate a local search approach may not always be able to recover the required planted motif.
TEIRESIAS [13] is a combinatorial, local search algorithm that is able to output every pattern present in a minimum (user-defined) number of sequences. During the search, it successfully avoids considering the entire search space and thus proves to be efficient. The algorithm uses regular grammar to represent motifs and reports maximal patterns. It builds upon the observation that if an (l, w) pattern P occurs in at least k sequences, then its (l, w) sub-patterns also appear in at least k sequences, so the largest patterns can be assembled from smaller sub-patterns. TEIRESIAS works in two phases: a scanning phase and a convolution phase. The scanning phase applies a pruned exhaustive search to obtain all possible (l, w) patterns, with exactly l non-wildcard positions, that occur in at least k sequences. In the convolution phase, these elementary patterns are extended by gluing them together. The quasi-linear running time of the algorithm with respect to the volume of generated output makes it output-sensitive in nature.
WINNOWER [12] is a graph-based algorithm that finds the motif by searching for a clique in a graph. It first builds a set C of all feasible l-mers from each input sequence. It then constructs a graph in which each node corresponds to an l-mer of set C and an edge connects two l-mers from different strings if they are within 2d Hamming distance of each other. In other words, for every motif, a corresponding clique of size t exists in the graph, which reduces the motif search problem to a clique-finding problem of size t. The algorithm implicitly builds a t-partite graph in which each partition contains the l-mers (nodes) generated from one of the t sequences. WINNOWER treats all edges of the graph uniformly, without distinguishing between edges of high and low similarity, which requires considerable computational resources and thus makes it relatively slow.
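The graph formulation can be illustrated as follows. This sketch only builds the pairwise 2d-similarity graph described above (the expensive clique search is omitted), and the toy sequences and parameters are assumed for illustration.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_winnower_graph(seqs, l, d):
    """Sketch of WINNOWER's graph: one node per l-mer, an edge between l-mers
    taken from *different* sequences whose Hamming distance is at most 2d."""
    nodes = []                                   # (sequence index, l-mer)
    for si, s in enumerate(seqs):
        for p in range(len(s) - l + 1):
            nodes.append((si, s[p:p + l]))
    edges = set()
    for (i, (si, u)), (j, (sj, v)) in combinations(enumerate(nodes), 2):
        if si != sj and hamming(u, v) <= 2 * d:
            edges.add((i, j))
    return nodes, edges

# A planted (l, d) motif corresponds to a clique with one node per sequence (size t).
nodes, edges = build_winnower_graph(["ACGTACGT", "TTACGAAC", "CACGTTGA"], l=4, d=1)
print(len(nodes), "nodes,", len(edges), "edges")
```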
SP-STAR [12], proposed by Pevzner and Sze, is a memory-efficient algorithm that is faster than WINNOWER. SP-STAR uses a sum-of-pairs scoring function D(W, S) to distinguish signal patterns from random patterns and thus separates signal from noise more effectively. It is a heuristic approach that searches for a pattern W minimizing D(W, S) over the candidate space, using the same parameters that WINNOWER uses in its graph construction. By scoring the l-mers along the edges of the graph G, SP-STAR eliminates more edges per iteration than WINNOWER does.
cWINNOWER [30], proposed by Liang, is an enhancement of WINNOWER [12]. It improves performance by adding a consensus constraint to the graph construction, with the expectation of deleting spurious edges at an early stage. The consensus constraint imposes stringent conditions for deciding whether the edge under analysis should be part of the t-clique. The algorithm identifies fuzzy motifs in DNA sequences that are rich in protein-binding signals. Here, a signal refers to a short (length l) nucleotide pattern with at most d mutations from the motif. The algorithm searches for signals (motifs) whose numerous mutated copies are present in the DNA sequences with sufficient abundance. The use of the consensus constraint enables the detection of many weak signals.
The Random Projection [14] algorithm proposed by Buhler and Tompa uses a global search procedure to reach better seeds for instances that WINNOWER [12] failed to solve. It considers the motif M as an l-mer and a set C as the group of l-mers from all given sequences. For a suitable value of k (where k < l − d, but k not so small as to be unusable), it projects the l-mers of set C onto k randomly chosen positions, mapping the l-dimensional space into a k-dimensional subspace. It then treats each resulting k-mer as an integer and hashes it into the corresponding bucket out of 4^k possible buckets. If k < l − d but not too small, one bucket can contain several l-mers, and a highly enriched bucket enables the recovery of the motif. A bucket threshold s is set, and buckets containing fewer than s l-mers are eliminated; in this way, the k-mers of the motif M concentrate in the same bucket. For t sequences of length n each, the total number of l-mers is t(n − l + 1), and the number of possible k-mers is 4^k. Thus, the expected number of l-mers per bucket is t(n − l + 1)/4^k, and the threshold value s is set to twice this quantity. The random hash function h() is applied iteratively r times (for an appropriate value of r) to ensure that buckets of size greater than s are obtained at least once. The l-mers from these enriched buckets are then collected for further processing to arrive at the final answer, the motif M.
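The bucketing step can be sketched as follows; the toy data, the number of trials, and the function name random_projection_buckets are illustrative assumptions, and the downstream refinement of the enriched buckets is omitted.

```python
import random
from collections import defaultdict

def random_projection_buckets(seqs, l, k, trials=10, seed=0):
    """Sketch of the PROJECTION idea: hash every l-mer by k randomly chosen
    positions and collect the enriched buckets.  The threshold s is set (as in
    the description above) to twice the expected bucket size t(n-l+1)/4^k."""
    rng = random.Random(seed)
    t, n = len(seqs), len(seqs[0])
    s = 2 * t * (n - l + 1) / (4 ** k)           # bucket-size threshold
    enriched = []
    for _ in range(trials):
        positions = sorted(rng.sample(range(l), k))   # the random projection h()
        buckets = defaultdict(list)
        for seq in seqs:
            for p in range(len(seq) - l + 1):
                lmer = seq[p:p + l]
                key = "".join(lmer[i] for i in positions)
                buckets[key].append(lmer)
        # Keep only buckets whose size exceeds the threshold s; their l-mers
        # are candidate motif occurrences passed on to a refinement step (e.g., EM).
        enriched.extend(b for b in buckets.values() if len(b) > s)
    return enriched

cands = random_projection_buckets(["ACGTACGTAC", "TTACGTACGA", "CACGTACGTT"], l=6, k=3)
print(len(cands), "enriched buckets")
```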
MULTIPROFILER [26], proposed by Uri Keich and Pavel A. Pevzner, searches for exact motifs in DNA sequences. MULTIPROFILER generalizes the profile-based approach and enables the identification of consensus sequences that are missed by standard profile-based detection. The algorithm outperforms other major motif search algorithms on many synthetic models.
The Pattern Branching algorithm [17] is built upon local search techniques. For every l-mer of the t input sequences, this procedure searches the neighboring l-mers and scores them appropriately. Overall, there are t(n − l + 1) l-mers in the t sequences, and each l-mer has N(l, d) possible d-neighbors, where N(l, d) = Σ_{i=0}^{d} C(l, i)·3^i, giving a total of t(n − l + 1)·N(l, d) neighbors, which is very large. As a solution, for each selected l-mer, the algorithm iteratively computes the best neighbor and outputs the best-scoring result as the motif.
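A simplified sketch of the best-neighbor iteration is shown below; the total-distance scoring used here (sum over sequences of the minimum Hamming distance) is an assumed stand-in for the score used in [17], and the toy data are illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def total_distance(pattern, seqs):
    """Sum over sequences of the minimum Hamming distance between the pattern
    and any same-length substring of that sequence (lower is better)."""
    l = len(pattern)
    return sum(min(hamming(pattern, s[p:p + l]) for p in range(len(s) - l + 1))
               for s in seqs)

def best_neighbor(pattern, seqs):
    """Best pattern among all single-position substitutions of `pattern`."""
    best, best_score = pattern, total_distance(pattern, seqs)
    for i in range(len(pattern)):
        for c in "ACGT":
            if c != pattern[i]:
                cand = pattern[:i] + c + pattern[i + 1:]
                score = total_distance(cand, seqs)
                if score < best_score:
                    best, best_score = cand, score
    return best

def pattern_branching(seqs, l, d):
    """For every l-mer of the input, apply d best-neighbor steps and keep the winner."""
    best, best_score = None, float("inf")
    for s in seqs:
        for p in range(len(s) - l + 1):
            a = s[p:p + l]
            for _ in range(d):
                a = best_neighbor(a, seqs)
            score = total_distance(a, seqs)
            if score < best_score:
                best, best_score = a, score
    return best

print(pattern_branching(["ACGTACGTAC", "TTACCTACGA", "CACGTACGTT"], l=6, d=1))
```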
Profile Branching [17] is the profile-based version of the Pattern Branching algorithm. It searches for the motif in motif profile space instead of motif pattern space, extending Pattern Branching in several ways. It first converts every sample string Si into a profile X(Si) and generalizes the scoring process to score profiles. It also revises the branching method of Pattern Branching so that it applies to profiles, applying it iteratively k times (for a suitable value of k) to each l-mer of the input sequences. It then executes the Expectation-Maximization algorithm to converge to a top-scoring profile. The pattern-based version is about five times faster on some challenging instances, but it fails to find motifs with numerous degenerate positions, which the profile-based version can handle.
MotifCut [31] uses a graph-theoretic approach to the motif search problem. It constructs a graph whose nodes are substrings and whose edges connect similar substrings, and then finds the maximum density subgraph (MDS) in polynomial time. MotifCut proceeds in two steps. First, it converts the given sequences into a group of k-mers, where overlapping and duplicate k-mers are treated as distinct, and builds the set of vertices. Second, it connects every pair of vertices with an edge and assigns it a weight that reflects the similarity (number of mismatches) between the two k-mers. It then looks for a set of vertices with a high degree of pairwise similarity to construct the subgraph representing the set of motifs. An important characteristic of MotifCut is that this representation is decidedly different from the frequently used PWM models, giving rise to significantly different motif discovery.
Table 2 summarizes motif discovery algorithms based on local search approach.
- C. Machine Learning Based Algorithms
A collection of machine learning-based motif search approaches is available in the literature. Among these, the genetic algorithm approach stands out on the grounds of its frequent and popular usage [2]. The local search techniques of standard machine learning do not reach a globally optimal solution. Although it belongs to the family of machine learning approaches, the genetic algorithm flourishes because it carries out a global search and reaches an acceptable global solution in genuinely less time.
FMGA [32] exploits the genetic algorithm strategy to output better exact motifs in less computation time than Gibbs Sampler [27] and MEME [25]. It first computes and assigns a fitness score, using a defined match function, to each length-l substring. Then, for every sequence, it sums the individual fitness scores to compute the Total Fitness Score (TFS). This step continues iteratively until the solution converges to an optimal state. During convergence, the next-generation candidate motifs are collected in two ways: from the set of candidates with high TFS values and through a weighted-wheel selection process. On the selected candidate motifs, it applies mutation using the position weight matrix to produce parental patterns. Subsequently, to extract the optimal children of the new generation, it applies crossover to the parental patterns with uncertain-code penalties. The random pattern generation in FMGA helps the search avoid getting stuck in a local minimum.
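As an illustration of the fitness computation described above, the sketch below scores a candidate motif by summing a per-sequence match score; the specific match function (motif length minus the minimum Hamming distance) and the toy sequences are assumptions, not FMGA's exact formula.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def total_fitness_score(candidate, seqs):
    """TFS-style fitness: for each sequence take the best (highest) match score of the
    candidate against any same-length substring, then sum over all sequences."""
    l = len(candidate)
    tfs = 0
    for s in seqs:
        best = max(l - hamming(candidate, s[p:p + l]) for p in range(len(s) - l + 1))
        tfs += best
    return tfs

# Candidates with higher TFS are more likely to survive into the next generation.
print(total_fitness_score("ACGTAC", ["TTGACGTACTT", "GGACGAACGGA", "CCACGTACCCA"]))
```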
GAMOT [33] is a competent genetic algorithm that generates optimal solutions in minimal time for both short and long motifs in large motif-finding problems. To achieve linear time and space complexity, GAMOT employs an additional motif-finding step prior to the evolutionary process so as to create an extremely fit initial population. This preliminary motif-finding step is the pivotal attribute of GAMOT; it uses the total distance to the respective consensus as the scoring function. Taking advantage of this scoring function, it deduces the most frequently occurring nucleotide at each motif position out of the 4^l feasible patterns. The similarity between the discovered consensus and the actual motif is very high, and owing to this, GAMOT reduces the search space from 4^l patterns to (n − l)t consensus strings. As the initial step, GAMOT extracts the strings with good fitness and treats them as the initial population. It then iteratively applies the following steps until it converges to the best, satisfactory individuals: it selects two distinct consensus strings from the current population through linear ranking and creates the next candidate consensus strings using a two-point crossover. In each iteration, it replaces weak individuals with the newly created ones.
GAME [34] uses a genetic algorithm framework for extracting the optimal motif. Unlike FMGA, GAME uses the standard genetic algorithm operators for the motif search problem together with two extra operators, SHIFT and ADJUST. The initial population is generated by random selection of candidates, which minimizes the time and space needed for its creation. The simulation works with greater flexibility, focusing only on the relevant region of the solution space within the entire population. GAME demonstrates better performance than BioProspector [28] and MEME [25] on various real data experiments, and an enhanced form of GAME performs better on multiple motif types.
GEMFA [35] is an EM-based genetic algorithm that integrates the heuristic search strategy of the genetic algorithm with the EM motif discovery method. This hybrid algorithm intelligently guides the EM algorithm to escape easily from locally minimal solutions. One part of GEMFA operates as a function optimizer by maintaining multiple local alignments according to each individual's maximum likelihood and reducing the MDL (minimum description length) value in every iteration of the genetic approach. The search begins with a population of multiple local alignments encoded as chromosomes. It then gradually evolves new populations through the conventional reproduction steps of crossover, mutation, and selection until the optimal or best population is reached.
MOGAMOD [36] is a multi-objective genetic algorithm designed to locate the best motifs in the given biological sequences. The backbone of MOGAMOD is the multi-objective algorithm NSGA-II [37], which is popular for its strong performance. The approach maintains a set of non-dominated solutions as the initial population. From this population, the algorithm extracts the best motifs by formulating the search as three conflicting optimization objectives: maximizing similarity, maximizing motif length, and maximizing the support of the candidate motif. The performance of the algorithm is further improved by allowing flexibility in the choice of the similarity measure used for motif discovery.
The GARPS [38] motif search process combines the stochastic projection policy of random projection with the global search ability of genetic algorithms. GARPS first applies the random projection policy to the input sequences, using a position weight function and a hash function h(s) formulated from it. This process generates an initial population of worthy candidates carrying a dense signal for the genetic algorithm iterations. The random projection forms k-mers from the respective l-mers by picking the consensus characters at k randomly chosen positions and designing the hash function h(s) from them. Based on the outcomes of h(s), the l-mers of the individual sequences are hashed into their respective buckets, and the initial population is obtained from the enriched buckets of candidate motifs. From this initial population, the refined candidate motif is obtained through the genetic algorithm iterations. The experimental results and the global search strategy of GARPS prove its efficiency and robustness compared with the projection algorithm. Among genetic algorithm approaches, GARPS is known for handling challenging instances and weak motifs.
The AMDILM algorithm [39] proposes a motif discovery technique that outperforms GARPS on simulated sequences as well as on real biological sequences. The algorithm begins with 64 distinct initial candidates of length three each. Then, iteratively, it applies three genetic algorithm operators, mutation, addition, and deletion, in sequence, based on the TFS (Total Fitness Score) values, to grow the candidates to the desired motif length. For every candidate, the TFS value is computed by taking, over the 64 candidates, the minimum of their corresponding maximum distances from the input biological sequences. In every epoch of the iteration, the following three phases are performed with the genetic algorithm operators until the candidate motif reaches length l. In the first phase, the mutation operator chooses one position of each candidate and randomly replaces it with one of the remaining nucleotides. In the second phase, the addition operator inserts a random nucleotide at the beginning and at the end of the candidate separately, producing two new candidates of length l + 1, and preserves the candidate with the higher TFS value. Finally, in the third phase, the deletion operator removes the added nucleotide to carry forward the next iteration. To avoid falling into a local optimum, the AMDILM process starts its iterations from a diverse variety of initial populations.
GENMOTIF [40] (genetic algorithm to discover flexible motifs) is a genetic-framework-based algorithm for time series motif discovery. It can be adjusted comfortably to many situations, such as searching over a range of segment lengths, working with several dimensions, applying uniform scaling, and using multiple compatible grouping criteria. It has two intuitive parameters which, once set within reasonable limits, do not noticeably affect the performance of the algorithm; for this reason, GENMOTIF is described as a parameter-friendly algorithm.
MHABBO [41] is an improved multi-objective biogeography-based optimization (BBO) algorithm that uses a differential evolution (DE) approach to discover motifs in DNA sequences and has obtained excellent results. Its fitness function builds upon the information distribution within the habitat individuals. In each generation, the algorithm iteratively changes the migration probability and mutation probability based on the relationship between the average function cost and the fitness function cost. The algorithm redefines the fitness function based upon the information distribution and the Pareto dominance relation, and the differential evolution (DE) concept is combined with a modified mutation procedure. It improves the migration operators, which depend on the number of iterations, to satisfy the requirements of motif discovery. Furthermore, immigration and emigration rates based on a cosine curve are modified to generate promising candidate solutions. The MHABBO algorithm performs better in terms of the quality of the final solutions.
The KEGRU [42] model predicts motif binding sites with a Recurrent Neural Network (RNN) rather than a convolutional neural network. It integrates a bidirectional Gated Recurrent Unit (GRU) network with k-mer embedding. The model works in three phases. First, it divides the DNA sequence into k-mers of a specific length and stride. Second, it uses the word representation algorithm word2vec, treating every k-mer as a word and pre-training the corresponding embeddings. Finally, it constructs a deep-learning GRU model for feature learning and classification. Experimental evidence validates it as an inexpensive and fast motif extraction model.
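The k-mer splitting of the first phase can be sketched as follows; the k and stride values are illustrative, and the word2vec embedding and GRU layers of the later phases are omitted.

```python
def kmerize(sequence, k=5, stride=2):
    """Split a DNA sequence into overlapping k-mers ('words') using a fixed stride,
    as in the first phase of the KEGRU-style pipeline described above."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

# Each sentence of k-mer 'words' would then be fed to a word2vec-style embedding
# and a bidirectional GRU classifier in the later phases.
print(kmerize("ACGTACGTTGACCA", k=5, stride=2))
```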
DESSO [43] is a deep-learning-based motif finding framework that performs better than existing state-of-the-art tools. It identifies cis-regulatory motifs and TFBSs using a deep neural network combined with a binomial distribution model. The performance of DESSO is extended by the integration of DNA shape features. In addition to prediction, DESSO can identify and analyze structural binding sites by integrating the deep-learning framework with DNA binding complexity. The experimental results demonstrate its ability to identify previously unknown motifs and their shape factors.
The hierarchical LSTM and attention network method [44] extracts the interdependencies among various DNA, RNA, and protein sites. Instead of emphasizing motif discovery/prediction, it focuses on the regions of the DNA or RNA binding sites through an attention mechanism. On this basis, it is able to find the optimal combination of k-mer stride window, k-mer length, and sentence length, and, through hyper-parameter experiments, it also tunes the optimization function.
Table 3 summarizes motif discovery algorithms based on the machine learning approach.
- D. Exact Algorithms
The beauty of exact PMS algorithms lies in the guarantee of reporting all the real (l, d) motifs present in the given biological data. However, the NP-complete nature of the problem makes their worst-case execution time exponential in some of the parameters. For some specific problem types, a Polynomial Time Approximation Scheme (PTAS) [45] exists, which gives polynomial-time execution in the worst case. Most of the known exact algorithms are evaluated on a standard random benchmark generated as follows: twenty input sequences of length 600 each are created at random from the alphabet Σ = {A, T, G, C}; then a motif M of length l is created and a copy of it, with at most d mutations, is inserted into every input sequence to ensure its presence in all of them. Depending on the values of l and d, some instances of PMS are recognized as challenging. A planted (l, d) instance is termed challenging if the expected number of motifs occurring by random chance is one or more. For example, the challenging instances are (9, 2), (11, 3), (13, 4), (15, 5), (17, 6), (19, 7), (21, 8), (23, 9), (25, 10), and (26, 11). In the literature, it is customary to compare the performance of exact PMS algorithms specifically on these challenging instances.
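The notion of a challenging instance can be made concrete with the standard back-of-the-envelope estimate sketched below, following the usual analysis of the random benchmark (n = 600, t = 20 as described above); an instance is challenging when the expected count reaches one or more.

```python
from math import comb

def expected_random_motifs(l, d, t=20, n=600):
    """Expected number of (l, d) motifs that occur purely by chance in t random
    DNA sequences of length n each."""
    # Probability that a fixed l-mer matches a random l-mer within d mismatches.
    p = sum(comb(l, i) * (3 ** i) for i in range(d + 1)) / (4 ** l)
    # Probability that it occurs (within d mismatches) somewhere in one sequence.
    q = 1 - (1 - p) ** (n - l + 1)
    # Expected number of l-mers (out of 4^l) that occur in all t sequences.
    return (4 ** l) * (q ** t)

for l, d in [(9, 2), (11, 3), (13, 4), (15, 5)]:
    print((l, d), round(expected_random_motifs(l, d), 2))
```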
In the way they obtain a motif, exact algorithms are categorized into two approaches: sample-driven and pattern-driven. An exact algorithm can use either the pattern-driven or the sample-driven concept, or in some cases a combination of both. From the given t sequences, the sample-driven approach enumerates all t(n − l + 1) l-mers, computes their common neighborhood, and outputs the result as the motif. The pattern-driven approach, on the other hand, enumerates all 4^l possible l-mers and outputs those that occur with maximum frequency in all input sequences as the motif. The space complexity of the sample-driven approach is higher than that of the pattern-driven approach. Algorithms that combine the approaches first extract the l-mers from multiple input sequences using the sample-driven approach and then generate the common d-neighborhood using the pattern-driven approach. Most biologists prefer pattern-driven algorithms because of their lower space complexity.
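Both the pattern-driven step and the common d-neighborhood computation rely on enumerating the d-neighborhood of an l-mer; a straightforward recursive sketch is shown below.

```python
def neighbors(lmer, d, alphabet="ACGT"):
    """All strings within Hamming distance d of `lmer` (its d-neighborhood)."""
    if d == 0:
        return {lmer}
    if len(lmer) == 0:
        return {""}
    result = set()
    for suffix in neighbors(lmer[1:], d):
        # Keep the first character unchanged.
        result.add(lmer[0] + suffix)
    for suffix in neighbors(lmer[1:], d - 1):
        # Or substitute the first character, spending one mismatch.
        for c in alphabet:
            if c != lmer[0]:
                result.add(c + suffix)
    return result

# For |Sigma| = 4 the neighborhood size is the sum over i <= d of C(l, i) * 3^i.
print(len(neighbors("ACGTA", 1)))   # 1 + 5*3 = 16
```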
The data structures and common methods adopted by exact algorithms include pattern enumeration, search trees, suffix trees, tries, mismatch trees, linked lists, graphs, hash functions, and so on. In the pattern enumeration approach, feasible candidate motifs are enumerated from the input sequences, and the search then proceeds by testing how likely each candidate is to be a motif; the algorithms in this group differ mainly in how the candidate motifs are enumerated. The algorithms in this category include PMS0 to PMS3 [46], PMSi and PMSP [47], Improved Pattern-driven [48], Stemming [49], PMS4 to PMS6 [50,51,52], PairMotif [53], etc. The search tree-based approach constructs a depth-d search tree for a candidate l-mer by changing its characters position-wise in each iteration. Then, hoping to find the motif, it traverses the search tree in depth-first order, processing each node encountered on the traversal path. The algorithms in this group are PMSprune [54], qPMSprune [55], PAMPA [56], PMS3P [57], Provable [58], and qPMS7 [55]. Among the special tree-based approaches (suffix tree, trie, and mismatch tree), the suffix tree is the most popular. The suffix tree approach applies some pre-processing on the input data to organize the data better and to accelerate the motif search. It builds a single suffix tree from the input sequences while extending the motif length from zero to l. For every individual l-mer, a corresponding unique suffix tree can be built that produces a distinct depth-order traversal sequence; from these sets of traversal sequences, the search process extracts the commonly found sequences and reports them as motifs. The algorithms in this category are SPELLER [59], SMILE [60], MITRA [61], CENSUS [62], PSMILE [63], RISO [64], RISOTTO [65], EXMOTIF [66], SLI-REST [67], etc. Within this group, MITRA [61] and CENSUS [62] use the mismatch tree and trie data structures, respectively, rather than the suffix tree. The Voting [68] algorithm uses a hashing-based search process. Some recent algorithms incorporate a combination of the pattern-driven and sample-driven approaches to minimize the computation time; PMS8 [69] and qPMS9 [70] are the algorithms in this category.
The PMS0 [46] algorithm receives t input sequences of length n each and constructs three groups of l-mers, namely C, C′, and C″, as follows. It collects all possible t(n − l + 1) l-mers into group C, all (n − l + 1) l-mers of the first sequence into group C′, and all d-neighbors of each l-mer of set C′ into C″. It then evaluates the Hamming distance between l-mer u and l-mer v for u ∈ C and v ∈ C″ and reports as motifs those l-mers v that lie within Hamming distance d of some l-mer in every input sequence. For |Σ| = 4, the time complexity of PMS0 is O(n²tl·N(l, d)), where N(l, d) = Σ_{i=0}^{d} C(l, i)·3^i is the number of d-neighbors of an l-mer.
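A compact sketch of the PMS0 scheme just described is given below; it repeats the d-neighborhood helper for self-containment, and the toy sequences are illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def neighbors(lmer, d, alphabet="ACGT"):
    if d == 0:
        return {lmer}
    if not lmer:
        return {""}
    out = set()
    for suf in neighbors(lmer[1:], d):
        out.add(lmer[0] + suf)                                  # keep first character
    for suf in neighbors(lmer[1:], d - 1):
        out.update(c + suf for c in alphabet if c != lmer[0])   # or mutate it
    return out

def pms0(seqs, l, d):
    """PMS0 sketch: candidates C'' are the d-neighbors of the first sequence's
    l-mers (C'); a candidate is a motif if every sequence contains an l-mer
    within Hamming distance d of it."""
    first = seqs[0]
    c_prime = {first[p:p + l] for p in range(len(first) - l + 1)}
    c_double = set().union(*(neighbors(x, d) for x in c_prime))
    motifs = set()
    for v in c_double:
        if all(any(hamming(v, s[p:p + l]) <= d for p in range(len(s) - l + 1))
               for s in seqs):
            motifs.add(v)
    return motifs

print(pms0(["ACGTACGTAC", "TTACCTACGA", "CACGTACGTT"], l=6, d=1))
```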
The PMS1 [46] algorithm takes advantage of radix sort to sort groups of l-mers. It constructs the set Ci by collecting all l-mers u of input sequence Si, for 1 ≤ i ≤ t, and the set Li by collecting all neighbors v of those l-mers u that lie within Hamming distance d. The total numbers of l-mers in the sets Ci and Li are t(n − l + 1) and t(n − l + 1)·N(l, d), respectively. Using radix sort, it then sorts the l-mers of each Li and removes the redundant neighbors. Finally, it successively intersects consecutive sets, Li+1 = Li ∩ Li+1 for 1 ≤ i < t, to figure out the common neighbors.
The improved pattern-driven algorithm [48] enhances the basic pattern-driven approach by guaranteeing that the optimal motif is returned. Using the fundamental pattern-driven approach, it enumerates the 4^l candidate motifs in a computation time that is independent of the length of the given sequences, and the complexity of the search is thereby reduced by a factor of n. The search time of the algorithm is O(4^l·l·t) for t input sequences. The search is accomplished in two steps. In the first step, an exhaustive search is applied to the 4^l candidate motifs, which simultaneously allows invariant positions and disallows mismatches among the residues. In the second step, the surviving candidate motifs go through a refinement process with a defined flexibility value to obtain the required motif. This algorithm can handle motifs with higher l and d values than the basic approach.
PMSi [47] improves the common neighborhood generation of PMS1 [46] by computing pairwise relationships among the neighbors instead of handling the entire mass of candidate neighbors at once. For every ith sequence (1 ≤ i ≤ t), it collects the neighbors of all of its l-mers into a distinct group Li. Then, for every pair of sequences S2i−1 and S2i, it intersects the corresponding lists pairwise, obtaining L′i = L2i−1 ∩ L2i for i = 1 to t/2. At the end of the process, it intersects all the resulting lists L′i and generates the final motif set M. The pairwise approach to common neighbor finding in PMSi substantially reduces both the number of steps (computation time) and the mass of candidate motifs (space requirement) of the original PMS1 [46] algorithm.
PMSP [47] deploys the idea of building the neighborhood of every l-mer of sequence S1 and verifying, for each of these neighbors, its feasibility of being an (l, d) motif. To achieve this, it iteratively computes the set of neighbors Z of every l-mer x ∈ S1, and for each l-mer y ∈ Z, it checks whether every one of the t sequences contains an l-mer within Hamming distance d of y. Later, in [54], the search is further reduced by the key observation that instead of checking the distance of y from every l-mer of all the sequences, it is sufficient to check only those l-mers that are within distance 2d of x. PMSP uses O(tn²) space and takes O(tn²·N(l, d)) time, where N(l, d) is the number of neighbors of any l-mer x. Although its worst-case time complexity is higher than that of the previous algorithms, it can solve the challenging instances (15, 5) and (17, 6) in a reasonable time.
PMSprune [54] goes one step beyond PMSP by including several original strategies. Analogous to PMSP, PMSprune generates the neighbors of each l-mer x of the first sequence and then checks their feasibility of being valid motifs. However, the neighbor generation of PMSprune incorporates an efficient pruning technique that remarkably reduces the search space. In a branch and bound manner, it generates the neighbors of the l-mer x by constructing a tree T(x) of height d. It then traverses T(x) in depth-first order, and for every l-mer y ∈ T(x), it computes d(y, S), the maximum over the given sequences S of the minimum Hamming distance at which y occurs in each sequence. It outputs the set of l-mers y whose d(y, S) values are at most d and prunes the nodes of the tree whose own d(y, S) value, as well as those of all their descendants, exceed d. This pruning process distinguishes the algorithm from the others and also exceptionally reduces its space complexity.
Stemming [49] uses a novel neighborhood generation process that reduces the computational search space by a factor related to the alphabet size |Σ|. The search starts with the generation of the candidates' neighborhoods, followed by intersection of the neighborhoods to form a set C whose members have at most m mismatches among themselves. This set C is initially considered a superset containing a mix of motifs and some non-motifs. The set C can be characterized by wildcards, or stems, which are expressed through the motif length (l), the Hamming distance (d), and the maximum number of mismatches (m) only. Because the stems are independent of the alphabet size, the computational search complexity is cut down to depend only on the size of the candidate set C.
PMS4 [50] is a speedup technique that can accelerate any PMS search process based on the following two-step procedure: step one extracts a group of candidate motifs from the given sequences, and step two verifies the candidate motifs against the characteristics of a motif to obtain the true motifs. PMS4 speeds up an algorithm when it is fused with it. First, it performs the extraction using the method of the corresponding algorithm on k preferred sequences and labels the result as the set C (this set C certainly contains all actual (l, d) motifs along with non-motif candidates). Then, it verifies the validity of each candidate of the set C for being an authentic (l, d) motif in O(tnl) time. The selection of the value of k is the key factor behind the speedup, and its value differs from one fusion to another.
PMS5 [51] is a fusion of the PMS1 [46] and PMSprune [54] algorithms. It expands the idea of PMS1 by incorporating the neighbor generation technique of the PMSprune [54] algorithm. It receives a set of t input sequences and iteratively searches for (l, d) motifs in triplets of sequences S1, S2i, S2i+1 for 1 ≤ i ≤ (t − 1)/2. From each triplet it forms groups of three l-mers (x, y, z), where x ∈ S1, y ∈ S2i, and z ∈ S2i+1, and, using the concept of PMSprune, it runs the subroutine Bd(x, y, z) to explore the common neighborhoods of this l-mer group. The beauty of the subroutine Bd(x, y, z) lies in blending the tree construction process of PMSprune with a pre-processing step based on Integer Linear Programming (ILP). The ILP pre-processing makes the extraction of the common neighborhood faster than in competing algorithms. However, the use of ILP intensifies the space requirement due to the use of a lookup table.
PMS6 [52] is a refinement of PMS5. It skillfully reduces the search time and the space requirement by introducing improved pre-processing steps and a hashing technique for the lookup table, respectively. The distinctiveness of PMS6 compared with PMS5 lies in how the l-mer x of sequence S1 is treated in the motif extraction process. The search is accomplished in two steps. In step one, it forms triplets of l-mers (x, y, z), as in PMS5, and five equivalence classes C(n1, …, n5) according to how the nucleotides agree at the corresponding l-mer positions; each triplet instance is placed into its class according to the computed values of n1 to n5. In step two, the set of motifs among l-mer x ∈ S1, l-mer y ∈ S2i, and l-mer z ∈ S2i+1, for 1 ≤ i ≤ (t − 1)/2, is determined by the principles of the equivalence classes.
The PairMotif [53] algorithm efficiently decreases the motif search space by introducing a pairing concept. It pairs candidate l-mers from different input sequences that are relatively far apart and then extracts the motifs by iterating over the following three phases. In the first phase, it carefully selects pairs of l-mers that are within 2d Hamming distance, taking one l-mer from sequence S1 and the other from sequence Sr for 2 ≤ r ≤ t; the value of r is enumerated in a restricted way. In the second phase, two filtering techniques are applied to the selected l-mer pairs to cut down the space of l-mers to be processed in the subsequent stages; this phase passes on to the next phase only those candidate l-mers that are, or tend to be, motifs. Finally, the third phase computes the common neighbors among the filtered candidates by verifying whether they are motifs. The empirical results prove its stability and efficiency in handling sequences of various lengths compared with the existing algorithms of that period.
The algorithms that use a special data structure, such as the suffix tree, trie, or mismatch tree, are described as follows.
SPELLER [59] is the first algorithm to introduce the suffix tree data structure to speed up the motif search process. From the given input sequences, it builds a suffix tree in lexicographic order that yields a unique pattern when traversed from the root to any leaf node. Some pre-processing is applied to the suffix tree to handle gaps among the candidate motifs. The selection of motifs from the suffix tree is then done in two steps: step one pulls out duplicate motifs, and step two extracts the common motifs. Step one determines the availability of common patterns, within some allowed distance, in at least q (the quorum constraint) sequences out of the t sequences. Step two determines the motifs by verifying the feasibility of the generic patterns becoming motifs. The space complexity of SPELLER is O(nt²/w), with a time requirement of O(nt²·N(l, d)), where w is the word length and N(l, d) is the number of neighbors that any l-mer can have. The introduction of the suffix tree makes it easy to organize and pre-process the input data.
The SMILE [60] algorithm represents all suffixes of the given sequences through a generalized suffix tree rather than the classical suffix tree. A distinctive feature of the generalized suffix tree is that it assigns a unique termination symbol to each input sequence and uses one bit vector per node to specify which sequences are described by the path from the root to that node. Using a recursive DFS traversal, the algorithm traverses the suffix tree T as if it were a classical lexicographic suffix tree, intending to spell out every feasible length-l motif. Similar to SPELLER, it applies the initial search phase on a quorum q of the input sequences. The algorithm performs a bitwise OR on the bit vectors of the nodes encountered on the motif search path, i.e., the set of nodes from the root to the current node, and depending on the ones present in the resulting bit vector, it reports the motif. In every step of the depth-first traversal, it prunes the nodes that lack the required features or lie below the threshold value and backtracks to survey a new valid path.
The MITRA [61] algorithm exploits the pairwise similarity of l-mers by combining the approach of WINNOWER with the suffix tree-based theory of SMILE. It constructs a mismatch tree of depth l in which each internal node has degree |Σ|, each branch being labeled by a distinct symbol from Σ. The motif search space is split into subspaces at each level, from the root downward, with each subspace corresponding to a unique prefix; these subspaces are described by their path labels and keep track of the l-mers that match the prefix with at most d mismatches. During traversal, the algorithm reports the patterns P that occur within d Hamming distance in at least the quorum number of input sequences, or in the corresponding input subspaces. If any subspace is predicted to be weak during the search, that subspace is pruned from the root and the search backtracks to the next path; otherwise the subspace is explored further down. Because the pruning technique allows only subspaces that can contain valid motifs, the depth-l mismatch tree at the end reports only the valid motifs.
The CENSUS [62] parallel algorithm addresses the space cost of the generalized suffix tree by eliminating the per-node bit vector storage. It constructs t tries (lexicographic trees) for the t input sequences by compactly encoding the corresponding l-mers in O(lnt) time. Then, exploiting the fact that multiple motifs can share a common prefix, the algorithm iteratively explores potential motifs only for the possible motif prefixes, avoiding unnecessary processing. The encoding of the tries records the count of motif occurrences rather than the size of the motif. Owing to the one-to-one correspondence between the tries and the input sequences, CENSUS uses a distributed-memory parallelization technique that eliminates any sharing of information.
PSMILE [63], the parallel SMILE algorithm, efficiently extracts structured motifs using the fundamental suffix tree. PSMILE parallelizes the SMILE technique by introducing P equal-sized buckets with some allowable errors and distance intervals between consecutive buckets. Since the motif, i.e., the content of a bucket, is unknown at the initial stage of the algorithm, the bucket space is represented through tries (lexicographic trees); for some exceptional instances, however, the algorithm uses the suffix tree of the given sequences. The optimality of PSMILE is achieved through an additional balancing technique, i.e., balancing the motif extraction over the available processors. In this balanced partitioning approach, the search space is split evenly and distributed among the individual loosely coupled processing units, and thus the speedup achieved is proportional to the number of processing units.
The RISO [64] algorithm extends the SMILE algorithm with a twofold process. In the first step, a suffix tree called the factor tree is built only up to a certain level l instead of using the complete suffix tree of the entire input sequences, which specifically reduces the storage requirement of SMILE. In the second step, it proposes a data structure called the box link, which caches the information required to pass from one box to another through the link. The process concludes by thoroughly extracting the subsequent motifs while partially or temporarily modifying the factor tree. Contrary to SMILE, it saves time and reduces the search space without using a pruning technique in the second step.
RISOTTO [65], based on the idea of PMS0, improves the efficiency of RISO by coupling the box-link data structure with the suffix tree data structure. For the given sequence S1, it builds suffix trees describing the d-neighborhood of every l-mer and explores them in a depth-first manner. In the search process, it skips the exploration of a new node once the quorum is satisfied or the maximal length is reached. RISOTTO reduces the computation time of RISO by avoiding conflicting candidates through a mechanism that caches maximal extensibility information; however, supplementary space is required to store this extensibility information.
EXMOTIF [66] is a well-known structured motif algorithm that outperforms the RISO algorithm [64] in both approximate and exact matching. It also surpasses the RISOTTO algorithm [65] by reporting the real occurrences of a structured motif rather than the relative number of occurrences, as RISOTTO does. It operates with a variant of the suffix tree data structure consisting of an inverted index of symbol positions, which allows structured motifs to be enumerated through positional joins over the index.
SLI-REST [67] (Suffix Links on Internal nodes, a Reverse Engineering Suffix Tree method) applies a reverse engineering approach to the suffix tree and its links. It takes a tree built over all input sequences, performs some modifications on it, and searches for a word whose suffix tree is isomorphic to the input tree. It constructs the suffix tree in linear time by incorporating suffix links (edges) on all the internal nodes and investigates the suffix tree through a novel reverse engineering process. Before employing this approach, the authors first present the constraints required for a candidate tree and then define a bicolored directed graph, with proper labelling on its edges, for every internal node that is the direct parent of leaf nodes. Lastly, the algorithm explores the feasible Eulerian routes in the graph whose traversed edge labels meet the required conditions and outputs a word that realizes the given suffix tree and links.
PAMPA [56], a branch and bound algorithm, enhances PMSprune [54] by managing the motif search space more efficiently. Its theoretical time and space requirements are the same as those of PMSprune, but in practice they scale down to about half for the challenging patterns. It describes Bd(x), the d-neighborhood of an l-mer x, by introducing extended l-mers with “wildcards” that can represent any symbol of the alphabet. The distance measure of PMSprune is accordingly refined in PAMPA for extended l-mers and is used in the evaluation of the d-neighborhood Bd(x).
PMS3P [57] efficiently integrates the idea of PMS3 with the features of PMSprune to handle the challenging motif instances. Using the splitting concept of PMS3, it splits each l-mer of the first sequence into two parts u and v of sizes l1 and l2, respectively. Utilizing the tree construction and pruning concepts of PMSprune, it then constructs a tree T(u) of height h for the l1-mer u and a tree T(v) of height d − h for the l2-mer v. It explores both trees T(u) and T(v) individually in a DFS manner while applying the pruning technique. During the pruning of T(u) and T(v), it extracts the neighbors that are at Hamming distance h and d − h, respectively, from all input sequences and saves them in the groups Q′ and Q″, respectively. Then, the corresponding groups Q′ and Q″ of every ith input sequence are merged to obtain a list Ai, and finally, all the Ai lists are intersected to return the final planted motifs. Though its computation time exceeds that of previous algorithms such as PMSP and PMSprune, its proficiency in handling the challenging instances is very high.
The Provable [58] algorithm employs the extended closest string (ECS) problem, in place of the simple closest string problem, in a recursive manner to extract the planted motif. Taking advantage of two closest string algorithms, it extracts the center string from the group of input strings with some acceptable substitutions. The authors demonstrate its correctness through experiments on challenging instances over real datasets. The beauty of the algorithm lies in its extended version, named the center substring algorithm, which is powerful enough to explore all solutions to the problem: for every given sequence, the extended algorithm can find all possible center substrings and, with some modification, can be transformed into a fast exact algorithm for planted motif search.
qPMSprune [55] is the first efficient exact algorithm to solve the planted quorum (l, d, q) motif problem for challenging instances with larger l and d values. A planted quorum (l, d, q) motif problem searches for length-l planted motifs that occur, within at most d Hamming distance, in at least q of the input sequences. Among the t input sequences, qPMSprune selects one particular sequence Si, where 1 ≤ i ≤ (t − q + 1), and an l-mer u of Si such that one of its neighbors is M and M is an (l, d, q − 1) planted quorum motif of the given input sequences excluding Si. The algorithm processes every l-mer of Bd(u) to identify the (l, d, q − 1) motifs, exploiting the pruning method of PMSprune [54].
qPMS7 [55] is a robust algorithm in the group of quorum planted motif discovery algorithms. It realizes and extends the idea of qPMSprune as follows: if there exist two sequences Si and Sj, with 1 ≤ i ≠ j ≤ t, and l-mers u of Si and v of Sj such that M ∈ {Bd(u) ∩ Bd(v)}, then M can be an (l, d, q − 2) planted quorum motif of the remaining sequences. As an extension, it therefore scrutinizes such (i, j) sequence pairs by constructing a graph Gd(u, v) for the l-mers u and v. It then explores the graph Gd(u, v) in a depth-first manner to recognize all elements of Bd(u) ∩ Bd(v) that can be (l, d, q − 2) planted motifs.
The Voting [68] algorithm comprises two algorithms: a simple voting algorithm and an improved projected voting algorithm. The observation behind simple voting is that, for a motif M and its d-neighborhood N(M, d), if mi is a neighbor of M, then M is certainly a member of N(mi, d). Based on this observation, the search lets every l-mer u of each input sequence cast one vote for each member of its neighborhood, and an l-mer that receives a vote from every input sequence is declared a motif. A hash table manages and records the votes received by each l-mer, and by tracking them the algorithm outputs as motifs the l-mers having votes from every sequence. However, the computation time and storage requirement of the simple voting algorithm grow with the values of l and d. To overcome this drawback, the authors extend simple voting to the improved projected voting algorithm, reducing the set of l-mers through random projection, and improve it further by considering only selected positions consistent with the previously attempted positions.
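A minimal sketch of the simple voting scheme described above: every l-mer of every sequence casts one vote for each member of its d-neighborhood (at most once per sequence), and candidates voted for by all t sequences are reported. The toy data are illustrative.

```python
from collections import defaultdict

def neighbors(lmer, d, alphabet="ACGT"):
    if d == 0:
        return {lmer}
    if not lmer:
        return {""}
    out = set()
    for suf in neighbors(lmer[1:], d):
        out.add(lmer[0] + suf)
    for suf in neighbors(lmer[1:], d - 1):
        out.update(c + suf for c in alphabet if c != lmer[0])
    return out

def voting(seqs, l, d):
    """Simple voting sketch: a hash table counts, for every candidate l-mer,
    how many distinct sequences cast a vote for it."""
    votes = defaultdict(int)
    for s in seqs:
        voted = set()                   # each sequence votes at most once per candidate
        for p in range(len(s) - l + 1):
            for cand in neighbors(s[p:p + l], d):
                if cand not in voted:
                    voted.add(cand)
                    votes[cand] += 1
    return [cand for cand, v in votes.items() if v == len(seqs)]

print(voting(["ACGTACGTAC", "TTACCTACGA", "CACGTACGTT"], l=6, d=1))
```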
PMS8 [69] incorporates a novel idea of neighbor generation to provide an exact and efficient solution to the planted (l, d) motif search. It establishes necessary and sufficient conditions on three l-mers for them to have common neighbors. The algorithm executes two steps: first, it applies the sample-driven approach to generate the sets of l-mers, and then it applies the pattern-driven approach to extract the patterns present in all sequences. In the sample-driven step, it builds a matrix R of size t × (n − l + 1) consisting of all l-mers of all sequences, such that the ith row of R corresponds to the l-mers of the ith sequence. It then randomly chooses one l-mer u from the first row, pushes it onto a stack, and removes from R all l-mers v whose Hamming distance from u is greater than 2d. It repeats this process for each row with one twist: it removes every l-mer v that cannot belong to the group of common neighbors. The transition from the sample-driven step to the pattern-driven step happens when the size of the stack reaches a threshold value. In the pattern-driven step, it generates the common neighborhood of the l-mers on the stack and checks for the existence of any planted (l, d) motif in that group.
qPMS9 [70] is the most efficient recent quorum parallel planted motif search algorithm, and it significantly improves the computation time of PMS8 [69]. It extends PMS8 in several ways: first, it introduces a novel string reordering mechanism that conspicuously improves performance by enabling a new pruning of the motif search space; in addition, it handles quorum (qPMS) challenging instances, which PMS8 does not address. The instances (28, 12) and (30, 13) were first solved by this algorithm in reasonable time on a single-core machine.
Table 4 summarizes motif discovery algorithms based on exact motif finding approach.