1. Introduction
One of the main consequences of the constant progress of technology, together with the massive use of computers in many aspects of our lives, has been the creation of large repositories of data storing information of all sorts. A major problem related to these huge data sets is that of discovering relevant patterns that separate the noise from important information and of deriving rules for clustering the data into classes sharing essential common features. To this aim, the fields of study known as data mining [1,2] and feature selection [3,4,5] have recently emerged among the most relevant applications of modern computer science.
In this paper we focus on some mathematical issues that arise from data mining problems. A very common situation is to represent the starting information by a two-dimensional array, in which the rows correspond to samples (or individuals) while the columns correspond to their characteristics (also called features).
If the features are Boolean, one of the tools that can be used to extract interesting information is the so-called Logical Analysis of Data (LAD) [6,7,8]. Consider for instance a data set consisting of a binary matrix of m rows and n columns, in which some rows are labeled as positive while the remaining rows are labeled as negative (for instance, in the case of a molecular biology experiment using a device called a “microarray”, which measures the level of gene expression in cells, the values 0 and 1 would indicate that the level is, respectively, “normal” or “abnormal” [9,10,11]).
The Logical Analysis of Data has the objective of discovering a set of simple Boolean formulas (or “rules”) that can be used to classify new binary vectors. Each rule describes what the value of some bits must be for a vector to be classified as positive or negative. For instance, a “positive rule” could be x_2 ∧ ¬x_5 ∧ ¬x_9, meaning that any vector with a 1 in the 2nd component and a 0 in the 5th and 9th components is classified as positive. Similarly, there can be some “negative rules” which specify which vectors should be classified as negative.
A rule such as the above can be conveniently represented by a pattern, which is a string over the alphabet {0, 1, -}. The characters 0 and 1 in the pattern specify which positions must be matched exactly by a binary vector to satisfy the rule, while the character - is a wildcard that can be indifferently matched by either 0 or 1. In particular, for instance with n = 9, the pattern corresponding to the above rule would be -1--0---0.
If r is a rule and p is the pattern corresponding to r, then a binary vector b satisfies the rule if and only if b_k = p_k for each k such that p_k ≠ -. We say that the pattern p covers all vectors b for which the above holds. In view of the equivalence of rules and patterns, we can talk of positive/negative patterns in place of positive/negative rules.
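For concreteness, here is a minimal sketch of the covering test in Python (illustrative only; we assume n = 9, matching the rule above):

```python
def covers(p: str, b: str) -> bool:
    """True iff pattern p covers vector b: b_k = p_k wherever p_k != '-'."""
    return len(p) == len(b) and all(pc in ('-', bc) for pc, bc in zip(p, b))

rule = "-1--0---0"                     # the positive rule above, for n = 9
print(covers(rule, "010000000"))       # True: classified as positive
print(covers(rule, "010010000"))       # False: the 5th component is 1
```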
The objective of LAD is to infer positive and negative patterns from the data in such a way that (i) each positive row is covered by at least one of the positive patterns, while no negative row is and (ii) each negative row is covered by at least one of the negative patterns, while no positive row is. This approach has been successfully applied to many contexts in both bioinformatics and biomedicine [8].
Since there might be many alternative sets of patterns explaining a given instance of LAD, one has to introduce a suitable criterion for choosing a specific solution. In particular, an Occam’s razor strategy would suggest seeking the simplest possible solutions, i.e., the sets with a minimum number of patterns. Finding a min-size set of patterns which covers a given set of vectors is called the Pattern Cover Minimality problem.
Other problems arising from the analysis of patterns are related to understanding whether two different sets of rules actually explain the same data set, or, in other words, whether the two pattern sets are equivalent. In particular, we would also like to know whether a given set of rules explains all possible data, and so is in some sense “useless”. On the opposite side, we would like to know whether there are some data that cannot be explained by a particular set of rules.
In addition, given a set of patterns, we would like to know whether there exists another, smaller, set of patterns that explains the same data set. This problem, which we call Pattern Equivalence Minimality, looks similar to Pattern Cover Minimality. The difference is that here we start from a pattern set and not from a data set. Although patterns can be expanded into strings, so that we might solve a Pattern Cover Minimality problem on these strings, expanding the patterns is in general computationally intractable. Hence we should be able to find a better set of patterns starting directly from the given pattern set.
In the following we will review the computational complexity of these pattern problems, which are, in general, quite complex [12] (see also [13] for a fixed-parameter analysis of some related pattern problems). We then give an integer linear programming (ILP) formulation for Pattern Cover Minimality and for Pattern Equivalence, and we assess the effectiveness of our formulations by means of extensive computational experiments. An ILP formulation for Pattern Cover Minimality is also given in Boccia et al. [14]. The formulation we propose in this paper reduces the problem to a Set Covering problem with a (low-degree) polynomial number of columns. For pattern cover models on non-binary data, a branch-and-price procedure is described in [15] and heuristic procedures are proposed in [16].
Formulating a solution procedure for Pattern Equivalence Minimality seems quite challenging since, as we will prove in the paper, this problem is NP-hard and co-NP-hard at the same time.
The paper is organized as follows. In Section 2 we provide precise mathematical definitions of the concepts we are dealing with and the related problems. In Section 3 we investigate the computational complexity of the problems defined in the previous section. In Section 4 we investigate strings and patterns that are mutually compatible. In particular, we provide a polynomial algorithm to list all patterns that are compatible with a given set of strings (the inverse problem of listing all strings compatible with a set of patterns is necessarily exponential). In Section 5 we give ILP models for both the Pattern Cover Minimality and Pattern Equivalence problems. These procedures are tested in Section 6, devoted to the computational experiments. Some conclusions are drawn in Section 7.
2. Basic Definitions
A binary string (or, simply, a string) s is a sequence of symbols where each symbol can be either 0 or 1. With n-binary string we denote a binary string of length n. An n-pattern (or simply a pattern) p is a sequence of symbols where each symbol can take the values 0, 1 or -, i.e., p ∈ {0, 1, -}^n. We call the symbol - a gap. With n-pattern we denote a pattern of length n. Notice that a string is in fact a particular pattern, i.e., a pattern without gaps. A pattern p covers, or generates, a string s if s_k = p_k for each k such that p_k ≠ -. The span of p is the set of all binary strings generated by the pattern p, i.e.,

Span(p) = { s ∈ {0,1}^n : s_k = p_k for each k such that p_k ≠ - }.

Given a set P of patterns, the span Span(P) of P is the set

Span(P) = ∪_{p ∈ P} Span(p).
Two sets P and Q of patterns are equivalent if Span(P) = Span(Q). A set of patterns P is a minimum set if |P| ≤ |Q| for each set of patterns Q equivalent to P.
We say that a pattern p is compatible with a set of strings S if Span(p) ⊆ S. Similarly, we say that a string s is compatible for a pattern p if s ∈ Span(p). We denote by C(S) the set of all compatible patterns for S. Let p be a pattern compatible with a given set S of strings. If there is no compatible pattern q ≠ p such that Span(p) ⊂ Span(q), we say that p is maximal (for S). We denote by C*(S) the subset of maximal patterns in C(S). Given a set S of strings and a set P of patterns, we say that P is compatible for S if each p ∈ P is compatible for S, and so P ⊆ C(S) and Span(P) ⊆ S.
Moreover, a set of patterns P is called a cover of S if Span(P) = S. Notice that all covers of S are equivalent to each other, and S, viewed as a set of patterns, is equivalent to each of its covers.
A set P of patterns is said to be complete if Span(P) = {0,1}^n, i.e., if P generates all possible n-binary strings. Clearly, the set consisting of the single all-gap pattern -···- is trivially complete.
We note that S is a set and so it does not contain duplicate strings. We assume this to be true also when we represent a set of m strings of length n as an array of zeros and ones.
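To make the definitions concrete, the following Python sketch (illustrative only) expands a pattern into its span and takes the span of a pattern set as the union of the individual spans:

```python
from itertools import product

def span(p):
    """All binary strings generated by pattern p (each '-' becomes 0 or 1)."""
    choices = [c if c != '-' else '01' for c in p]
    return {''.join(t) for t in product(*choices)}

def span_of_set(P):
    """Span of a set of patterns: the union of the individual spans."""
    return set().union(*(span(p) for p in P))

print(sorted(span("1-0-")))                 # ['1000', '1001', '1100', '1101']
print(span_of_set({"00", "1-"}) == {"00", "10", "11"})  # True: 01 is missing
```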
3. Computational Complexity Results
The previous definitions lead to the following decision problems [12]:
PATTERN COVER: given a set S of strings and a set P of patterns, is P a cover of S, i.e., is Span(P) = S?
PATTERN COVER MINIMALITY: given a set S of strings and a constant K, does there exist a cover P of S such that |P| ≤ K?
PATTERN EQUIVALENCE: given two sets P, Q of patterns, are they equivalent, i.e., is Span(P) = Span(Q)?
PATTERN EQUIVALENCE MINIMALITY: given a set P of patterns and a constant K < |P|, does there exist an equivalent set of patterns Q such that |Q| ≤ K?
PATTERN COMPLETENESS: given a set P of patterns, is it complete, i.e., is Span(P) = {0,1}^n?
PATTERN INCOMPLETENESS: given a set P of patterns, is it not complete, i.e., does there exist a string s ∈ {0,1}^n such that s ∉ Span(P)?
We first note that, given a set P of patterns and a string s, determining whether s ∈ Span(P) or s ∉ Span(P) is polynomial. Indeed, given a pattern p and a string s, we may check in time O(n) whether s can be generated by p or not. Hence, given a set P of patterns, we have to repeat the check for each p ∈ P. If the check is false for each p ∈ P we have s ∉ Span(P), otherwise we have s ∈ Span(P).
Proposition 1. PATTERN COVER is polynomial.
Proof. For each s ∈ S we check whether s ∈ Span(P) or not. Hence in time O(|S| |P| n) we may decide whether S ⊆ Span(P) or not. In order to decide whether Span(P) ⊆ S or not, for each pattern p ∈ P we count the number of strings in S compatible for p. Let k be the number of gaps in p. Then p is compatible for S, i.e., Span(p) ⊆ S, if and only if this count equals 2^k. Computing the count for one pattern can be done in time O(|S| n). Overall, checking whether Span(P) = S takes O(|S| |P| n) time. □
Proposition 2. PATTERN COVER MINIMALITY is NP-complete.
Proof. We observe that PATTERN COVER MINIMALITY is basically the same as MINIMUM DISJUNCTIVE NORMAL FORM (see [17], p. 261), which we repeat here for the sake of completeness: given a set U of variables, a set A of “truth assignments”, and an integer K, does there exist a disjunctive normal form expression E over U, having no more than K disjuncts, which is true for precisely the assignments in A and no others?
We show the equivalence of the two problems by the following map, which builds a PATTERN COVER MINIMALITY instance. Each element a ∈ A becomes an input binary string (with 0 representing false and 1 representing true), while each disjunct d is mapped into a pattern p such that p_i = 1 if the variable u_i appears in d, p_i = 0 if the negated variable appears in d, and p_i = - otherwise. □
Proposition 3. PATTERN INCOMPLETENESS is NP-complete.
Proof. We reduce SAT to PATTERN INCOMPLETENESS. Given a SAT instance with n variables and m clauses, we derive a set P of m patterns p^1, …, p^m (one pattern associated to each clause), as follows: for each variable i and each clause k, if the literal x_i is present in the clause, we set p^k_i = 0; if the literal ¬x_i is present in the clause, we set p^k_i = 1; and if neither x_i nor ¬x_i is present in the clause, we set p^k_i = - (note that the pattern values are set opposite to the truth values making the literals true).
Assume SAT is satisfiable and let x be a satisfying truth assignment. Define a string s as s_i = 1 if x_i is true and s_i = 0 if x_i is false. By assumption, at least one literal of each clause must be true, and so, for each k, at least one of the symbols of s corresponding to the non-gap positions of p^k must be different from the corresponding symbol of p^k, due to the particular construction of p^k. It follows that s cannot be in Span(p^k) for any k, and so s cannot be in Span(P). In a similar way, given a string s not in Span(P), we can reverse the previous reasoning and obtain a satisfying truth assignment for the SAT instance.
To see that PATTERN INCOMPLETENESS is also in NP it suffices to observe that verifying that a string does not belong to Span(P), takes polynomial time, as previously described. □
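The reduction can be sketched as follows (a toy rendering; encoding clauses as lists of signed integers, +i for x_i and −i for ¬x_i, is our own DIMACS-like convention, not part of the proof):

```python
def clauses_to_patterns(n, clauses):
    """Reduction of Prop. 3: each clause becomes one pattern, with values
    opposite to the truth values that make its literals true."""
    P = []
    for clause in clauses:
        p = ['-'] * n
        for lit in clause:
            p[abs(lit) - 1] = '0' if lit > 0 else '1'
        P.append(''.join(p))
    return P

# (x1 or x2) and (not x1 or x3): the assignment x = (1, 0, 1) satisfies it,
# and accordingly the string 101 is covered by none of the patterns below.
print(clauses_to_patterns(3, [[1, 2], [-1, 3]]))  # ['00-', '1-0']
```

A string is covered by the pattern of clause k exactly when it falsifies clause k, so the uncovered strings are precisely the satisfying assignments.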
Since PATTERN INCOMPLETENESS and PATTERN COMPLETENESS are complements of each other, we have:
Corollary 1. PATTERN COMPLETENESS is co-NP-complete.
Proposition 4. PATTERN EQUIVALENCE is co-NP-complete.
Proof. We transform PATTERN COMPLETENESS into PATTERN EQUIVALENCE. Given a set P of patterns, instance of PATTERN COMPLETENESS, the corresponding instance of PATTERN EQUIVALENCE consists of the set P plus a set Q containing only the all-gap pattern -···- (which generates {0,1}^n). Moreover, PATTERN EQUIVALENCE is in co-NP: for a no-instance, there exists a string s with s ∈ Span(P) and s ∉ Span(Q), or vice versa, and this s is a short certificate. □
Proposition 5. PATTERN EQUIVALENCE MINIMALITY is co-NP-hard.
Proof. We describe a transformation from PATTERN COMPLETENESS. Given an instance P of PATTERN COMPLETENESS, we define a corresponding instance of PATTERN EQUIVALENCE MINIMALITY by choosing K = 1. Without loss of generality, we may assume that, for each i, the symbols in position i across all patterns of P are not all identical (since, otherwise, we may discard each position where all symbols are equal and reduce the instance to an equivalent one). At this point, the only pattern that can be equivalent to P is the all-gap pattern -···-, i.e., P is equivalent to a set of cardinality 1 if and only if P is complete. □
Since PATTERN COVER MINIMALITY is a particular case of PATTERN EQUIVALENCE MINIMALITY we have by Proposition 2:
Proposition 6. PATTERN EQUIVALENCE MINIMALITY is NP-hard.
By Propositions 5 and 6, PATTERN EQUIVALENCE MINIMALITY is both NP-hard and co-NP-hard. To date it is not known whether the classes of NP-complete problems and co-NP-complete problems coincide or are disjoint. The widely believed conjecture is that they are disjoint. Following this conjecture we conclude that it is unlikely that PATTERN EQUIVALENCE MINIMALITY is in NP or in co-NP, and so we expect its complexity to be beyond the classes NP and co-NP.
It is obvious that, given a set P of patterns, generating Span(P) takes exponential time in general, for the mere fact that Span(P) can be of exponential size. It is perhaps surprising that the reverse, i.e., given a set S of strings, generating C(S), is polynomial. Indeed, it turns out that C(S) is of polynomial size and also the algorithm that generates C(S) is polynomial. We devote the next section to this issue.
4. Compatible Patterns
We describe a procedure to compute C(S), that is, the set of all compatible patterns for a set S of m strings. The analysis of this procedure shows that the number of compatible patterns is polynomial (O(m^{log₂3}) = O(m^{1.585})).
We define a recursion that produces a set Π(S) of patterns and we will show that Π(S) = C(S). We assume there are no duplicates in S. The recursive calls create string sets that satisfy this property. The length of each string in a generic set R of strings (clearly all of equal length) is denoted by n(R). For a generic set R and 1 ≤ j ≤ n(R), R[j..n(R)] is the set of strings of length n(R) − j + 1 obtained from R by taking, for each x ∈ R, only the elements x_j, …, x_{n(R)}. Furthermore, for s a string and X a set of strings, we denote by s · X the set obtained by appending s as a prefix to all strings in X.
The recursive algorithm to compute Π(S) consists of:
base case: if |S| = 1 then return S (the entire S, seen as a single string);
recursion: given an input set R of strings, let c be the first index such that there are two strings in R whose c-th elements are different (if there were no such index, all strings in R would be identical, contradicting the hypothesis of no duplicates). Hence the prefixes x_1 ··· x_{c−1} are equal for each x ∈ R. Let s be this common prefix. Let R_0, R_1 and R_- be defined as follows:
Let X_0 = {x ∈ R : x_c = 0} and R_0 = X_0[c+1..n(R)]. Then compute Π(R_0) (recursive call).
Let X_1 = {x ∈ R : x_c = 1} and R_1 = X_1[c+1..n(R)]. Then compute Π(R_1) (recursive call).
Let X_- be the set of all strings x ∈ X_0 for which there exists x′ ∈ X_1 such that x′_k = x_k for all k ≠ c (note that x and x′ differ only at the c-th element and that the strings x and x′ (if any) go in pairs due to the hypothesis of no duplicates), and let R_- = X_-[c+1..n(R)] (recursive call).
If R_- = ∅ then set Π(R_-) := ∅ (no recursive call is needed), else compute Π(R_-).
Then, return

Π(R) = s0 · Π(R_0) ∪ s1 · Π(R_1) ∪ s- · Π(R_-).
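A compact Python rendering of the recursion, with a brute-force cross-check over all 3^n candidate patterns (variable names are our own):

```python
from itertools import product

def expand(p):
    """Span of a single pattern: every gap '-' is replaced by 0 and by 1."""
    choices = [c if c != '-' else '01' for c in p]
    return {''.join(t) for t in product(*choices)}

def compatible_patterns(S):
    """All patterns over {0,1,-} whose span is contained in S (S must
    contain no duplicates), following the recursion of this section."""
    R = sorted(set(S))
    if len(R) == 1:
        return set(R)                  # base case: the string itself
    n = len(R[0])
    # first column where two strings of R differ
    c = next(i for i in range(n) if any(x[i] != R[0][i] for x in R))
    pre = R[0][:c]                     # common prefix s
    R0 = [x[c + 1:] for x in R if x[c] == '0']
    R1 = [x[c + 1:] for x in R if x[c] == '1']
    here = set(R)
    # suffixes of the strings occurring with both a 0 and a 1 in column c
    Rg = [x[c + 1:] for x in R
          if x[c] == '0' and pre + '1' + x[c + 1:] in here]
    out = {pre + '0' + q for q in compatible_patterns(R0)}
    out |= {pre + '1' + q for q in compatible_patterns(R1)}
    if Rg:
        out |= {pre + '-' + q for q in compatible_patterns(Rg)}
    return out

def brute_compatible(S):
    """Cross-check: test all 3^n candidate patterns directly."""
    n, Sset = len(next(iter(S))), set(S)
    return {''.join(p) for p in product('01-', repeat=n)
            if expand(p) <= Sset}

print(sorted(compatible_patterns({"00", "01", "10"})))
# ['-0', '0-', '00', '01', '10']
```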
Theorem 1. Let S be a string set and let Π(S) be the set produced by the recursive algorithm on S. Then Π(S) = C(S).
Proof. We use induction on |S|. Base case: if |S| = 1 the assertion is clearly true.
Inductive step: assume |S| > 1 and that the theorem holds for all sets with fewer than |S| strings. With the notation of the algorithm,

Π(S) = s0 · Π(R_0) ∪ s1 · Π(R_1) ∪ s- · Π(R_-).

Let p ∈ C(S). Since s (possibly null) is a prefix of each string in S, p must start with s as well. Assume p = sσq with σ ∈ {0, 1, -}. If σ = 0 then q is a pattern compatible for R_0 and, by induction, q ∈ Π(R_0). Therefore, p ∈ s0 · Π(R_0), so that p ∈ Π(S). A similar argument shows that p ∈ Π(S) also if σ = 1. Now, if σ = -, then for each string a generated by q, the pattern p generates both s0a and s1a. Therefore, a had to be a string both in R_0 and in R_1. This means that q had to be a pattern compatible for R_-. Since, by induction, all such patterns are in Π(R_-), then p ∈ s- · Π(R_-), so that p ∈ Π(S). This shows C(S) ⊆ Π(S).
Now, assume p ∈ Π(S), say p = sσq with σ ∈ {0, 1, -}. If σ = 0, let a = s0b be a string generated by p, with b generated by q. By induction Span(q) ⊆ R_0, so b ∈ R_0 and therefore a ∈ S, hence Span(p) ⊆ S. A similar argument holds if σ = 1. If σ = -, let a = s0b (or a = s1b) be a string generated by p. Then also s1b (respectively, s0b) is generated by p. By induction b ∈ R_-, so both strings are in S, hence Span(p) ⊆ S and p ∈ C(S). □
Theorem 2. |C(S)| = O(m^{log₂3}) = O(m^{1.585}), where m = |S|.
Proof. Let T(m) be an upper bound to the cardinality of Π(S) for a set S with |S| = m. From the previous theorem and from the algorithm’s recursive calls we see that

T(m) = 2 T(⌊m/2⌋) + T(⌈m/2⌉),  T(1) = 1,    (1)

since |R_0| + |R_1| = m, while |R_-| ≤ min{|R_0|, |R_1|} ≤ ⌊m/2⌋, and the worst case is the balanced split. By applying the Master Theorem (see p. 66 of [18]) to the relaxation T(m) ≤ 3 T(⌈m/2⌉) we get

T(m) = O(m^{log₂3}) = O(m^{1.585}).    (2)
□
These are the first values of T(m), for m = 1, …, 16, according to (1): 1, 3, 5, 9, 11, 15, 19, 27, 29, 33, 37, 45, 49, 57, 65, 81.
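A few lines of Python, assuming the recurrence T(m) = 2T(⌊m/2⌋) + T(⌈m/2⌉) with T(1) = 1, reproduce these values and confirm that T(2^k) = 3^k:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(m):
    """Bound (1): T(m) = 2*T(floor(m/2)) + T(ceil(m/2)), T(1) = 1."""
    if m == 1:
        return 1
    return 2 * T(m // 2) + T((m + 1) // 2)

print([T(m) for m in range(1, 17)])
# [1, 3, 5, 9, 11, 15, 19, 27, 29, 33, 37, 45, 49, 57, 65, 81]
print(all(T(2 ** k) == 3 ** k for k in range(12)))  # True: T(2^k) = 3^k
```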
This is the sequence A006046 as listed in the On-line Encyclopedia of Integer Sequences (OEIS), and our derivation seems to add a new meaning to that sequence. Note from (2) that, in particular, for m = 2^k we have T(m) = 3^k. Clearly, if S consists of all strings of length k, so that m = 2^k, then C(S) = {0, 1, -}^k and |C(S)| = 3^k. So the bound T(m) for |C(S)| is tight in this particular case. In fact, we can prove that the bound is tight for every m and not just for m = 2^k. Consider the following Procedure 1 that generates a set S of m binary strings.
Procedure 1
begin
  k := max(1, ⌈log₂ m⌉);
  S := ∅;
  for i := 0 to m − 1 do S := S ∪ { binary encoding of i on k bits };
  return (S);
end
For instance, for m = 3 the procedure returns S = {00, 01, 10}, for which |C(S)| = 5 = T(3) (namely the patterns 00, 01, 10, 0- and -0). It is easy to show that |C(S)| = T(m) for the sets produced by the above procedure, and so the bound is tight for all m. However, in most cases, as for covering problems, we may be interested only in maximal patterns, and we may wonder how the number of maximal patterns grows with m.
In the particular case of the strings produced by the above procedure, which corresponds to the worst case in terms of the number of generic patterns, the number of maximal patterns grows very slowly, indeed as O(log m).
In the following table we report the values of m, T(m), and the number ν(m) of maximal patterns for the above sets of strings:

m      1  2  3  4  5  6  7  8
T(m)   1  3  5  9 11 15 19 27
ν(m)   1  1  2  1  2  2  3  1

It turns out that ν(m) is equal to the number of ones in the binary representation of m. However, the number of maximal patterns can be much larger for other sets of strings: there are cases in which the number of maximal patterns exceeds m.
Although the number of maximal patterns may exceed m in general, we may show that a cover can always be obtained by fewer than m patterns if for each string there is a compatible pattern with at least one gap that covers it. Let us call a one-pattern a pattern with exactly one gap. Consider the set of all compatible one-patterns for a given set of m strings of n symbols. By assumption, each string is covered by a one-pattern (if it is covered by a pattern with one or more gaps, it is also covered by a one-pattern). Build a graph G = (V, E) with V the set of strings and E the set of string pairs spanned by a one-pattern. This graph is bipartite, because we may partition the strings into “even” and “odd” strings according to the number of ones in the string: a one-pattern necessarily spans an even string and an odd one. Moreover, there are no isolated vertices by assumption. A cover consisting of one-patterns corresponds to an edge cover of G. The cardinality of a minimum edge cover is given by |V| − M, with M the cardinality of a maximum matching. Since M ≥ 1, we need at most m − 1 one-patterns to cover all strings.
Therefore, if more than m patterns are needed to cover a set of m strings, this means that there are some special strings that, loosely speaking, cannot be explained by some rule and require a particular pattern that coincides with the string. Furthermore, the bound is obtained by considering only one-patterns. If, as we presume, the data sets are explained by more interesting patterns with many gaps, the size of a minimum cover can be significantly less than m.
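The edge cover argument can be checked directly: the span of a one-pattern is exactly a pair of strings at Hamming distance 1, so we build that graph, compute a maximum matching with a simple augmenting-path routine, and apply the Gallai identity |V| − M. A sketch, assuming every string of S has a distance-1 partner in S:

```python
def one_pattern_cover_size(S):
    """Minimum number of one-patterns (exactly one gap) needed to cover S,
    assuming every string of S has a Hamming-distance-1 partner in S.
    Uses Gallai's identity: min edge cover = |V| - max matching."""
    V = sorted(S)
    idx = {s: i for i, s in enumerate(V)}
    # an edge joins the two strings spanned by a compatible one-pattern,
    # i.e. any pair of strings of S at Hamming distance 1
    adj = {i: [] for i in range(len(V))}
    for i, s in enumerate(V):
        for j in range(len(s)):
            t = s[:j] + ('1' if s[j] == '0' else '0') + s[j + 1:]
            if t in idx:
                adj[i].append(idx[t])
    # the graph is bipartite: even vs odd number of ones
    evens = [i for i, s in enumerate(V) if s.count('1') % 2 == 0]
    match = {}                       # odd vertex -> matched even vertex

    def augment(v, seen):
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                if u not in match or augment(match[u], seen):
                    match[u] = v
                    return True
        return False

    M = sum(augment(v, set()) for v in evens)
    return len(V) - M

print(one_pattern_cover_size({"00", "01", "10"}))  # 2 (e.g. patterns 0- and -0)
```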
5. ILP Models
In this section we provide ILP models for two of the problems defined in
Section 3, namely PATTERN COVER MINIMALITY and PATTERN EQUIVALENCE. The first problem is NP-complete and the second one is co-NP-complete. So it is not inappropriate to use ILP models for their solution. On the contrary, we believe that PATTERN EQUIVALENCE MINIMALITY cannot be expressed as an ILP model. We have already stressed the fact that this problem is both NP-hard and co-NP-hard and we have also observed that, given the current state of the art in computational complexity, we believe that it does not belong to NP or co-NP. Since ILP problems belong to these classes, we have strong reasons to doubt about the possibility of solving PATTERN EQUIVALENCE MINIMALITY by ILP models.
5.1. ILP for Pattern Cover Minimality
We approach the problem of finding a pattern set P of minimum cardinality that spans a given set of strings S as a 0-1 LP set covering problem, in which each row is associated with a string of the string set, each column is associated with a compatible pattern for S, and the entry a_{ij} of the 0-1 matrix is 1 if and only if pattern j covers string i. In view of Theorem 2, the matrix has a polynomial number of columns and therefore it can be explicitly written. We note that it is not strictly necessary to generate the full matrix: we may use a column generation approach by adapting the algorithm that generates all patterns to the pricing problem, given the dual variables associated to the strings. However, in our computational experiments we have seen that generating the full matrix and then solving the problem outperforms the column generation approach, which requires running the recursive algorithm for each column generation, while only one run is necessary for generating the full matrix.
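For intuition, here is a toy Python sketch of the set covering view on a small instance; the compatible patterns are the columns, and instead of calling an ILP solver we simply search covers by increasing cardinality (viable only for tiny instances):

```python
from itertools import combinations, product

def expand(p):
    """All binary strings generated by pattern p."""
    choices = [c if c != '-' else '01' for c in p]
    return {''.join(t) for t in product(*choices)}

def min_pattern_cover(S):
    """Exact Pattern Cover Minimality by brute force: the compatible
    patterns are the columns of the covering matrix; search covers by
    increasing cardinality."""
    Sset = set(S)
    n = len(next(iter(Sset)))
    cols = [''.join(p) for p in product('01-', repeat=n)
            if expand(p) <= Sset]
    for k in range(1, len(Sset) + 1):
        for P in combinations(cols, k):
            if set().union(*(expand(p) for p in P)) == Sset:
                return list(P)

cover = min_pattern_cover({"000", "001", "010", "011", "101"})
print(cover)  # a minimum cover of size 2, e.g. ['0--', '101']
```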
5.2. ILP for Pattern Equivalence
We assume that two sets P and Q of patterns are given. We introduce two ILP models, whose optimal values we denote by v(3) and v(4). Model (3) looks for a string x ∈ Span(P) generated by as few patterns of Q as possible:

v(3) = min Σ_{q ∈ Q} y_q
s.t.  Σ_{p ∈ P} w_p ≥ 1
      x_j ≥ w_p for each p ∈ P and each j such that p_j = 1
      1 − x_j ≥ w_p for each p ∈ P and each j such that p_j = 0
      Σ_{j : q_j = 1} (1 − x_j) + Σ_{j : q_j = 0} x_j + y_q ≥ 1 for each q ∈ Q
      x ∈ {0,1}^n, w ∈ {0,1}^P, y ∈ {0,1}^Q    (3)

The first three constraint groups force x to be generated by at least one pattern of P, while the last group forces y_q = 1 whenever x is generated by q. Model (4) is obtained from (3) by exchanging the roles of P and Q. We have the following result:
Proposition 7. – Span(P) = Span(Q) if and only if v(3) ≥ 1 and v(4) ≥ 1;
– Span(P) ⊂ Span(Q) if and only if v(3) ≥ 1 and v(4) = 0;
– Span(Q) ⊂ Span(P) if and only if v(3) = 0 and v(4) ≥ 1.
Proof. We reiterate that ⊂ means strict inclusion. It is sufficient to prove that v(3) ≥ 1 if and only if Span(P) ⊆ Span(Q). The constraints of (3) imply that x is generated by at least one pattern in P, so feasible x are exactly the strings of Span(P). Consider now any q ∈ Q. If x is generated by q, then y_q = 1 is forced, while if x is not generated by q, then y_q = 0 is feasible (along with the integer value y_q = 1); the objective function forces y_q to be zero in this case. Hence the optimal value counts, for the best x ∈ Span(P), the number of patterns of Q generating x. Therefore, v(3) = 0 if and only if there is a string x ∈ Span(P) with x ∉ Span(Q), i.e., if and only if Span(P) ⊄ Span(Q). The claims follow by applying the same argument to model (4), in which the roles of P and Q are exchanged. □
Note that, if v(3) = 0, i.e., when Span(P) ⊄ Span(Q), model (3) yields also a string x in Span(P) but not in Span(Q), whereas if v(3) ≥ 1, i.e., when Span(P) ⊆ Span(Q), model (3) yields also a string x in both Span(P) and Span(Q). Similarly, if v(4) = 0, i.e., when Span(Q) ⊄ Span(P), model (4) yields also a string x in Span(Q) but not in Span(P), whereas if v(4) ≥ 1, i.e., when Span(Q) ⊆ Span(P), model (4) yields also a string x in both Span(P) and Span(Q).
We may further distinguish the case Span(P) ∩ Span(Q) = ∅ from the case Span(P) ∩ Span(Q) ≠ ∅ via the following model, which looks for a string x ∈ Span(P) generated by as many patterns of Q as possible:

v(5) = max Σ_{q ∈ Q} w_q
s.t.  Σ_{p ∈ P} u_p ≥ 1
      x_j ≥ u_p and 1 − x_j ≥ u_p for each p ∈ P and each j with p_j = 1 and p_j = 0, respectively
      x_j ≥ w_q and 1 − x_j ≥ w_q for each q ∈ Q and each j with q_j = 1 and q_j = 0, respectively
      x ∈ {0,1}^n, u ∈ {0,1}^P, w ∈ {0,1}^Q    (5)
The following proposition follows easily from Proposition 7.
Proposition 8. Span(P) and Span(Q) are disjoint if and only if v(5) = 0.
When Span(P) and Span(Q) are not disjoint, the model yields a string x shared by both sets.
If we consider model (4) and take Q = {-···-}, i.e., the set consisting of the single all-gap pattern, we are actually solving the problem FULL PATTERN COVERAGE together with its complement PARTIAL PATTERN COVERAGE. Hence (4) becomes

v(6) = min Σ_{p ∈ P} y_p
s.t.  Σ_{j : p_j = 1} (1 − x_j) + Σ_{j : p_j = 0} x_j + y_p ≥ 1 for each p ∈ P
      x ∈ {0,1}^n, y ∈ {0,1}^P    (6)

and we may conclude, as a corollary of Proposition 7, that
Proposition 9. Span(P) = {0,1}^n if and only if v(6) ≥ 1.
As a simple example of the previous results, suppose we are given two sets of patterns P and Q. We run in sequence (3) and (4) and obtain v(3) ≥ 1 and v(4) ≥ 1, which implies Span(P) = Span(Q), according to Proposition 7. In this case there is no need of running (5). As a byproduct, we obtain from (3) and (4), respectively, two strings, and one can easily check that each of them belongs to both Span(P) and Span(Q). Moreover, if we run (6) for P we obtain v(6) = 0, which implies Span(P) ≠ {0,1}^n, together with a string x ∉ Span(P) as a certificate that Span(P) does not contain all strings.
If instead we are given two sets of patterns for which running (3) and (4) in sequence yields v(3) = 0 and v(4) = 0, then we have to run also (5). If we obtain v(5) ≥ 1, this means that Span(P) and Span(Q) are not disjoint, and we may also exhibit a string x that belongs to their intersection. Moreover, from the previous runs of (3) and (4) we also have two strings, one in Span(P) but not in Span(Q), and one in Span(Q) but not in Span(P). As a final output, we may run (6) for each of the two sets and establish whether each is complete, obtaining in the negative case a string not in the corresponding span.
6. Computational Experiments
We have carried out computational experiments for PATTERN COVER MINIMALITY and PATTERN EQUIVALENCE. The problem PATTERN COVER is polynomial, and we felt no need to perform computational experiments for it. At the opposite extreme, PATTERN EQUIVALENCE MINIMALITY seems to be intractable and we have not even devised ideas of how to solve it. The problems PATTERN COMPLETENESS and PATTERN INCOMPLETENESS are particular cases of PATTERN EQUIVALENCE.
Our tests were run on a 2.3 GHz Intel Core i5 machine with 8 GB of RAM. The program was implemented in C++ and we used CPLEX 12.4 as the ILP solver.
6.1. Pattern Cover Minimality
The solution approach is the set covering formulation described in Section 5.1: we generate the full matrix of compatible patterns and solve the resulting model directly, since this outperforms the column generation alternative.
We fix the size n of each string. Each string is randomly generated by independently setting each bit to 1 with probability p (and to 0 with probability 1 − p). A random instance consists of a set S of m randomly generated strings without duplicates. We consider several values of p and of m, and for each combination of values of p and m we generate ten instances.
The strings generated with a value of p close to 0 (or, symmetrically, close to 1) tend to be similar, whereas they are much less similar for p close to 0.5. Similar instances are expected to be covered by a few patterns with many gaps, whereas non-similar instances are expected to be covered by many patterns with few gaps.
By the recursive procedure described in Section 4 we compute, for each S, all compatible patterns C(S) and solve with CPLEX the corresponding set covering problem.
The computational results are reported in Table 1. For each combination of p and m we report, for each one of the ten instances, the resulting number of compatible patterns, the optimal value of the minimal cover problem (opt), the total CPU time in seconds consisting of the pattern generation procedure plus the CPLEX run (time), and the number of nodes (root excluded) of the branch-and-bound process (#nodes). A value of #nodes equal to zero means that the solution of the LP relaxation was already integer.
As can be seen from Table 1, all instances are solved at the branch-and-bound root node, with the exception of a single combination of p and m.
6.2. Pattern Equivalence
One of the main difficulties for testing the models for PATTERN EQUIVALENCE is creating sensible instances which show that the ILP model is indeed effective.
In fact, a major objection that one might raise against the use of ILP is that, when the maximum number of gaps in the patterns is not “large enough”, a simple enumerative approach might prove quite effective, and much better than ILP, even if there are many patterns and n is quite big. Assume, for example, that we compare two sets of about 1000 patterns each, where each pattern has at most ten “-” symbols in it. Ten gaps can be expanded in 1024 ways, and so each set of patterns yields at most about 1,000,000 strings, which most computers can generate in a second. Then, we just need to check whether these two sets of strings have the same size (if not, we stop) and, if they do, we compare each element of the first to each one of the second and stop as soon as one element of the first is not in the second (perhaps by first sorting the two sets and then scanning the sorted lists). Some data structures might be more effective than others for these operations but, bottom line, it is a very fast process that ILP has a hard time beating.
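The enumerative approach just described can be sketched in a few lines (function names are our own):

```python
from itertools import product

def expand_all(P):
    """Expand every pattern of P into the set of strings it generates."""
    out = set()
    for p in P:
        choices = [c if c != '-' else '01' for c in p]
        out.update(''.join(t) for t in product(*choices))
    return out

def equivalent_by_enumeration(P, Q):
    """Naive PATTERN EQUIVALENCE test: compare the fully expanded spans.
    Fast with few gaps, hopeless when patterns contain many gaps."""
    return expand_all(P) == expand_all(Q)

print(equivalent_by_enumeration({"0-", "1-"}, {"--"}))  # True
print(equivalent_by_enumeration({"0-", "10"}, {"--"}))  # False: 11 is missing
```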
Therefore, we want to show that ILP is the way to go when the naïve approach cannot work, namely, when the patterns have so many gaps in them that a complete expansion (which exponentially increases the data size) is out of the question. This poses the problem of how to create non-trivial, interesting instances of equivalent pattern sets which have a large number of gaps.
6.3. Diagonal Instances
A simple way of creating instances with equivalent pattern sets is as follows. For every n, we consider two equivalent sets of patterns which generate all strings with the exception of the string 11···1. We call these diagonal instances.
The first set has n patterns, with a maximum number of gaps n − 1, and is the following (exemplified for n = 5):

0 - - - -
- 0 - - -
- - 0 - -
- - - 0 -
- - - - 0

The second set is a different, equivalent, set of patterns generating the same strings; one such set is obtained by fixing the position of the first 0 (again for n = 5):

0 - - - -
1 0 - - -
1 1 0 - -
1 1 1 0 -
1 1 1 1 0
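As a sanity check, the following sketch builds a diagonal set and a disjoint first-0 decomposition (our reading of the construction; both generate every string except 11···1) and verifies their equivalence by expansion for small n:

```python
from itertools import product

def expand_all(P):
    out = set()
    for p in P:
        choices = [c if c != '-' else '01' for c in p]
        out.update(''.join(t) for t in product(*choices))
    return out

def diagonal_sets(n):
    """Pattern i of the first set fixes a 0 in position i; pattern i of the
    second set fixes the *first* 0 in position i, so its spans are disjoint."""
    A = ['-' * i + '0' + '-' * (n - 1 - i) for i in range(n)]
    B = ['1' * i + '0' + '-' * (n - 1 - i) for i in range(n)]
    return A, B

for n in range(2, 9):
    A, B = diagonal_sets(n)
    everything = {''.join(t) for t in product('01', repeat=n)}
    assert expand_all(A) == expand_all(B) == everything - {'1' * n}
print("ok: both sets generate all strings except 11...1")
```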
We perform a sequence of tests to compare the ILP approach with the complete enumeration algorithm, for increasing values of n. These diagonal instances turn out to be very easy for the ILP model: they are all solved in less than 0.1 s, as can be seen from Table 2. The enumerative approach, however, becomes impractical very soon: for the larger values of n in Table 2 it takes already more than 15 min, and beyond that it does not finish within one hour (which we set as a maximum time limit). It is interesting to notice that the ILP approach solves even the largest instance in less than a second, while the enumerative approach would have to generate an exponential number of strings.
6.4. Generating Equivalent Pattern Sets in General
We can adopt the following strategy to create two sets of equivalent patterns:
We generate a small starting set P of random patterns. Each pattern is obtained by setting each bit to “-” with some probability q, to 0 with probability (1 − q)/2, and to 1 with probability (1 − q)/2. Since we are interested in patterns with many gaps, we set q to a large value (e.g., 0.8). Let g be the minimum number of gaps appearing in some pattern of P.
The patterns in P are expanded in all possible ways yielding a set S of strings.
We compute, with the recursive procedure described in Section 4 (slightly modified), the set of all patterns compatible with S which have at least g gaps each (this ensures that it is always possible to cover S with patterns in this set).
We compute two random solutions of the set covering problem. Namely, from the set of compatible patterns we select (picking patterns at random until we have a cover) two subsets that are covers of S.
The two selected pattern sets are equivalent by construction and have a fairly large number of gaps in each pattern.
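The four steps above can be sketched as follows (a toy version with brute-force pattern enumeration instead of the recursive procedure, so only viable for small n; all parameter values are illustrative):

```python
import random
from itertools import product

def expand(p):
    choices = [c if c != '-' else '01' for c in p]
    return {''.join(t) for t in product(*choices)}

def random_equivalent_covers(n=8, start=4, q=0.6, seed=1):
    """Toy version of the generation strategy described in the text."""
    rng = random.Random(seed)
    # step 1: a few random patterns, gap probability q
    P = [''.join(rng.choices('-01', weights=[q, (1 - q) / 2, (1 - q) / 2],
                             k=n)) for _ in range(start)]
    g = min(p.count('-') for p in P)
    # step 2: expand P into the string set S
    S = set().union(*(expand(p) for p in P))
    # step 3: all compatible patterns with at least g gaps
    C = [''.join(c) for c in product('01-', repeat=n)
         if ''.join(c).count('-') >= g and expand(c) <= S]
    # step 4: two random covers of S
    def random_cover():
        left, cover = set(S), []
        while left:
            p = rng.choice(C)
            if expand(p) & left:
                cover.append(p)
                left -= expand(p)
        return cover
    return random_cover(), random_cover()

P1, P2 = random_equivalent_covers()
print(set().union(*(expand(p) for p in P1))
      == set().union(*(expand(p) for p in P2)))  # True: equivalent by construction
```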
In our implementation, because of memory problems, the above procedure works only for small values of n. Thus, to build larger instances, we use a trick: we create instances starting from small instances built as above and then combine them into larger and larger ones, as explained below.
6.5. How to Boost Instances of Pattern Equivalence
One way to increase the number of gaps in the instances would be to take two sets of equivalent patterns A and B and suffix each pattern with a list of k gaps. This, however, yields very particular, uninteresting instances. In order to obtain more elaborate, hard pattern equivalence instances, we have developed the following scheme.
Given a set of patterns X, denote by
- n(X) the number of columns (i.e., the string length),
- r(X) the number of rows (i.e., of patterns),
- g(X) the maximum number of gaps in some pattern.
Furthermore, given sets of patterns A and B, denote by A ⊗ B the set of patterns

A ⊗ B = { ab : a ∈ A, b ∈ B }

obtained by concatenating each pattern of A with each pattern of B. Note that n(A ⊗ B) = n(A) + n(B), r(A ⊗ B) = r(A) r(B) and g(A ⊗ B) = g(A) + g(B). We have
Claim 1. Let A₁, A₂, B₁, B₂ be sets of patterns such that A₁ is equivalent to A₂ and B₁ is equivalent to B₂. Then A₁ ⊗ B₁ is equivalent to A₂ ⊗ B₂.
Proof. Let s ∈ Span(A₁ ⊗ B₁). We want to show that s ∈ Span(A₂ ⊗ B₂); this proves Span(A₁ ⊗ B₁) ⊆ Span(A₂ ⊗ B₂), and the other direction is symmetrical. Write s = xy with x generated by some a ∈ A₁ and y generated by some b ∈ B₁. In particular, x ∈ Span(A₁) and y ∈ Span(B₁). Since A₁ is equivalent to A₂, also x ∈ Span(A₂), and similarly y ∈ Span(B₂). Hence s = xy ∈ Span(A₂ ⊗ B₂). □
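A quick check of Claim 1 on toy sets (the ⊗ operation, written `tensor` below, is concatenation of every pattern of A with every pattern of B):

```python
from itertools import product

def expand_all(P):
    out = set()
    for p in P:
        choices = [c if c != '-' else '01' for c in p]
        out.update(''.join(t) for t in product(*choices))
    return out

def tensor(A, B):
    """A (x) B: concatenate every pattern of A with every pattern of B."""
    return [a + b for a in A for b in B]

A1, A2 = ["0-", "1-"], ["--"]      # equivalent: both span all of {0,1}^2
B1, B2 = ["-0"], ["00", "10"]      # equivalent: both span {00, 10}
T1, T2 = tensor(A1, B1), tensor(A2, B2)
print(expand_all(T1) == expand_all(T2))  # True, as stated by Claim 1
print(len(T1), len(T2))                  # 2 2: r(A) * r(B) patterns each
```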
By using this trick repeatedly, we can snowball from small instances, e.g., 4 or 5 patterns with 5 or 6 gaps each, to instances with a few hundred patterns with more than 20 gaps each.
6.6. Experiments
We have created 10 instances, each obtained by combining two smaller equivalent instances, which in turn were built with the procedure just described. The results are reported in Table 3. These instances turned out to be too difficult to be solved by the enumerative approach in less than half an hour each.
For each set of input patterns (P and Q), Table 3 reports:
the number of input patterns;
the minimum number of gaps per pattern;
the maximum number of gaps per pattern;
the average number of gaps per pattern.
The ten instances are reported in Table 3, sorted by running times (9th column). The 10th and 11th columns report the number of branch-and-bound nodes required (root node excluded) for solving models (3) and (4), respectively.
The results show that the ILP approach is effective also for instances which are large enough that they cannot be tackled by enumerative approaches.
In order to test the same ILP models in the case where the pattern sets are not equivalent, we have randomly perturbed the previous data. The running times remained practically the same.
7. Conclusions
In order to use feature selection and LAD in the analysis of binary data consisting of positive and negative samples, one has to identify which computational problems might arise and how to overcome them. One of the issues that we have addressed in this paper is in fact the computational complexity of the problems, which we have shown to be, in general, very hard. As a viable approach to the effective solution of some of these problems, we have described integer linear programming formulations. In particular, we have given ILP models for the problem of determining whether two sets of patterns are equivalent and for finding a min-size set of patterns which explains a given data set. A striking consequence of our complexity results is that there can be no simple ILP model for finding a minimal set of patterns explaining the same data set explained by a given pattern set. Developing procedures for this last problem could be a line of future research.