4.1. Exploiting Label Group Correlation
To exploit label group correlation, this subsection introduces a graph-based method that distinguishes the relevant labels by grouping strongly related labels together and separating weakly related ones into different groups. The process involves two fundamental steps: (1) constructing an undirected graph over the labels, and (2) partitioning the graph to create distinct label groups.
In the first step, OSLGC constructs an undirected graph that captures the correlation among all labels, thus providing additional information for streaming feature evaluation. For this purpose, it is necessary to quantify the correlation between labels.
Definition 6. Given the label set $L = \{l_1, l_2, \ldots, l_m\}$, where $l_i(x)$ represents the value of label $l_i$ with respect to instance $x$, the correlation between labels $l_i$ and $l_j$ is defined by Equation (6). Obviously, if $l_i$ and $l_j$ are independent, then their correlation is 0; otherwise, it is greater than 0.
Using Equation (6), the label correlation matrix is obtained; its form is shown below.
Based on this matrix, the weighted undirected graph of the label correlations can be constructed as $G = (V, E)$, where $V$ and $E$ denote the vertex set and edge set of $G$, respectively. As the correlation matrix is symmetric, $G$ is an undirected graph that reflects the correlation among all labels. Regrettably, however, $G$ has $m$ vertices and $m(m-1)/2$ edges. For ultra-high-dimensional data, the density of the graph is considerable, which often leads to strong interweaving of edges with different weights. Moreover, the resolution of complete graphs is an NP-hard problem. Therefore, it is necessary to reduce the edges of $G$.
In the second step, OSLGC divides the graph to create label groups. To this end, we first generate a minimum spanning tree (MST) of $G$ using the Prim algorithm. The MST has the same vertices as $G$ but only a subset of its edges. The weight of each edge in the MST is denoted as $w(l_i, l_j)$, and it generally differs from edge to edge. To divide strongly correlated labels into groups, we set a threshold $\delta$ and break the edges of the MST whose weights fall below it.
Definition 7. Given the weights $w(l_i, l_j)$ of the edges in the MST, the threshold for weak label correlation is defined as $\delta = \frac{1}{|E_{MST}|} \sum_{(l_i, l_j) \in E_{MST}} w(l_i, l_j)$ (Equation (7)). That is, $\delta$ is the average of the edge weights, and it is used to divide the label groups, thereby putting strongly related labels in the same group.
Concretely, if $w(l_i, l_j) \geq \delta$, the relationship between labels $l_i$ and $l_j$ is a strong label correlation, and we retain the edge connecting $l_i$ and $l_j$. If $w(l_i, l_j) < \delta$, the relationship between $l_i$ and $l_j$ is a weak label correlation, and we remove the edge connecting them from the MST. Hence, the MST is segmented into a forest by this threshold. In the resulting forest, the label nodes within each subtree are strongly correlated, while the label nodes in different subtrees are weakly correlated. Based on this, we treat each subtree as a label group, denoted as $LG_k$.
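For concreteness, the grouping step can be sketched as follows. This is a minimal sketch, not the paper's implementation: it assumes the label correlation of Equation (6) is approximated by the empirical mutual information between discrete label columns, it relies on networkx and scikit-learn, and the helper name build_label_groups is ours.

```python
import itertools

import networkx as nx
import numpy as np
from sklearn.metrics import mutual_info_score


def build_label_groups(Y):
    """Y: (n_instances, m_labels) array of discrete label values."""
    m = Y.shape[1]
    G = nx.Graph()
    G.add_nodes_from(range(m))
    # Complete weighted graph: edge weight = pairwise label correlation
    # (Equation (6), approximated here by mutual information).
    for i, j in itertools.combinations(range(m), 2):
        G.add_edge(i, j, weight=mutual_info_score(Y[:, i], Y[:, j]))
    # Minimum spanning tree of G via the Prim algorithm.
    mst = nx.minimum_spanning_tree(G, algorithm="prim")
    # Threshold delta = average edge weight of the MST (Equation (7)).
    weights = [d["weight"] for _, _, d in mst.edges(data=True)]
    delta = float(np.mean(weights)) if weights else 0.0
    # Break the weakly correlated edges (weight < delta); every connected
    # subtree that remains is one label group LG_k.
    weak = [(u, v) for u, v, d in mst.edges(data=True) if d["weight"] < delta]
    mst.remove_edges_from(weak)
    return [sorted(component) for component in nx.connected_components(mst)]
```

Whatever the data, the returned list of index lists is a partition of the $m$ label indices, with each list playing the role of one label group $LG_k$.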
Example 1. A multi-label dataset is presented in Table 1. First, the label correlation matrix is calculated using Equation (6). Then, the label undirected graph is created from the label correlation matrix, as shown in Figure 1a. Immediately afterwards, the minimum spanning tree is generated by the Prim algorithm, as shown in Figure 1b. Finally, the threshold $\delta$ of the MST is calculated using Equation (7), and the edges whose weights fall below $\delta$ are removed, as shown in Figure 1c.
4.2. Analyzing Feature Interaction under Label Groups
As a rule, related labels generally share some label-specific features [17,33]; i.e., labels within the same label group may share the same label-specific features. Thus, to generate label-specific features for different label groups, this subsection further analyzes feature relationships under different label groups, including feature independence, feature redundancy, and feature interaction. We also introduce an interaction weight factor to quantify the degree of influence of the feature relationship under different label groups.
Definition 8 (Feature independence). Given a set of label groups $LG = \{LG_1, LG_2, \ldots, LG_K\}$, let $S_{t-1}$ denote the selected features and $f_t$ a new incoming feature at time $t$. For $f_i \in S_{t-1}$, $f_i$ and $f_t$ are referred to as feature independence under $LG_k$ if, and only if, $I(f_i, f_t; LG_k) = I(f_i; LG_k) + I(f_t; LG_k)$ (8). According to Definition 8, Equation (8) suggests that the information provided by features $f_i$ and $f_t$ for label group $LG_k$ is non-interfering, i.e., the features are independent of each other under label group $LG_k$.
Theorem 1. If $I(f_t; LG_k \mid f_i) = I(f_t; LG_k)$ or $I(f_i; LG_k \mid f_t) = I(f_i; LG_k)$, then $f_i$ and $f_t$ are independent under label group $LG_k$.
Proof. $I(f_i, f_t; LG_k) = I(f_i; LG_k) + I(f_t; LG_k \mid f_i) = I(f_t; LG_k) + I(f_i; LG_k \mid f_t)$. If $I(f_t; LG_k \mid f_i) = I(f_t; LG_k)$ or $I(f_i; LG_k \mid f_t) = I(f_i; LG_k)$, then $I(f_i, f_t; LG_k) = I(f_i; LG_k) + I(f_t; LG_k)$. Thus, $f_i$ and $f_t$ are independent under label group $LG_k$. □
Theorem 2. If $f_i$ and $f_t$ are independent under the condition that label group $LG_k$ is known, then $I(f_i; f_t \mid LG_k) = 0$.
Proof. If $f_i$ and $f_t$ are independent given $LG_k$, i.e., $p(f_i, f_t \mid LG_k) = p(f_i \mid LG_k)\,p(f_t \mid LG_k)$, then, according to Definition 5, it can be proven that $I(f_i; f_t \mid LG_k) = 0$. □
Definition 9 (Feature redundancy). Given a set of label groups $LG = \{LG_1, LG_2, \ldots, LG_K\}$, let $S_{t-1}$ denote the selected features and $f_t$ a new incoming feature. For $f_i \in S_{t-1}$, $f_i$ and $f_t$ are referred to as feature redundancy under $LG_k$ if, and only if, $I(f_i, f_t; LG_k) < I(f_i; LG_k) + I(f_t; LG_k)$ (9). Equation (9) suggests that there is partial duplication of the information provided by the two features; that is, the amount of information brought by features $f_i$ and $f_t$ together for label group $LG_k$ is less than the sum of the information brought by the two features for $LG_k$ separately.
Theorem 3. If $I(f_t; LG_k \mid f_i) < I(f_t; LG_k)$ or $I(f_i; LG_k \mid f_t) < I(f_i; LG_k)$, then the relationship between $f_i$ and $f_t$ is a pair of feature redundancy under label group $LG_k$.
Proof. $I(f_i, f_t; LG_k) = I(f_i; LG_k) + I(f_t; LG_k \mid f_i) = I(f_t; LG_k) + I(f_i; LG_k \mid f_t)$. If $I(f_t; LG_k \mid f_i) < I(f_t; LG_k)$ or $I(f_i; LG_k \mid f_t) < I(f_i; LG_k)$, then $I(f_i, f_t; LG_k) < I(f_i; LG_k) + I(f_t; LG_k)$. Thus, the relationship between $f_i$ and $f_t$ is a pair of feature redundancy under label group $LG_k$. □
Definition 10 (Feature interaction). Given a set of label groups $LG = \{LG_1, LG_2, \ldots, LG_K\}$, let $S_{t-1}$ denote the selected features and $f_t$ a new incoming feature. For $f_i \in S_{t-1}$, $f_i$ and $f_t$ are referred to as a feature interaction under $LG_k$ if, and only if, $I(f_i, f_t; LG_k) > I(f_i; LG_k) + I(f_t; LG_k)$ (10). Equation (10) suggests that there is a synergy between features $f_i$ and $f_t$ for label group $LG_k$; that is, together they yield more information for label group $LG_k$ than would be expected from the sum of $I(f_i; LG_k)$ and $I(f_t; LG_k)$.
Theorem 4. If $I(f_t; LG_k \mid f_i) > I(f_t; LG_k)$ or $I(f_i; LG_k \mid f_t) > I(f_i; LG_k)$, then $f_i$ and $f_t$ are a pair of feature interaction under label group $LG_k$.
Proof. $I(f_i, f_t; LG_k) = I(f_i; LG_k) + I(f_t; LG_k \mid f_i) = I(f_t; LG_k) + I(f_i; LG_k \mid f_t)$. If $I(f_t; LG_k \mid f_i) > I(f_t; LG_k)$ or $I(f_i; LG_k \mid f_t) > I(f_i; LG_k)$, then $I(f_i, f_t; LG_k) > I(f_i; LG_k) + I(f_t; LG_k)$. Thus, $f_i$ and $f_t$ are a pair of feature positive interaction under label group $LG_k$. □
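Definitions 8–10 compare the joint mutual information of two features with the sum of their individual contributions. The following sketch illustrates that comparison; it assumes discrete feature columns and a label group already encoded as a single discrete variable, and the helpers joint and classify_relationship are illustrative names of ours, not from the paper.

```python
from sklearn.metrics import mutual_info_score


def joint(a, b):
    """Encode two discrete columns as a single joint discrete variable."""
    return [f"{x}|{y}" for x, y in zip(a, b)]


def classify_relationship(f_i, f_t, lg, tol=1e-12):
    """Compare I(f_i, f_t; LG) against I(f_i; LG) + I(f_t; LG) (Definitions 8-10)."""
    joint_mi = mutual_info_score(joint(f_i, f_t), lg)
    sum_mi = mutual_info_score(f_i, lg) + mutual_info_score(f_t, lg)
    gain = joint_mi - sum_mi
    if gain > tol:
        return "interaction"    # Definition 10: synergy between the features
    if gain < -tol:
        return "redundancy"     # Definition 9: overlapping information
    return "independence"       # Definition 8: non-interfering information
```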
Property 1. If two features $f_i$ and $f_t$ are not independent, the correlations between $f_i$ and $f_t$ under different label groups are distinct. This is easy to show with Example 2.
Example 2. Continuing with Table 1: as shown in Table 2, the joint mutual information of the two features with one label group is less than the sum of their individual mutual informations, so, according to Definition 9, the two features form a feature redundancy under that label group. However, for the other label group, the joint mutual information exceeds the sum; that is, the two features form a feature interaction under that label group. This finding suggests that the relationship between two features changes dynamically under different label groups. Consequently, to evaluate features accurately, it is imperative to quantify the influence of the feature relationships on feature relevance. That is, if the inflow of a new feature has a positive effect on predicting the labels, we should enlarge the weight of that feature; otherwise, its weight should be reduced. The feature interaction weight factor is defined to quantify the impact of the feature relationships as follows:
Table 2. The relationship between features.

| Mutual Information | Combination | Feature Relationship |
|---|---|---|
| $I(f_i, f_t; LG_1)$ | $I(f_i; LG_1) + I(f_t; LG_1)$ | Feature redundancy |
| $I(f_i, f_t; LG_2)$ | $I(f_i; LG_2) + I(f_t; LG_2)$ | Feature interaction |
Definition 11 (Feature Interaction Weight). Given a set of label groups $LG = \{LG_1, LG_2, \ldots, LG_K\}$, let $S_{t-1}$ denote the selected features and $f_t$ a new incoming feature. For $f_i \in S_{t-1}$, the feature interaction weight $\omega(f_i, f_t)$ between $f_i$ and $f_t$ is defined by Equation (11). $\omega(f_i, f_t)$ offers additional information for evaluating feature $f_t$. If feature $f_t$ and the selected feature $f_i$ are independent or redundant, it holds that $\omega(f_i, f_t) \leq 1$; however, if the feature relationship is interactive, it holds that $\omega(f_i, f_t) > 1$.
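Equation (11) is not reproduced above, so the following is only a hypothetical weight consistent with the description: it falls below 1 for redundant pairs, equals 1 for independent pairs, and exceeds 1 for interacting pairs. The function interaction_weight and the normalization by the group entropy are our own illustrative choices.

```python
from sklearn.metrics import mutual_info_score


def interaction_weight(f_i, f_t, lg):
    """Hypothetical weight: <1 for redundancy, =1 for independence, >1 for interaction."""
    pair = [f"{a}|{b}" for a, b in zip(f_i, f_t)]
    joint_mi = mutual_info_score(pair, lg)
    sum_mi = mutual_info_score(f_i, lg) + mutual_info_score(f_t, lg)
    h_lg = mutual_info_score(lg, lg)  # I(X; X) = H(X), the entropy of the group
    if h_lg == 0.0:
        return 1.0
    # Normalize the interaction gain by H(LG) so the weight stays in a bounded range.
    return 1.0 + (joint_mi - sum_mi) / h_lg
```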
4.3. Streaming Feature Selection with Label Group Correlation and Feature Interaction
Streaming features are features acquired over time; in practice, however, not all dynamically obtained features are helpful for prediction. Therefore, it is necessary to extract valuable features from the streaming feature environment. To achieve this purpose, in this paper, we analyze streaming features in two stages: online feature relevance analysis and online feature interaction analysis.
4.3.1. Online Feature Relevance Analysis
The purpose of feature relevance analysis is to select features that are important to the label groups. Correspondingly, the feature relevance is defined as follows:
Definition 12 (Feature Relevance). Given label groups $LG = \{LG_1, LG_2, \ldots, LG_K\}$ and a new incoming feature $f_t$, the feature relevance term $D(f_t)$ is defined as a weighted combination of the relevance of $f_t$ to each label group, in which $w_k$ denotes the weight assigned to label group $LG_k$ and is computed from its information entropy $H(LG_k)$. The higher the weight of a label group, the more important that label group is relative to the other label groups. In other words, the label-specific features corresponding to such a label group should have higher feature importance.
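A minimal sketch of this relevance term, assuming it takes the form of an entropy-weighted sum of mutual information over the label groups (the exact equation is not shown above) and that each label group has already been encoded as a single discrete variable:

```python
from sklearn.metrics import mutual_info_score


def feature_relevance(f_t, lg_vars):
    """Entropy-weighted relevance of f_t over the label groups (assumed form of Def. 12)."""
    entropies = [mutual_info_score(lg, lg) for lg in lg_vars]   # H(LG_k)
    total = sum(entropies)
    if total == 0.0:
        return 0.0
    weights = [h / total for h in entropies]                    # label-group weights w_k
    return sum(w * mutual_info_score(f_t, lg)                   # sum_k w_k * I(f_t; LG_k)
               for w, lg in zip(weights, lg_vars))
```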
Definition 13. Given label groups $LG = \{LG_1, LG_2, \ldots, LG_K\}$, a new incoming feature $f_t$, and its feature relevance $D(f_t)$, with a pair of thresholds $\alpha$ and $\beta$ ($\alpha > \beta$), we define:
(1) $f_t$ is strongly relevant, if $D(f_t) \geq \alpha$;
(2) $f_t$ is weakly relevant, if $\beta \leq D(f_t) < \alpha$;
(3) $f_t$ is irrelevant, if $D(f_t) < \beta$.
In general, for a new incoming feature $f_t$: if $f_t$ is strongly relevant, we select it; if $f_t$ is irrelevant, we directly abandon it and no longer consider it later; if $f_t$ is weakly relevant, deciding immediately, whether to select or abandon it, carries a high risk of misjudgment, and the best approach is to obtain more information before making a decision.
4.3.2. Online Feature Interaction Analysis
Definition 13 can be used to identify weakly relevant features intuitively, but it does not provide a basis for selecting or abandoning them. Therefore, it is necessary to determine whether to remove or retain the weakly relevant features.
Definition 14. Given label groups $LG = \{LG_1, LG_2, \ldots, LG_K\}$, let $S_{t-1}$ denote the selected features and $f_t$ a new incoming feature. The feature relevance when considering feature interaction, called the enhanced feature relevance $\tilde{D}(f_t)$, is defined by incorporating into $D(f_t)$ the feature interaction weights $\omega(f_i, f_t)$ between $f_t$ and the selected features $f_i \in S_{t-1}$. Furthermore, to determine whether to retain a weakly relevant feature, we set the mean value of the feature relevance of the selected features as the relevance threshold, as follows:
Definition 15. Let $S_{t-1}$ denote the selected features at time $t$; the mean value of the feature relevance of the selected features is $\mu_{S} = \frac{1}{|S_{t-1}|} \sum_{f_i \in S_{t-1}} D(f_i)$. Obviously, when $\tilde{D}(f_t) \geq \mu_{S}$, the weakly relevant feature $f_t$ interacts with the selected features. In this case, $f_t$ can enhance the prediction ability and is selected as an effective feature. Otherwise, when $\tilde{D}(f_t) < \mu_{S}$, adding the weakly relevant feature does not promote the prediction ability for the labels, and, in this case, we can discard feature $f_t$. A sketch of this decision rule is given below.
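The sketch below combines Definitions 14 and 15 under our working assumptions: it reuses the feature_relevance and interaction_weight helpers sketched earlier, and it reads the enhanced relevance as the relevance of $f_t$ scaled by its average interaction weight against the selected features, which is one plausible (not verified) reading of Definition 14.

```python
import numpy as np


def keep_weak_feature(f_t, selected, lg_vars):
    """Decide whether a weakly relevant feature f_t should be selected (Defs. 14-15)."""
    if not selected:
        return False
    # Enhanced relevance: relevance of f_t scaled by its average interaction
    # weight with the already-selected features (assumed form).
    avg_weight = np.mean([np.mean([interaction_weight(f_i, f_t, lg) for lg in lg_vars])
                          for f_i in selected])
    enhanced = avg_weight * feature_relevance(f_t, lg_vars)
    # Definition 15: threshold = mean relevance of the already-selected features.
    threshold = np.mean([feature_relevance(f_i, lg_vars) for f_i in selected])
    return enhanced >= threshold
```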
4.3.3. Streaming Feature Selection Strategy Using Sliding Windows
According to Definition 13, two main issues need to be addressed: (1) how to design a streaming feature selection mechanism to discriminate the newly arrived features; (2) how to set proper thresholds for $\alpha$ and $\beta$.
(1) Streaming feature selection with sliding windows: To solve the first challenge, a sliding window mechanism is proposed to receive the arriving features in time order, which is consistent with the dynamic nature of streaming features. The specific process can be illustrated using the example in Figure 2.
First, the sliding window (SW) continuously receives and stores the arriving features. When the number of features in the sliding window reaches the preset size, the features in the window are evaluated, and each of them is either selected, abandoned, or delayed.
According to the feature relevance (Definition 12), the strongly relevant features can be selected straightforwardly, as shown in Figure 2. Similarly, the irrelevant features are discarded from the sliding window.
For weakly relevant features, we further analyze the enhanced feature relevance by considering the feature interaction. If a weakly relevant feature satisfies the condition of Definition 15, it is selected; otherwise, it is retained in the sliding window while new features are awaited.
This process is performed repeatedly: whenever the sliding window becomes full, or no new features arrive, the next round of feature analysis is performed.
Figure 2. Streaming feature selection with sliding window.
(2) Threshold setting for $\alpha$ and $\beta$: To solve the second challenge, we assume that the experimental data follow a normal distribution and that the streaming features arrive randomly. Inspired by the 3$\sigma$ principle of the normal distribution, we set $\alpha$ and $\beta$ using the mean and standard deviation of the feature relevance of the features in the sliding window.
Definition 16. Given a sliding window $SW$ and the feature relevance $D(f_i)$ of each feature $f_i \in SW$, the mean value of the sliding window at time $t$ is $\mu_t = \frac{1}{|SW|} \sum_{f_i \in SW} D(f_i)$.
Definition 17. Given a sliding window $SW$ and the feature relevance $D(f_i)$ of each feature $f_i \in SW$, the standard deviation of the sliding window at time $t$ is $\sigma_t = \sqrt{\frac{1}{|SW|} \sum_{f_i \in SW} \left(D(f_i) - \mu_t\right)^2}$.
Therefore, we combine the 3$\sigma$ principle of normally distributed data to redefine the three feature relationships.
Definition 18. Given the feature relevance $D(f_t)$ at time $t$, and the mean $\mu_t$ and standard deviation $\sigma_t$ of the features in the sliding window, we define the three feature relationships as follows (a sketch follows the list):
(1) $f_t$ is strongly relevant, if $D(f_t) > \mu_t + \sigma_t$;
(2) $f_t$ is weakly relevant, if $\mu_t - \sigma_t \leq D(f_t) \leq \mu_t + \sigma_t$;
(3) $f_t$ is irrelevant, if $D(f_t) < \mu_t - \sigma_t$.
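The threshold computation can be sketched as follows, assuming the cut-offs $\alpha = \mu_t + \sigma_t$ and $\beta = \mu_t - \sigma_t$ (our reading of the 3$\sigma$-inspired rule; the paper's exact cut-offs may differ). The names relevance_thresholds and label_feature are ours.

```python
import numpy as np


def relevance_thresholds(window_relevances):
    """Definitions 16-17: mean and standard deviation of D(.) over the window."""
    mu = float(np.mean(window_relevances))
    sigma = float(np.std(window_relevances))
    return mu + sigma, mu - sigma          # assumed (alpha, beta)


def label_feature(d_ft, alpha, beta):
    """Definition 18 with the assumed cut-offs."""
    if d_ft > alpha:
        return "strongly relevant"
    if d_ft < beta:
        return "irrelevant"
    return "weakly relevant"
```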
Through the above analysis, we propose a novel algorithm, named OSLGC, as shown in Algorithm 1.
Algorithm 1 The OSLGC algorithm
Input: $SW$: sliding window, $F$: predictive features, $L$: label set.
Output: $S_t$: the feature subset at time $t$.
1: Generate label groups by Section 4.1;
2: repeat
3:  Get a new feature $f_t$ at time $t$;
4:  Add feature $f_t$ to the sliding window $SW$;
5:  while $SW$ is full or no features are available do
6:   Compute $\mu_t$ and $\sigma_t$;
7:   for each $f_i \in SW$ do
8:    if $f_i$ is not irrelevant (Definition 18) then
9:     if $f_i$ is strongly relevant, or $f_i$ is weakly relevant and satisfies Definition 15 then
10:     $S_t \leftarrow S_t \cup \{f_i\}$;
11:    end if
12:   else
13:    Discard $f_i$;
14:   end if
15:  end for
16: end while
17: until no features are available;
18: Return $S_t$;
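An end-to-end sketch of the sliding-window loop of Algorithm 1 is given below. It reuses the helpers sketched in the previous subsections (feature_relevance, relevance_thresholds, label_feature, keep_weak_feature); the driver names oslgc and process_window, the (name, column) stream format, and the window_size default are our own illustrative choices, not the paper's code.

```python
def oslgc(feature_stream, lg_vars, window_size=5):
    """feature_stream: iterable of (name, discrete column); returns selected names."""
    window, selected = [], []
    for name, column in feature_stream:
        window.append((name, column))
        if len(window) < window_size:
            continue                               # keep filling the window
        window = process_window(window, selected, lg_vars)
    if window:                                     # stream exhausted: final pass
        process_window(window, selected, lg_vars)
    return [name for name, _ in selected]


def process_window(window, selected, lg_vars):
    """One round of window analysis; returns the delayed (still undecided) features."""
    relevances = {name: feature_relevance(col, lg_vars) for name, col in window}
    alpha, beta = relevance_thresholds(list(relevances.values()))
    delayed = []
    for name, col in window:
        status = label_feature(relevances[name], alpha, beta)
        if status == "strongly relevant":
            selected.append((name, col))
        elif status == "weakly relevant":
            if keep_weak_feature(col, [c for _, c in selected], lg_vars):
                selected.append((name, col))
            else:
                delayed.append((name, col))        # delay: wait for more features
        # irrelevant features are simply discarded
    return delayed
```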
The major computation in OSLGC is the feature analysis in the sliding window (Steps 5–16). Assume that $p$ is the number of currently arrived features and $m$ is the number of labels. In the best-case scenario, OSLGC obtains the feature subset after running only the online feature relevance analysis, and the time complexity is that of computing the relevance of each arrived feature with respect to every label group. In many cases, however, the features are not simply strongly relevant or irrelevant but weakly relevant, so the online feature interaction analysis must be performed as well; the final time complexity therefore additionally depends on the number of already-selected features.