In this section, we first present the formulation of the RSRRW model and then describe its optimization procedure in detail.
3.1. Model Formulation
The RSRRW model framework is shown in Figure 2. The model consists of four main components: sample weight learning, feature weight learning, the ε-dragging process, and coefficient matrix learning. The functions of these components are summarized as follows. (1) Sample weight learning assigns a probability weight to each sample point in the data: when a sample is a normal point, its probability weight is 1; otherwise, its probability weight is 0. (2) Feature weight learning differentiates the contributions of individual features. (3) The ε-dragging process forces the regression targets of different categories to move in opposite directions, enlarging the distance between categories. (4) Coefficient matrix learning uses the least squares method to learn the regression coefficient matrix and the bias.
The ε-dragging strategy was proposed in [40], and its mathematical form is given in objective function (7), where the matrix to be learned is the original regression coefficient matrix.
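For readability, we give a hedged sketch of the standard ε-dragging least squares objective from the literature; the symbols X (data matrix with samples as columns), Y (binary label matrix), W (coefficient matrix), b (bias), B (direction matrix), and M (non-negative slack matrix) are notational assumptions on our part rather than the paper's exact notation.

\[
\min_{W,\,b,\,M\ge 0}\;\bigl\|X^{\top}W+\mathbf{1}b^{\top}-(Y+B\odot M)\bigr\|_F^{2}+\lambda\|W\|_F^{2},
\qquad
B_{ij}=\begin{cases}+1,&Y_{ij}=1,\\-1,&Y_{ij}=0,\end{cases}
\]

so that each regression target is dragged outward by a learned non-negative amount in the direction indicated by B, which enlarges the margins between classes.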
Since EEG signals are multi-channel time series, features from different frequency bands and leads contribute differently to a specific task. Objective function (7) cannot distinguish the importance of different features, which is a shortcoming. Therefore, we introduce a feature weight matrix to describe the importance of the sample features and construct objective function (8) by re-weighting the features accordingly.
Considering the weak and unstable nature of EEG signals, the quality of different samples varies greatly. To improve the robustness of the model to such samples, we introduce a sample probability weight so that the model can automatically filter outliers during training, which yields objective function (9). In model (9), s is the vector of sample weights, k is the number of normal points among the samples, and each column of the data matrix corresponds to one sample. However, model (9) can only be used in a supervised setting, so we extend it to the semi-supervised setting in model (10).
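As a reading aid, the following is a hedged sketch of one common way to organize the semi-supervised targets, consistent with the column-wise simplex constraint used later in (44)–(56); the block notation is our assumption.

\[
F=\bigl[\,Y_{l}\;\;F_{u}\,\bigr],\qquad F_{u}^{\top}\mathbf{1}=\mathbf{1},\quad F_{u}\ge 0,
\]

where Y_l collects the known label indicator vectors of the labeled EEG samples and each column of F_u is a soft label vector on the probability simplex, to be estimated for an unlabeled sample.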
By introducing an appropriate variable substitution, the above objective function can be rewritten as (11). When the remaining variables are fixed, the second term of objective function (11) can be rewritten as (12). According to the Lagrange multiplier method, we write the Lagrangian function of (12), set its derivative with respect to the relevant variable to 0, and obtain (13). With this solution, objective function (12) is equivalent to (14). Now model (11) can be rewritten as (15).
3.2. Model Optimization
There are five variables in model (15) that need to be optimized. For this problem, we design a joint iterative optimization algorithm that divides the problem into four sub-problems, solves them separately, and iterates until convergence.
Performing a simple variable substitution on objective function (15), we obtain (16), in which the i-th column of the residual matrix collects the reconstruction error of the i-th sample and the loss in (16) is written as a function of this residual.
We introduce an effective algorithm [41] to solve objective function (16). The general problem it addresses is given in (17), where x and the other quantities may be scalars, vectors, or matrices. Algorithm 1 describes the detailed procedure. By comparing (16) with (17), the components of problem (16) can be identified with the corresponding components of (17).
Algorithm 1 Solution of (17).
- Input: an initial value of the variable to be optimized.
- Output: the optimal x.
- 1: while not converged do
- 2: Calculate the current re-weighting coefficient;
- 3: Solve the resulting weighted minimization problem to update x;
- 4: end while
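The following is a minimal sketch of this generic re-weighted iteration, assuming, as the structure of Algorithm 1 suggests, that each pass first computes a re-weighting coefficient from the current iterate and then solves the induced weighted sub-problem; the function names and the convergence test are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def reweighted_solver(compute_weights, solve_weighted, x0, max_iter=50, tol=1e-6):
    """Generic iteratively re-weighted scheme in the spirit of Algorithm 1.

    compute_weights(x): returns the re-weighting coefficients for the current iterate.
    solve_weighted(d):  returns the minimizer of the weighted sub-problem given weights d.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = compute_weights(x)                              # step 2: re-weighting coefficients
        x_new = np.asarray(solve_weighted(d), dtype=float)  # step 3: weighted minimization
        if np.linalg.norm(x_new - x) < tol:                 # stop when the iterate stabilizes
            return x_new
        x = x_new
    return x
```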
Firstly, we need to calculate the re-weighting coefficient; its formulation is given in (18).
Secondly, we need to solve problem (19). To facilitate subsequent calculations, we rewrite (19) in the matrix form (20). Since (19) is an unconstrained optimization problem, we can solve it by setting its gradient to zero. Taking the partial derivative of the objective J with respect to the bias gives (21). Setting (21) to 0 yields the update formula (22) for the bias. Replacing the bias in (20) with the result obtained in (22) and simplifying, we obtain (23).
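As an illustration of this elimination step, in the plain (unweighted) least squares case the stationarity condition for the bias gives the familiar centering identity below; the weighted variant leading to (23) follows the same pattern, and the symbols here are our assumptions.

\[
b^{*}=\tfrac{1}{n}\bigl(Y-X^{\top}W\bigr)^{\top}\mathbf{1},
\qquad
\bigl\|X^{\top}W+\mathbf{1}(b^{*})^{\top}-Y\bigr\|_F^{2}
=\bigl\|H\bigl(X^{\top}W-Y\bigr)\bigr\|_F^{2},
\quad
H=I-\tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}.
\]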
In (23), the transformed data and target matrices are defined by this substitution. Similarly, (23) is also an unconstrained optimization problem, so we take the partial derivative of (23) with respect to the coefficient matrix and obtain (24), where the diagonal matrix has its i-th diagonal element determined by the i-th row of the current coefficient matrix, with a fixed, arbitrarily small constant added for numerical stability. Setting the partial derivative to 0, the closed-form expression (25) of the coefficient matrix is obtained.
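For orientation, the standard closed-form update for an ℓ2,1-regularized least squares problem of this type (which the re-weighting structure with the small constant δ suggests) takes the form below; the tilde quantities denote the transformed data and targets after the substitution above, w^i is the i-th row of W, and the exact symbols are assumptions on our part, since the paper's version additionally carries the sample and feature weights.

\[
W=\bigl(\tilde{X}\tilde{X}^{\top}+\lambda D\bigr)^{-1}\tilde{X}\tilde{Y},
\qquad
D_{ii}=\frac{1}{2\bigl\|w^{i}\bigr\|_2+\delta}.
\]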
The corresponding objective function with respect to the sample weight vector s is formulated in (27), where each term measures the approximation error of one sample by the norm of its residual. It is interesting that the parameter k and the variable s are closely related in model (27). To be specific, the value of k equals the number of elements of s that are one, corresponding to the number of normal samples involved in model training. By calculating and ranking the loss of each sample, the optimal s can be obtained: the weights of the k samples with the smallest errors are set to one and the remaining weights are zero. This regularity is formalized in the following Theorem 1.
Theorem 1. The optimal s in problem (27) is a binary vector in which the weights of the k samples with the smallest errors are one and the others are zero.
Proof 1. Suppose, for contradiction, that there exists another weight vector that satisfies (28) subject to the same constraints as s. Firstly, we sort the samples in ascending order of error, as in (29). Splitting (29) accordingly, we obtain (30). Moving the first term from the left of the inequality sign to the right yields (31). According to (30), we can get (32). Combining (32) with formula (33), we obtain (34). When the inequality signs in (30) do not all hold with equality at the same time, Equation (34) obviously cannot hold, so the assumption is logically impossible. Therefore, the optimal s in problem (27) is a binary vector in which the weights of the k samples with the smallest errors are one and the others are zero. □
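In practice, the update implied by Theorem 1 reduces to a sort. A minimal sketch is given below, assuming the per-sample losses have already been computed; the array name and helper are illustrative.

```python
import numpy as np

def update_sample_weights(losses, k):
    """Binary sample weights of Theorem 1.

    losses: 1-D array of per-sample approximation errors.
    k:      number of samples treated as normal points.
    The k samples with the smallest losses receive weight 1, the rest 0.
    """
    s = np.zeros(len(losses))
    s[np.argsort(losses)[:k]] = 1.0   # indices of the k smallest errors
    return s
```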
From problem (15), the objective function in terms of the slack matrix is (35), where the direction matrix and the residual terms are defined accordingly. It is easily verified that the direction matrix is similar to the one defined in [40] for the supervised setting. Since the elements are optimized independently of one another in (35), for a single element of the matrix we can convert (35) into the form (36). Obviously, the solution to this element-wise problem is (37), and the solution to (35) is then given by (38).
From formula (38), we find that the ε-dragging method essentially encourages each predicted value to migrate beyond 1 for its own category or below 0 for the other categories, by constructing the loss as a two-branch piecewise function. Below we use an example to detail the updating method during the optimization process. Note that, in the algorithm, the ε-dragging method is implemented through the direction matrix and the slack matrix, and each element of the slack matrix can be regarded as a dragging amount; to maintain consistency of notation, such an element is denoted ε in the following.
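For concreteness, a hedged sketch of this element-wise solution in standard ε-dragging notation (our symbols: ŷ is the predicted value of one entry, Y ∈ {0, 1} its original target, and B its dragging direction) is

\[
\varepsilon=\max\bigl(B(\hat{y}-Y),\,0\bigr),
\qquad
B=\begin{cases}+1,&Y=1,\\-1,&Y=0,\end{cases}
\]

so that the dragged regression target becomes Y + Bε.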
Consider a sample whose emotional label is the first category. In model (35), the model produces a predicted value for this sample in each of the four categories considered in this example. To distinguish this setting from plain LSR, we take the corresponding column of the direction matrix as the direction vector of the sample, whose first entry is +1 and whose remaining entries are −1. The error generated by this sample is given in (39). Based on (39), the error consists of four terms, which can be divided into two groups: the error generated by the predicted value of the category to which the sample belongs, corresponding to the first term in (39), and the errors generated by the predicted values of the other categories, corresponding to the second, third, and fourth terms in (39). We can easily draw the following findings.
The error of the sample's own category (that is, the first term in (39)) is calculated in two steps. Firstly, the slack ε can be represented by the piecewise function in (40). Then the corresponding error is calculated as in (41). Formula (41) indicates that when the predicted value of the category to which the sample belongs is greater than 1, the error generated by this term is 0. In traditional LSR, however, an error would still be generated in this case, namely the squared difference between the predicted value and 1. The ε-dragging method uses the non-negative slack variable ε to offset this error, thereby encouraging the predicted value of the sample's own category to move beyond 1.
Similarly, for the other categories (i.e., terms 2, 3, and 4 in (39)), the analysis is as follows. The slack ε can be represented by the piecewise function in (42), and the corresponding error is calculated as in (43). Formula (43) indicates that when the predicted value of another category is less than 0, the error generated by this term is 0. In LSR, however, an error would still be generated in this case, namely the square of the predicted value. The ε-dragging method uses the non-negative slack variable ε to offset this error, thereby encouraging the predicted values of the other categories to move below 0.
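Summarizing the two cases above in one hedged sketch, with ŷ denoting the predicted value of the category in question for this sample:

\[
\text{own category: }\varepsilon=\max(\hat{y}-1,\,0),\;\;
\text{error}=\bigl(\max(1-\hat{y},\,0)\bigr)^{2};
\qquad
\text{other categories: }\varepsilon=\max(-\hat{y},\,0),\;\;
\text{error}=\bigl(\max(\hat{y},\,0)\bigr)^{2}.
\]

By contrast, plain LSR would incur the errors (ŷ − 1)² and ŷ² in these two cases regardless of the sign conditions, which is exactly the difference discussed above.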
Because each column of the label matrix can be optimized independently of the others, we optimize it in a column-wise manner. Specifically, for each column, the corresponding objective function is given in (44), with the associated constant vector defined accordingly. To simplify the derivation, we optimize the equivalent form (45) of problem (44).
Problem (45) can be solved by the Lagrange multiplier method combined with the KKT conditions. The corresponding Lagrangian function is (46), where the two Lagrangian multipliers take scalar and vector forms, respectively. Assume that the optimal solution has been reached and denote the associated optimal Lagrangian multipliers accordingly. According to the KKT conditions, we have (47) for each entry of the column. The last expression in (47) can be written in vector form as (48). Using the constraints, the parameter a in (48) can be written as (49). Substituting (49) into (48), we obtain (50). After denoting two auxiliary quantities as in (51), (50) can be rewritten as (52). Through (47) and (52), we can obtain the relation in (53); therefore, (52) can be written as (54).
At this point, if the optimal scalar multiplier can be determined, then according to (54) the optimal solution can also be determined. Similar to (54), (52) can be rewritten as (55) under the corresponding condition, so the remaining quantity can be calculated accordingly. According to the sum-to-one constraint and (54), we can define the function in (56), and the optimal scalar multiplier should make this function equal 0. When (56) is set to 0, the optimal multiplier can be calculated by Newton's method, as given in (57).
Finally, the optimization procedure for sub-problem (44) is listed in Algorithm 2, and the procedure for solving problem (15) is summarized in Algorithm 3.
Algorithm 2 The algorithm to solve sub-problem (45).
- Input: the vector defining sub-problem (45).
- Output: the optimal solution of (45).
- 1: Compute the auxiliary quantities required by (52)–(54);
- 2: Use Newton’s method to obtain the root of (56);
- 3: Obtain the optimal solution by (54).
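Below is a minimal sketch of the kind of Newton-based update Algorithm 2 performs, assuming, as the sum-to-one and non-negativity constraints in (44)–(56) suggest, that the sub-problem reduces to projecting a vector onto the probability simplex; the function name and this reduction are our assumptions rather than the paper's exact procedure.

```python
import numpy as np

def project_to_simplex_newton(v, max_iter=100, tol=1e-10):
    """Solve min_f ||f - v||^2  s.t.  f >= 0, sum(f) = 1.

    Newton's method is applied to the piecewise-linear function
    g(eta) = sum(max(v + eta, 0)) - 1, whose root eta* gives f = max(v + eta*, 0).
    """
    v = np.asarray(v, dtype=float)
    eta = (1.0 - v.sum()) / len(v)          # initial guess: uniform shift
    for _ in range(max_iter):
        active = (v + eta) > 0              # entries that are currently positive
        g = np.sum(v[active] + eta) - 1.0   # constraint violation g(eta)
        dg = np.count_nonzero(active)       # derivative of g with respect to eta
        if dg == 0:                         # all entries inactive: shift up and retry
            eta += 1.0
            continue
        step = g / dg
        eta -= step                         # Newton update
        if abs(step) < tol:
            break
    return np.maximum(v + eta, 0.0)
```

If that reduction holds, each column update in step 7 of Algorithm 3 amounts to one call of this routine on a suitably formed vector.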
Algorithm 3 The optimization algorithm of RSRRW.
- Input: Labeled EEG data with its corresponding label matrix, unlabeled EEG data, and the regularization parameters;
- Output: The estimated labels of the unlabeled data.
- 1: Initialize the coefficient matrix randomly; initialize the feature weights uniformly; initialize each element of the slack matrix as 0; initialize each element of the sample weight vector as 1;
- 2: while not converged do
- 3: Update the re-weighting coefficient by (18);
- 4: Update the bias and the coefficient matrix by (22) and (25);
- 5: Update the sample weight vector according to Theorem 1;
- 6: Update the slack matrix by (38);
- 7: Update the label matrix by solving (45) for each column;
- 8: end while
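Finally, a high-level sketch of Algorithm 3's alternating scheme, assuming each block update is available as a routine implementing the closed-form solutions derived above; all function names, the state dictionary, and the convergence test are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def rsrrw_optimize(update_reweighting, update_coefficients, sample_losses,
                   update_slack, update_labels, init_state, k,
                   max_iter=50, tol=1e-5):
    """Alternating optimization skeleton in the spirit of Algorithm 3.

    Each callable takes the current state dict and returns the updated state;
    sample_losses(state) returns the per-sample errors used by Theorem 1.
    """
    state = dict(init_state)
    prev_obj = np.inf
    for _ in range(max_iter):
        state = update_reweighting(state)       # step 3: re-weighting coefficient
        state = update_coefficients(state)      # step 4: bias and coefficient matrix
        losses = sample_losses(state)           # errors used by Theorem 1
        s = np.zeros(len(losses))
        s[np.argsort(losses)[:k]] = 1.0         # step 5: k smallest errors get weight 1
        state["s"] = s
        state = update_slack(state)             # step 6: epsilon-dragging slack matrix
        state = update_labels(state)            # step 7: column-wise label update
        obj = state.get("objective")
        if obj is not None and abs(prev_obj - obj) < tol:
            break
        if obj is not None:
            prev_obj = obj
    return state
```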