Abstract
Fuzzy extreme learning machine (FELM) is an effective algorithm for dealing with classification problems with noise, as it uses a membership function to suppress the influence of noisy data. However, FELM has the following drawbacks: (a) the membership degree of samples in FELM is constructed by considering only the distance between the samples and the class center, not the local information of samples, so some boundary samples are easily mistaken for noise; (b) FELM uses the least squares loss function, which makes it sensitive to feature noise and unstable under re-sampling. To address these drawbacks, we propose an intuitionistic fuzzy extreme learning machine with the truncated pinball loss (TPin-IFELM). Firstly, we use the K-nearest neighbor (KNN) method to obtain local information of the samples and then construct membership and non-membership degrees for each sample in the random mapping feature space based on this valuable local information. Secondly, we calculate the score value of samples based on the membership and non-membership degrees, which can effectively identify whether boundary samples are noise or not. Thirdly, to maintain the sparsity and robustness of the model and enhance its re-sampling stability, we introduce the truncated pinball loss function into the model. Finally, we employ the concave-convex procedure (CCCP) to solve the resulting non-convex optimization problem efficiently. Extensive comparative experiments on benchmark datasets verify the superior performance of TPin-IFELM.
1 Introduction
Extreme learning machine (ELM) [1] is a simple and efficient single-hidden layer feedforward neural network (SLFN) that is much faster than other feedforward neural networks. The input weights and hidden layer biases of ELM are randomly generated, and the network output weights are obtained by using Moore-Penrose generalized inverse. The objective of ELM is to achieve the minimal norm of output weights and the minimum training error. In recent years, to improve the classification performance of ELM, some improvements have been proposed [2,3,4,5,6,7,8,9,10,11,12]. Due to their excellent classification performance, these algorithms have been applied to a wide range of fields [13,14,15,16,17,18,19,20,21,22,23].
In practical classification problems, there is a large amount of noise in the data. Noise can interfere with the construction of classifiers and reduce the classification performance of algorithms, yet traditional ELMs cannot effectively suppress its negative impact. To enhance the noise robustness of ELM, Ren et al. [24] proposed the correntropy-based hinge loss robust extreme learning machine (CHELM). Wang et al. [25] proposed the extreme learning machine with the homotopy loss (\(l_1\)-HELM), which introduces the homotopy loss into ELM. To enhance noise robustness and re-sampling stability, Ren et al. [26] proposed the extreme learning machine with the pinball loss function (PELM), which maximizes the quantile distance between two classes of samples. To further achieve robustness and sparsity, Shen et al. [27] introduced the \(\varepsilon \)-insensitive zone pinball loss into ELM. In practical classification problems, the data also contain a large number of redundant or irrelevant features, which negatively affect these algorithms and reduce their performance. To remove redundant or irrelevant features, Huang et al. [28] proposed a method that employs a new fuzzy \(\beta \) neighborhood-related discernibility measure and fuzzy \(\beta \) covering (FBC) decision tables. To enhance the robustness of FBC in feature learning, Huang et al. [29] proposed a noise-tolerant fuzzy-\(\beta \)-covering-based multigranulation rough set model. To deal with noisy data, the VPDI method [30] uses noise-tolerant discrimination indexes and a heuristic feature selection algorithm to reduce redundant or irrelevant features. Although these FBC-based feature selection algorithms can eliminate redundant or irrelevant features, they do not consider the importance of individual samples in the classification process. In recent years, determining the importance of samples has become a research hotspot. Inspired by the fuzzy support vector machine (FSVM) [31], Zhang et al. [32] proposed the fuzzy extreme learning machine (FELM), which assigns a membership degree to each training sample to reduce the influence of outliers and noise.
FELM is an effective algorithm for dealing with classification problems with noise. However, it has two drawbacks: (1) FELM considers only the membership degree of samples and not their non-membership degree, so boundary samples are easily mistaken for noise. (2) FELM uses the least squares loss function, which makes it sensitive to feature noise and unstable under re-sampling. To address these drawbacks, we propose an improved ELM model combining intuitionistic fuzzy sets (IFSs) and the truncated pinball loss function, called the intuitionistic fuzzy extreme learning machine with the truncated pinball loss (TPin-IFELM). First, TPin-IFELM constructs the membership and non-membership degrees based on the local information of samples obtained by the KNN method. The membership degree is calculated from the distance between the sample and the class center, and the non-membership degree is calculated from the correlation between all heterogeneous samples and all samples in its neighborhood. Further, we obtain the score value of each sample from its membership and non-membership degrees; this score can be used to effectively identify whether boundary samples are noise. Finally, to further reduce the negative effects of noise, we introduce the truncated pinball loss function [33, 34] into TPin-IFELM, which makes the model more robust and sparse. Since TPin-IFELM is a non-convex problem, we use the CCCP [35, 36] to solve it. Extensive experimental results show that the proposed TPin-IFELM is superior to several state-of-the-art comparison algorithms in dealing with classification problems with noise.
The rest of this paper is organized as follows. In Sect. 2, we briefly review ELM and its loss functions, and FELM and its improvement. In Sect. 3, we discuss the optimization model for the linear and nonlinear TPin-IFELM in detail. In Sect. 4, we investigate some properties of TPin-IFELM. In Sect. 5, the TPin-IFELM is evaluated via a series of experiments. Section 6 summarizes this paper and puts forward the future research direction.
2 Related Works
2.1 Notations
We define the binary dataset \(D=\left\{ \left( x_i,t_i\right) \mid 1\le i\le N\right\} \), where \(t_i\in \left\{ +1,-1\right\} \). Let \(\mathcal {X}^+=\left\{ x_i\mid \left( x_i,t_i\right) \in D,t_i=+1\right\} \) denote the positive samples, \(\mathcal {X}^-=\{x_j\mid (x_j,t_j)\in D,t_j=-1\}\) denote the negative samples, \(N^+=\vert \mathcal {X}^+\vert \), \(N^-=\vert \mathcal {X}^-\vert \), \(\mathcal {X}=\mathcal {X}^+\cup \mathcal {X}^-\), and \(N=N^++N^-\).
2.2 ELM and its Loss Functions
ELM is an effective single-hidden layer feedforward neural network [2, 37]. First, the input weights and hidden layer biases are randomly assigned; then the hidden layer output matrix H is obtained by applying the activation function \(G\left( \cdot \right) \); finally, the output weights \(\beta \) are obtained via the Moore-Penrose generalized inverse.
The output of ELM is defined as follows:
where \(\theta _i=[\theta _{i1},\ldots ,\theta _{in}]^{\textrm{T}}\in \mathfrak {R}^n\) and \(\vartheta _i\in \mathfrak {R}\) are the input weight vector and bias of the corresponding hidden node, respectively, \(h\left( x\right) =\left[ G(\theta _1^{\textrm{T}}x+\vartheta _1),\ldots ,G(\theta _L^{\textrm{T}}x+\vartheta _L)\right] ^{\textrm{T}}\in \mathfrak {R}^L\) is the random feature mapping output of the hidden layer, and \(\beta =\left[ \beta _1,\beta _2,\ldots ,\beta _L\right] ^{\textrm{T}}\in \mathfrak {R}^L\) is the output weight vector.
\(\beta \) can be obtained by solving
The optimal solution of (2) can be calculated by
where \(H=\left[ h\left( x_1\right) ,\ldots ,h\left( x_N\right) \right] ^{\textrm{T}}\in \mathfrak {R}^{N\times L}\) and \(T={\ \left[ t_1,t_2,\ldots ,t_N\right] }^{\textrm{T}}\in \mathfrak {R}^N\) is the output vector. \(H^\dag \) is the Moore-Penrose generalized inverse of matrix H.
The decision function of ELM is
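For concreteness, the following is a minimal sketch of the ELM training and prediction steps described above, assuming a sigmoid activation and the usual decision rule \(\textrm{sign}(h(x)^{\textrm{T}}\beta )\); the variable names are illustrative only.

```python
import numpy as np

def elm_train(X, t, L=100, rng=np.random.default_rng(0)):
    """Train a basic ELM: random hidden layer + Moore-Penrose pseudo-inverse."""
    n = X.shape[1]
    Theta = rng.standard_normal((L, n))                   # random input weights
    vartheta = rng.standard_normal(L)                     # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ Theta.T + vartheta)))   # sigmoid feature map, N x L
    beta = np.linalg.pinv(H) @ t                          # output weights via H^dagger T
    return Theta, vartheta, beta

def elm_predict(X, Theta, vartheta, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ Theta.T + vartheta)))
    return np.sign(H @ beta)                              # assumed decision rule sign(h(x)^T beta)
```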
In order to improve the classification performance of ELM, Huang et al. [3] proposed the optimization method-based ELM (OELM), which introduces the hinge loss function into ELM. To speed up the solution, Huang et al. [2] proposed the regularized ELM (RELM), which introduces the least squares loss function into ELM. However, OELM and RELM are sensitive to noise. To enhance the noise robustness of ELM, Ren et al. [26] proposed the extreme learning machine with the pinball loss function (PELM), which maximizes the quantile distance between two classes of samples.
For convenience, we unify the optimization problems of these algorithms as follows:
where \(L\left( \textrm{U}\right) \) is the loss function. When \(L\left( \textrm{U}\right) \) is the hinge loss or pinball loss function, \(\textrm{U}=1-t_ih\left( x_i\right) \cdot \beta \) (see Fig. 1a, b); when \(L\left( \textrm{U}\right) \) is the least squares loss function, \(\textrm{U}=t_i-h\left( x_i\right) \cdot \beta \) (see Fig. 1c).
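For reference, the standard forms of the three losses referred to above are (our notation, since Eq. (5) and Fig. 1 are not reproduced here; the pinball form follows [26]):

\[
L_{\textrm{hinge}}(\textrm{U})=\max (0,\textrm{U}),\qquad
L_{\tau }^{\textrm{pin}}(\textrm{U})=\begin{cases} \textrm{U}, & \textrm{U}\ge 0\\ -\tau \textrm{U}, & \textrm{U}<0 \end{cases},\qquad
L_{\textrm{ls}}(\textrm{U})=\textrm{U}^2.
\]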
2.3 FELM and Its Improvements
FELM [32] employs a membership degree to each training sample to reduce the influence of outliers and noise. The optimization problem of FELM can be formulated as follows:
where \(\xi _i\) is the training error, c is the penalty parameter, \(s_i=\begin{cases} 1-\frac{\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^+ \right\| }{r^++\delta }, & t_i=+1\\ 1-\frac{\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^- \right\| }{r^-+\delta }, & t_i=-1 \end{cases}\) is the membership degree of \(x_i\) in the random mapping feature space, \(\delta >0\) is a small positive constant, \({\widetilde{\mathcal {C}}}^+=\frac{1}{N^+}\sum _{x_i\in \mathcal {X}^+} h\left( x_i\right) \) and \({\widetilde{\mathcal {C}}}^-=\frac{1}{N^-}\sum _{x_i\in \mathcal {X}^-} h\left( x_i\right) \) are the centers of the positive and negative classes, respectively, and \(r^+=\max _{x_i\in \mathcal {X}^+}(\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^+ \right\| )\) and \(r^-=\max _{x_i\in \mathcal {X}^-}(\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^- \right\| )\) are the radii of the positive and negative classes, respectively.
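A minimal sketch of the FELM membership computation in the random feature space, directly following the formula for \(s_i\) above (\(\delta \) is the small constant in the denominator; the names are illustrative):

```python
import numpy as np

def felm_membership(H, t, delta=1e-7):
    """Membership degrees s_i = 1 - ||h(x_i) - C_class|| / (r_class + delta)."""
    s = np.empty(len(t))
    for label in (+1, -1):
        idx = (t == label)
        center = H[idx].mean(axis=0)                      # class center in feature space
        dists = np.linalg.norm(H[idx] - center, axis=1)
        radius = dists.max()                              # class radius
        s[idx] = 1.0 - dists / (radius + delta)
    return s
```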
However, FELM only considers the membership degree of samples, which can easily mistake some boundary samples for noise. To identify the noise among the support vectors, Rezvani et al. proposed intuitionistic fuzzy twin support vector machines (IFTSVM), which use intuitionistic fuzzy sets (IFSs) to construct the score values of samples [38]. In order to enhance the robustness and re-sampling stability of IFTSVM, Liang et al. proposed an intuitionistic fuzzy twin support vector machine with the \(\varepsilon \)-insensitive pinball loss (PIFTSVM) [39], which defines a score function named SFA:
where \(\mu (x)\) is the membership function and \(\nu (x)\) is the non-membership function. Laxmi et al. proposed multi-category intuitionistic fuzzy twin support vector machines to solve multi-class classification problems [40]. To effectively address class imbalance, Rezvani et al. proposed class imbalance learning using fuzzy ART and intuitionistic fuzzy twin support vector machines.
3 Intuitionistic Fuzzy Extreme Learning Machines with the Truncated Pinball Loss
In this section, we propose TPin-IFELM to address the drawbacks of FELM. The algorithm framework of TPin-IFELM is shown in Fig. 2.
3.1 Intuitionistic Fuzzy Settings
FELM easily mistakes some boundary samples for noise because it uses only the membership degree. To address this issue, in this subsection we construct an IFS for each sample to reduce the negative impact of noise.
Define an intuitionistic fuzzy set \(\bar{A}=\ \left\{ \left( x,\mu _{\bar{A}}\left( x\right) ,\nu _{\bar{A}}\left( x\right) \right) |\ x\in \mathcal {X}\right\} \), where \(\mu _{\bar{A}}\): \(\mathcal {X}\) \(\rightarrow \left[ 0,1\right] \) is the membership degree of x in \(\mathcal {X}\), \(\nu _{\bar{A}}: \mathcal {X}\rightarrow \left[ 0,1\right] \) is the non-membership degree of x in \(\mathcal {X}\), and \(0\le \mu _{\bar{A}}\left( x\right) +\nu _{\bar{A}}\left( x\right) \le 1\). We illustrate the acquisition of membership and non-membership degrees through the following examples.
3.1.1 Intuitionistic Fuzzy Membership Degree
In the random mapping feature space, the membership degree of samples is determined by the distance between samples and the class center, i.e.,
where \(1 \le i\le N\) and \(\varrho >0\) is an adjustable parameter in the random mapping feature space.
Example 1
Let \(h(x_*) = (0.91, 0.27, 0.21, 0.22, 0.23)\), \(t_*=+1\), \({\widetilde{\mathcal {C}}}^+ = (0.80, 0.52, 0.40, 0.57, 0.43)\) be the center of the positive class, and \(r^+ = 0.87\) be the radius of the positive class. The distance \(\left\| h(x_*)-{\widetilde{\mathcal {C}}}^+\right\| \approx 0.5227\), so, according to Eq. (8) with \(\varrho ={10}^{-7}\), \(\mu _*=1-\frac{0.5227}{0.87+{10}^{-7}}=0.3992\).
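The arithmetic of Example 1 can be reproduced with a small verification script; the membership formula used below is the one implied by Eq. (8) and the example values:

```python
import numpy as np

h_x = np.array([0.91, 0.27, 0.21, 0.22, 0.23])
center_pos = np.array([0.80, 0.52, 0.40, 0.57, 0.43])
r_pos, varrho = 0.87, 1e-7

dist = np.linalg.norm(h_x - center_pos)   # ~0.5227
mu = 1.0 - dist / (r_pos + varrho)        # ~0.3992
print(round(dist, 4), round(mu, 4))
```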
3.1.2 Intuitionistic Fuzzy Non-membership Degree
We can effectively capture the correlation between \(x_i\) and all heterogeneous samples in its neighborhood by using the KNN method, i.e.,
where \(KNN\left( h\left( x_i\right) \right) \) is used to represent the K nearest neighbors of \(x_i\) in the random mapping feature space.
The non-membership degree \(\nu _i\) is defined as:
and \(0\le \mu _i+\nu _i\le 1\).
Example 2
Let \(K = 5\) and \(\rho \left( x_*\right) = \frac{4}{5}\). According to Eq. (10), \(\nu _*=\left( 1-0.3992\right) \times \frac{4}{5}=0.4806\).
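A sketch of how \(\rho (x_i)\) and the non-membership degree can be computed with KNN. It assumes, as Example 2 suggests, that \(\rho (x_i)\) is the fraction of heterogeneous (opposite-label) samples among the K nearest neighbors and that Eq. (10) takes the form \(\nu _i=(1-\mu _i)\rho (x_i)\); the exact equations are not reproduced above, so this is an inferred reading:

```python
import numpy as np

def non_membership(H, t, mu, K=5):
    """nu_i = (1 - mu_i) * rho(x_i), with rho = fraction of heterogeneous K nearest neighbors."""
    N = len(t)
    nu = np.empty(N)
    for i in range(N):
        d = np.linalg.norm(H - H[i], axis=1)
        d[i] = np.inf                          # exclude the sample itself
        knn = np.argsort(d)[:K]                # indices of the K nearest neighbors
        rho = np.mean(t[knn] != t[i])          # proportion of opposite-label neighbors
        nu[i] = (1.0 - mu[i]) * rho
    return nu
```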
3.1.3 The Score Function
We construct an IFS \(\breve{S}=\left\{ \left( x_1,t_1,\mu _1,\nu _1\right) ,\left( x_2,t_2,\mu _2,\nu _2\right) ,\ldots ,\left( x_N,t_N,\mu _N,\nu _N\right) \right\} \). According to \(\breve{S}\), we construct the score value (SV) as follows:
where \(s_i=\mu _i\) indicates that \(x_i\) is a correctly classified sample; \(s_i=0\) indicates that \(x_i\) is noise; and \(s_i=\frac{1-\nu _i}{2-\mu _i-\nu _i}\) indicates that \(x_i\) is a support vector of the corresponding class, not noise.
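A sketch of the score value computation follows. The three branches match the description above; the case conditions (\(\nu _i=0\), \(\mu _i\le \nu _i\), otherwise) follow the usual intuitionistic-fuzzy scoring convention [38] and are an assumption here, since Eq. (11) is not reproduced:

```python
def score_value(mu, nu):
    """SV per Eq. (11): mu_i (clean sample), 0 (noise), or (1-nu)/(2-mu-nu) (boundary SV)."""
    s = []
    for m, v in zip(mu, nu):
        if v == 0.0:            # no heterogeneous neighbors: correctly classified sample
            s.append(m)
        elif m <= v:            # assumed condition: dominated by non-membership -> noise
            s.append(0.0)
        else:                   # boundary sample kept as a (down-weighted) support vector
            s.append((1.0 - v) / (2.0 - m - v))
    return s
```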
3.2 Linear Case
Unlike FELM [32] which uses the least squares loss function, TPin-IFELM employs the truncated pinball loss function, which not only makes the model robust to the noises but also preserves the sparsity. The truncated pinball loss function (see Fig. 3) is as follows:
where \(0\le \tau \le 1\), \(\varsigma >0\) is a preset value, and t is the label of x.
As shown in Fig. 3, the truncated pinball loss function takes into account the advantages of the hinge loss function and pinball loss function, so it has noise robustness and sparsity.
Equation (12) can be decomposed as follows:
where \(H_{1+\tau }\left( 1-tf\left( x\right) \right) = \left( 1+\tau \right) \max \left( 0,1-tf(x)\right) \) and \(H_\tau \left( 1-tf\left( x\right) +\varsigma \right) = \tau \max \left( 0,1-tf\left( x\right) +\varsigma \right) \).
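A sketch of the truncated pinball loss evaluated through the decomposition in Eq. (13); up to an additive constant this matches the piecewise shape sketched in Fig. 3:

```python
def truncated_pinball(u, tau=0.5, varsigma=0.5):
    """P_{tau,varsigma}(u) = (1+tau)*max(0, u) - tau*max(0, u + varsigma), with u = 1 - t*f(x)."""
    return (1.0 + tau) * max(0.0, u) - tau * max(0.0, u + varsigma)

# The loss is zero for u < -varsigma (sparsity region), has slope -tau on [-varsigma, 0)
# (pinball-like behavior), and slope 1 for u >= 0 (hinge-like penalty).
```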
We replace the least squares loss function in Eq. (6) with the truncated pinball loss and employ the score value for each sample as follows:
where c is the penalty parameter.
The gradient \(\nabla _\beta \left( J\left( \beta \right) \right) \) of \(J\left( \beta \right) \) is as follows:
It can be proved that the minimum of (14) with respect to \(\beta \) should satisfy the following condition
The function \(J\left( \beta \right) \) in (14) can be decomposed into the sum of the convex function \(J_{vex}(\beta )\) and the concave function \(J_{cav}(\beta )\), i.e.,
Obviously, (17) is a non-differentiable non-convex optimization problem, which can be solved by the CCCP. The detailed procedure of the CCCP is shown in Algorithm 1.
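A generic sketch of the CCCP iteration used here: the concave part is replaced by its linearization at the current iterate, and the resulting convex subproblem is solved repeatedly until the iterates stabilize. The solver for the convex subproblem is passed in as a placeholder (in our setting it corresponds to the box-constrained QP derived below):

```python
import numpy as np

def cccp(beta0, solve_convex_subproblem, grad_concave, max_iter=50, tol=1e-5):
    """Concave-convex procedure for J(beta) = J_vex(beta) + J_cav(beta).

    At iteration k the concave part is linearized at beta_k, and the convex
    subproblem  min_beta  J_vex(beta) + grad_J_cav(beta_k)^T beta  is solved.
    """
    beta = beta0
    for _ in range(max_iter):
        g = grad_concave(beta)                  # gradient of the concave part at beta_k
        beta_new = solve_convex_subproblem(g)   # e.g. solve the subproblem (18) via (22)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```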
Using the CCCP to solve (17), the subproblem of the kth iteration can be expressed as:
where
By introducing the slack variables \(\xi =\left[ \xi _1,\ldots ,\xi _N\right] ^{\textrm{T}}\), (18) is equivalent to the following form:
According to the Lagrange method, we can obtain the following dual problem. The detailed solution process is shown in Appendix A.
where \(Q=TH{H}^{\textrm{T}}T\).
Set \(\lambda =\ \alpha -\delta \), and the lower and upper bounds of the box constraint are defined as \(\mathfrak {L}=-\delta \in \mathfrak {R}^N\) and \(\mathfrak {U}=\left( 1+\tau \right) cS-\ \delta \in \mathfrak {R}^N\). Then, (21) is equivalent to
The label t of the unknown sample x is determined by the following decision function.
The complete process of linear TPin-IFELM is shown in Algorithm 2.
3.3 Nonlinear Case
In the ELM kernel space [2, 3, 41], the membership degree of the sample is defined by
and the non-membership degree of the sample is defined as
where \(\mathcal {K}_{ELM}\left( x_i,x_i\right) = h\left( x_i\right) \cdot h\left( x_i\right) \), \(\mathcal {K}_{ELM}(x_i,x_j)= h\left( x_i\right) \cdot h(x_j)\), \(1\le i\le N\),
and
According to Eq. (24) and Eq. (25), the score function is defined as follows:
The original problem of nonlinear TPin-IFELM can be expressed as
Similar to linear TPin-IFELM, (27) can be solved by CCCP. In the kth iteration, the subproblem of (27) can be expressed as
where \(\varpi \) is the output weight vector in the ELM kernel space, and
The dual problem of (28) is as follows:
where \(\widetilde{Q}=T\Omega _{ELM}T\) and \(\Omega _{ELM}=H{H}^{\textrm{T}}\in \mathfrak {R}^{N\times N}\) whose element \({\Omega _{ELM}}_{ij}=\mathcal {K}_{ELM}(x_i,x_j)\).
Similar to linear TPin-IFELM, Eq. (30) is equivalent to
where \(\mathfrak {L}^\Phi =-\delta ^\Phi \in \mathfrak {R}^N\) and \(\mathfrak {U}^\Phi =\left( 1+\tau \right) cS^\Phi -\ \delta ^\Phi \in \mathfrak {R}^N\).
For the unknown sample x, the decision function of nonlinear TPin-IFELM is
The complete process of nonlinear TPin-IFELM is shown in Algorithm 3.
3.4 The Discussion
In this subsection, we discuss the relationship between TPin-IFELM and FELM. Similar to ELM, both TPin-IFELM and FELM randomly assign the input weights and the biases of the hidden layer, and then obtain the hidden layer output matrix through the activation function.
In order to suppress the negative effects of noises, FELM only uses the membership degree for each sample, while TPin-IFELM employs the membership and non-membership degrees based on the local information of samples. To further reduce the interference of noises, TPin-IFELM uses the truncated pinball loss function to not only maintain sparsity and robustness but also to enhance the re-sampling stability.
4 Properties of the TPin-IFELM
In this section, we analyze the theoretical properties of TPin-IFELM, including noise insensitivity, sparsity, weight scatter minimization, and misclassification error minimization.
4.1 Noise Insensitivity and Sparsity
In this subsection, we discuss the noise insensitivity and sparsity of TPin-IFELM. The sub-gradient function of (12) is
Equation (16) can be rewritten as:
where \(\textbf{0}\in \mathfrak {R}^N\) is a column vector whose elements are all zero.
For given \(\beta \), the index set can be partitioned into five sets,
Since \(\partial P_{\tau ,\varsigma }\left( 1-t_if\left( x_i\right) \right) =0\) when a sample is located in \(\mathcal {S}_0^\beta \), the samples in \(\mathcal {S}_0^\beta \) make no contribution to the calculation of \(\beta \). Therefore, \(\mathcal {S}_0^\beta \) is closely related to the sparsity of (14); in other words, the parameter \(\varsigma \) controls the number of samples in \(\mathcal {S}_0^\beta \). The smaller the value of \(\varsigma \), the more samples fall into \(\mathcal {S}_0^\beta \) and the sparser (14) becomes. In particular, when \(\varsigma \rightarrow 0\), the truncated pinball loss function reduces to the hinge loss function, which is very sensitive to noise. On the contrary, the larger the value of \(\varsigma \), the fewer samples fall into \(\mathcal {S}_0^\beta \), and (14) becomes robust to noise but gradually loses its sparsity. In particular, when \(\varsigma \rightarrow +\infty \), the truncated pinball loss function degenerates into the pinball loss, and the sparsity is completely lost.
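One consistent reading of the partition (35), under the assumption that the five sets are determined by the value of \(u_i = 1-t_if(x_i)\) relative to \(0\) and \(-\varsigma \) (the exact definition in (35) is not reproduced here, so this is a sketch only):

```python
def partition_samples(u, varsigma):
    """Split indices by u_i = 1 - t_i f(x_i): an assumed reading of the sets in (35)."""
    S0 = [i for i, ui in enumerate(u) if ui < -varsigma]       # zero loss: sparsity region
    S1 = [i for i, ui in enumerate(u) if ui == -varsigma]      # kink at -varsigma
    S2 = [i for i, ui in enumerate(u) if -varsigma < ui < 0]   # pinball (slope -tau) region
    S3 = [i for i, ui in enumerate(u) if ui == 0]              # on the positive/negative hyperplanes
    S4 = [i for i, ui in enumerate(u) if ui > 0]               # hinge (slope 1) region
    return S0, S1, S2, S3, S4
```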
According to the five sets \(\mathcal {S}_0^\beta \), \(\mathcal {S}_1^\beta \), \(\mathcal {S}_2^\beta \), \(\mathcal {S}_3^\beta \) and \(\mathcal {S}_4^\beta \) of (35), the optimality condition can be written as the existence of \(\psi _i\in \left[ -\tau ,0\right] \) and \(\zeta _i\in \left[ -\tau ,1\right] \) such that
The number of samples in \(\mathcal {S}_1^\beta \) and \(\mathcal {S}_3^\beta \) is much smaller than that in \(\mathcal {S}_0^\beta \), \(\mathcal {S}_2^\beta \) and \(\mathcal {S}_4^\beta \), and the samples in \(\mathcal {S}_1^\beta \) and \(\mathcal {S}_3^\beta \) make little contribution to Eq. (36). Therefore, the main concern here is the sets \(\mathcal {S}_0^\beta \), \(\mathcal {S}_2^\beta \) and \(\mathcal {S}_4^\beta \). When the parameter \(\varsigma \) is fixed to a suitable value, the parameter \(\tau \) controls the number of samples in \(\mathcal {S}_0^\beta \), \(\mathcal {S}_2^\beta \) and \(\mathcal {S}_4^\beta \), and thus affects the sparsity of (14). When \(\tau \) is large, such as \(\tau =1\), these three sets contain many samples, so (14) is robust to feature noise. When \(\tau \) is very small, such as \(\tau =0.1\), there are few samples in \(\mathcal {S}_4^\beta \), and (14) is more sensitive to noise. In particular, when \(\tau =0\), there are no samples or only a few samples in \(\mathcal {S}_4^\beta \); in this case, feature noise around the decision boundary has a significant negative effect on the constructed model. Since the total number of samples is fixed, the smaller \(\tau \) is, the fewer samples lie in \(\mathcal {S}_4^\beta \), the more samples lie in \(\mathcal {S}_0^\beta \), and the better the sparsity of (14).
In summary, the appropriate \(\tau \) and \(\varsigma \) are chosen to enable TPin-IFELM to better balance noise insensitivity and sparsity.
4.2 Weight Scatter and Misclassification Error Minimization
The mechanism of TPin-IFELM can also be explained by the weight scatter and misclassification error minimization. The positive hyperplane \(f_+\left( x\right) :{\beta }^{\textrm{T}} h\left( x\right) =1\) and the negative hyperplane \(f_-\left( x\right) :{\beta }^{\textrm{T}} h\left( x\right) =-1\) are constructed by the samples in \(\mathcal {S}_3^\beta \). The distance between positive and negative hyperplanes is \(\frac{2}{\left\| {\beta }\right\| }\). We can measure the weight scatter in terms of the sum of the distances of a given point from similar samples. In the random mapping feature space related to \(\beta \), the weight scatter of \(x_{i_0}\) can be defined as
If \(x_{i_0}\in \mathcal {S}_3^\beta \cap \mathcal {X}^+\), i.e., \({{\beta }^{\textrm{T}}}{h\left( x_{i_0}\right) }=1\) and \(t_{i_0}=1\), then
If \(x_{i_0}\in \mathcal {S}_3^\beta \cap \mathcal {X}^-\), i.e., \({{\beta }^{\textrm{T}}}{h\left( x_{i_0}\right) }=-1\) and \(t_{i_0}=-1\), then
Therefore,
can be interpreted as maximizing the distance between hyperplanes \(f_+\left( x\right) \) and \(f_-\left( x\right) \) and meanwhile minimizing weight scatter.
In (14), (40) is extended to \(P_{\tau ,\varsigma }\). The misclassification term
is introduced into Eq. (40), i.e.,
We obtain TPin-IFELM with \(C_1=c\left( 1+\tau \right) \) and \(C_2=c\tau \). Thus, TPin-IFELM can minimize both the weight scatter and misclassification errors, simultaneously.
5 Experiments
In this section, we verify the effectiveness of TPin-IFELM through a series of experiments on an artificial dataset and benchmark datasets (see the Notes section for dataset sources).
5.1 Experimental Configuration
In order to evaluate the effectiveness of TPin-IFELM, we compare it with eight other state-of-the-art algorithms. TPin-IFELM with SFA, which replaces the score function SV in TPin-IFELM with the score function SFA of PIFTSVM, contains four parameters c, L, \(\tau \), and \(\varsigma \), while TPin-IFELM contains five parameters c, L, \(\tau \), \(\varsigma \) and K. To ensure the objectivity of the experiments, for datasets with fewer than 2000 samples, the penalty parameter c of TPin-IFELM and TPin-IFELM with SFA, the penalty parameter C of OELM, RELM, and FELM, and the penalty parameters \(C_1\) and \(C_2\) of TELM, SPTELM, and PIFTSVM are searched from the set \(\left\{ 2^i|i=-10,-8,\ldots ,8,10\right\} \), and the number of hidden layer nodes L for these algorithms is searched from \(\left\{ 50,100,200,500\right\} \). For datasets with 2000 or more samples, the penalty parameters c, C, \(C_1\) and \(C_2\) are searched from \(\left\{ 2^i|i=-10,-6,\ldots ,6,10\right\} \), and the number of hidden layer nodes L is searched from \(\left\{ 50,100,200\right\} \). \(\tau \) and \(\varsigma \) are searched from \(\left\{ 0.25,0.5,0.75\right\} \), \(\varepsilon \) is searched from \(\{0,0.2,0.5\}\), and for TPin-IFELM, the number of nearest neighbors K is searched from \(\left\{ 1,3,\ldots ,20\right\} \).
We implement all algorithms in MATLAB (R2018a). The experimental environment is a workstation with an 11th Gen Intel Core i5-11400H (2.70 GHz) processor and 16 GB of RAM. We use quadprog to solve the quadratic programming problems and use three evaluation metrics to assess the classification performance: accuracy (ACC), the area under the ROC curve (AUC), and the \(F_1\)-measure \((F_1)\).
where FN denotes the number of false negatives, FP denotes the number of false positives, TN denotes the number of true negatives and TP denotes the number of true positives.
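The standard definitions of these metrics, which we take to be the ones intended here (AUC is computed as the area under the ROC curve rather than from a closed-form expression), are:

\[
\mathrm{ACC}=\frac{TP+TN}{TP+TN+FP+FN},\qquad
F_1=\frac{2TP}{2TP+FP+FN}=\frac{2\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},
\]

with \(\mathrm{Precision}=\frac{TP}{TP+FP}\) and \(\mathrm{Recall}=\frac{TP}{TP+FN}\).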
5.2 Experiments on the Artificial Dataset
To verify the robustness and sparsity of TPin-IFELM, we conduct comparative experiments on an artificial dataset with two-dimensional features. The training and test sets consist of 200 and 50 samples, respectively. The positive and negative samples are generated from the Gaussian distributions \(\mathcal {X}^+\sim \mathcal {N}\left( \mathcal {V}_1,\Sigma _1\right) \) and \(\mathcal {X}^-\sim \mathcal {N}\left( \mathcal {V}_2,\Sigma _2\right) \), respectively, where \(\mathcal {V}_1=\left[ 1,1\right] ^{\textrm{T}}\), \(\mathcal {V}_2=\left[ -1,-1\right] ^{\textrm{T}}\) and \(\Sigma _1=\Sigma _2=\begin{bmatrix}1&0\\0&1\end{bmatrix}\).
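A sketch of the data generation described above; a fixed seed and an even split per class are assumptions added for reproducibility, and the 200/50 split is taken from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
mean_pos, mean_neg = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
cov = np.eye(2)                                   # identity covariance for both classes

def make_split(n):
    n_pos = n // 2
    X = np.vstack([rng.multivariate_normal(mean_pos, cov, n_pos),
                   rng.multivariate_normal(mean_neg, cov, n - n_pos)])
    t = np.hstack([np.ones(n_pos), -np.ones(n - n_pos)])
    return X, t

X_train, t_train = make_split(200)
X_test, t_test = make_split(50)
```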
As shown in Fig. 4, the red “+” and blue “\(\times \)” denote the positive training samples and the negative training samples, respectively. The pink “+” and green “\(\times \)” denote the positive test samples and the negative test samples, respectively. The support vectors are circled by “\(\circ \)”, and the noises identified by the algorithm are framed by black “\(\diamond \)”. We can see that compared with FELM, TPin-IFELM with SFA and TPin-IFELM use both the membership and non-membership degrees and the truncated pinball loss function, so they can more effectively reduce the negative effect of noises. The number of support vectors of TPin-IFELM with SFA and TPin-IFELM is 33% and 29% of the total number of samples, respectively. Thus, compared with FELM, TPin-IFELM with SFA and TPin-IFELM are sparse. Table 1 shows the experimental results of FELM, TPin-IFELM with SFA, and TPin-IFELM on the artificial dataset, and the best results of each evaluation indicator are shown in bold. As shown in Table 1, TPin-IFELM is superior to FELM and TPin-IFELM with SFA in terms of ACC and AUC and is second only to TPin-IFELM with SFA in terms of \(F_1\).
5.3 Experiments on the Benchmark Datasets
To evaluate the effectiveness and robustness of TPin-IFELM, we conduct comparative experiments on 15 benchmark datasets. The detailed characteristics of the datasets are shown in Table 2, where #Samples, #Positive samples, #Negative samples, and #Features denote the number of samples, the number of positive samples, the number of negative samples and the number of features, respectively.
In order to verify the classification performance of TPin-IFELM against the eight comparison algorithms, we conduct extensive experiments on fifteen benchmark datasets; Appendix B provides additional experimental results. Unlike the other seven comparison algorithms, TPin-IFELM with SFA and TPin-IFELM employ the membership and non-membership degrees to effectively identify the role of each sample in the classification process. As shown in Tables 7, 8 and 9, in terms of the average rank, each evaluation metric of TPin-IFELM is superior to that of the other eight algorithms, and the ACC and AUC of TPin-IFELM with SFA are second only to those of TPin-IFELM.
Noise is commonly present in real datasets and can reduce the classification performance of algorithms. To demonstrate the robustness of TPin-IFELM, we conduct label-noise experiments on the 15 benchmark datasets: we randomly select 50% of the training samples and add label noise to them. The experimental results are shown in Tables 10, 11 and 12. We can observe that all algorithms are negatively affected by the samples with label noise. However, TPin-IFELM is less disturbed by label noise than the other eight comparison algorithms and is superior to them on most datasets. For classification problems with both label noise and feature noise, we add Gaussian noise [39] following the normal distribution \(N\left( 0,\sigma ^2\right) \) with \(\sigma =0.5\) to the training set to form a training set with feature noise, and then randomly select 50% of the training samples as samples with label noise. Tables 3, 4 and 5 show the experimental results, with the best results for each dataset shown in bold. TPin-IFELM with SFA and TPin-IFELM are less disturbed by label and feature noise than the other seven algorithms and are superior to them on most datasets.
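A sketch of the two noise settings used in these experiments: corrupting the labels of a randomly chosen fraction of the training samples (for the binary labels here we take this to mean flipping the sign, an assumption) and adding zero-mean Gaussian feature noise with \(\sigma =0.5\); applying both reproduces the label-plus-feature-noise setting:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_label_noise(t, ratio=0.5):
    t_noisy = t.copy()
    idx = rng.choice(len(t), size=int(ratio * len(t)), replace=False)
    t_noisy[idx] = -t_noisy[idx]                       # flip the labels of the selected samples
    return t_noisy

def add_feature_noise(X, sigma=0.5):
    return X + rng.normal(0.0, sigma, size=X.shape)    # Gaussian noise N(0, sigma^2)
```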
From the above noise experimental results, we can observe that ELM, OELM, RELM, TELM, and SPTELM do not consider the membership degree of the samples to reduce the negative impact of the noise, resulting in a significant decrease in their classification performance. Different from FELM, TPin-IFELM with SFA and TPin-IFELM employ the membership and non-membership degrees to effectively identify the role of the samples and the noise in the classification process. At the same time, they introduce the truncated pinball loss function to enhance the robustness of the model. Compared to TPin-IFELM with SFA, TPin-IFELM uses the local information of the samples to construct the more appropriate membership and non-membership degrees of the samples. Therefore, TPin-IFELM can better solve the classification problems with noise than the other eight comparison algorithms.
5.4 Statistical Analysis
From Tables 7, 8, 9, 10, 11 and 12 and Tables 3, 4 and 5, we can observe that no algorithm outperforms all the others on all datasets. In this subsection, we use the Friedman test [42] to analyze these algorithms statistically. Given \(\mathfrak {K}\) comparison algorithms and \(\mathcal {N}\) datasets, let \(r_i^j\) denote the rank of the j-th algorithm on the i-th dataset, and let \(R_j=\frac{1}{\mathcal {N}}\sum _{i=1}^{\mathcal {N}}r_i^j\) denote the average rank of the j-th algorithm. The Friedman statistic is \(F_F=\frac{\left( \mathcal {N}-1\right) \chi _F^2}{\mathcal {N} \left( \mathfrak {K}-1\right) -\chi _F^2}\sim F\left( \mathfrak {K}-1,\left( \mathfrak {K}-1\right) \left( \mathcal {N}-1\right) \right) \), where \(\chi _F^2=\frac{12\mathcal {N}}{\mathfrak {K} \left( \mathfrak {K}+1\right) }\left[ \sum _{j=1}^{\mathfrak {K}}{R_j^2 -\frac{\mathfrak {K}\left( \mathfrak {K}+1\right) ^2}{4}}\right] \). Table 6 shows the Friedman test results on the datasets without noise and with noise. We observe that the Friedman statistics are much larger than the critical value, so the null hypothesis that all algorithms have the same classification performance is rejected, i.e., there is a significant difference in classification performance among the algorithms.
The difference between TPin-IFELM and the other eight algorithms is further compared by using the Nemenyi test [42]. The average rank difference between pairs of algorithms is compared with the critical difference (CD), where \(\textrm{CD}=q_\alpha \sqrt{\frac{\mathfrak {K}\left( \mathfrak {K}+1\right) }{6\mathcal {N}}}\). For the Nemenyi test with nine algorithms, \(q_\alpha =3.102\) at the significance level \(\alpha =0.05\); thus, for the experiments without noise, \(\textrm{CD}=3.102\ \left( \mathfrak {K}=9,\mathcal {N}=15\right) \), and for the experiments with noise, \(\textrm{CD}=1.5510\ (\mathfrak {K}=9, \mathcal {N}=60)\). The CD diagrams of all evaluation metrics with and without noise are shown in Fig. 5. We observe that TPin-IFELM is superior to the other eight algorithms on each evaluation metric.
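A sketch that reproduces these statistics from the formulas above, given a rank matrix r (rows: datasets, columns: algorithms); the CD values quoted in the text follow from \(q_{0.05}=3.102\) for nine algorithms:

```python
import numpy as np

def friedman_nemenyi(r, q_alpha=3.102):
    """r: N x K matrix of ranks (N datasets, K algorithms)."""
    N, K = r.shape
    R = r.mean(axis=0)                                            # average rank per algorithm
    chi2_F = 12.0 * N / (K * (K + 1)) * (np.sum(R**2) - K * (K + 1)**2 / 4.0)
    F_F = (N - 1) * chi2_F / (N * (K - 1) - chi2_F)               # Friedman statistic
    CD = q_alpha * np.sqrt(K * (K + 1) / (6.0 * N))               # Nemenyi critical difference
    return F_F, CD

# Example: K = 9 algorithms with N = 15 datasets gives CD = 3.102; N = 60 gives CD = 1.551.
```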
5.5 Sensitivity Analysis
To analyze the parameter sensitivity of TPin-IFELM and the performance of methods for obtaining sample structure information, we conduct experiments on the benchmark datasets. The main parameters of TPin-IFELM include the penalty parameter c, the number L of hidden layer nodes, the parameter \(\tau \), the parameter K, and the parameter \(\varsigma \). The methods for obtaining sample structure information include KNN, K-Means, and Ward Linkage.
5.5.1 Methods for Obtaining Sample Structure Information
In order to investigate the impact of different methods of obtaining sample structure information on TPin-IFELM, we use KNN, K-Means, and Ward Linkage to extract the local information of samples and conduct experiments on the Sonar and Colon-cancer datasets. The comparative results are shown in Fig. 6, where TPin-IFELM using KNN achieves the best performance. Compared with K-Means and Ward Linkage, KNN can more effectively capture the correlation between a sample and the heterogeneous samples in its neighborhood, thus obtaining valuable local information.
5.5.2 Parameters c and L
To analyze the sensitivity of TPin-IFELM to c and L, we perform parameter sensitivity experiments on the Heart and Ionosphere datasets. The parameter c is searched from \(\{2^i\mid i=-10,-8,\ldots ,8,10\}\), the parameter L is searched from \(\{50,100,200,500\}\), and the other parameters are fixed. From Fig. 7, we can observe that the ACC, AUC, and \(F_1\) of TPin-IFELM are higher when the values of c and L are larger. In general, TPin-IFELM is sensitive to the parameter c and is less affected by changes in L.
5.5.3 Parameters \(\tau \), \(\varsigma \) and K
To analyze the effects of the parameters \(\tau \), \(\varsigma \), and K on the classification performance of TPin-IFELM, we conduct experiments on the Colon-cancer, Sonar, Heart and Ionosphere datasets without noise and with noise. Two types of noise are considered: 30% label noise, and 50% label noise combined with feature noise of \(\sigma = 0.5\). As shown in Figs. 8, 9 and 10, for samples without noise, TPin-IFELM is minimally affected by the parameter \(\tau \), except on the Colon-cancer dataset; however, for samples with noise, TPin-IFELM is strongly affected by \(\tau \). As shown in Figs. 11, 12 and 13, for samples without noise, TPin-IFELM is minimally affected by the parameter \(\varsigma \); however, for samples with noise, TPin-IFELM is sensitive to \(\varsigma \). As shown in Figs. 14, 15 and 16, TPin-IFELM is sensitive to the parameter K.
6 Conclusion
Inspired by intuitionistic fuzzy theory and the truncated pinball loss, we propose a novel classification model in this paper. TPin-IFELM employs the KNN method to obtain the local information of samples, from which more suitable membership and non-membership degrees are constructed. TPin-IFELM exploits the membership and non-membership degrees to effectively identify whether boundary samples are noise and uses the truncated pinball loss function, which makes it more robust and sparse. Extensive experiments verify the effectiveness of TPin-IFELM: compared with state-of-the-art comparison algorithms, it achieves superior classification performance. In future work, we will extend the proposed model to multi-view classification problems.
Data availability
The datasets generated during and/or analyzed during the current study are available in the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets.php, and the LIBSVM data repository, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Notes
The datasets are available at https://archive.ics.uci.edu/ml/datasets.php and https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
References
Huang G, Huang G, Song S, You K (2015) Trends in extreme learning machines: a review. Neural Netw 61:32–48
Huang G, Zhou H, Ding X, Zhang R (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans Cybernet 42:513–529
Huang G, Ding X, Zhou H (2010) Optimization method based extreme learning machine for classification. Neurocomputing 74(1–3):155–163
Sun P, Yang L (2022) Generalized eigenvalue extreme learning machine for classification. Appl Intell 52(6):6662–6691
Ahuja B, Vishwakarma VP (2021) Deterministic multi-kernel based extreme learning machine for pattern classification. Expert Syst Appl 183:115308
Ren L, Liu J, Gao Y, Kong X, Zheng C (2021) Kernel risk-sensitive loss based hyper-graph regularized robust extreme learning machine and its semi-supervised extension for classification. Knowl-Based Syst 227:107226
Wong H, Leung H, Leung C, Wong E (2022) Noise/fault aware regularization for incremental learning in extreme learning machines. Neurocomputing 486:200–214
Luo J, Wong C, Vong C (2021) Multinomial bayesian extreme learning machine for sparse and accurate classification model. Neurocomputing 423:24–33
Liu Z, Jin W, Mu Y (2020) Variances-constrained weighted extreme learning machine for imbalanced classification. Neurocomputing 403:45–52
Zong W, Huang G, Chen Y (2013) Weighted extreme learning machine for imbalance learning. Neurocomputing 101:229–242
Li Y, Zhang J, Zhang S, Xiao W, Zhang Z (2022) Multi-objective optimization-based adaptive class-specific cost extreme learning machine for imbalanced classification. Neurocomputing 496:107–120
Xiao W, Zhang J, Li Y, Zhang S, Yang W (2017) Class-specific cost regulation extreme learning machine for imbalanced classification. Neurocomputing 261:70–82
Dutta AK, Qureshi B, Albagory Y, Alsanea M, Al Faraj M, Sait ARW (2023) Optimal weighted extreme learning machine for cybersecurity fake news classification. Comput Syst Sci Eng 44(3):2395–2409
Tummalapalli S, Kumar L, Neti LBM, Krishna A (2022) Detection of web service anti-patterns using weighted extreme learning machine. Comput Stand Interfaces 82:103621
El Bourakadi D, Yahyaouy A, Boumhidi J (2022) Improved extreme learning machine with autoencoder and particle swarm optimization for short-term wind power prediction. Neural Comput Appl 34(6):4643–4659
Xia J, Yang D, Zhou H, Chen Y, Zhang H, Liu T, Heidari AA, Chen H, Pan Z (2022) Evolving kernel extreme learning machine for medical diagnosis via a disperse foraging sine cosine algorithm. Comput Biol Med 141:105137
Lin Z, Gao Z, Ji H, Zhai R, Shen X, Mei T (2022) Classification of cervical cells leveraging simultaneous super-resolution and ordinal regression. Appl Soft Comput 115:108208
Gao Z, Hu Q, Xu X (2022) Condition monitoring and life prediction of the turning tool based on extreme learning machine and transfer learning. Neural Comput Appl 34(5):3399–3410
Wang Y, Li R, Chen Y (2021) Accurate elemental analysis of alloy samples with high repetition rate laser-ablation spark-induced breakdown spectroscopy coupled with particle swarm optimization-extreme learning machine. Spectrochim Acta Part B-Atomic Spectrosc 177:106077
Wu D, Wang X, Wu S (2022) A hybrid framework based on extreme learning machine, discrete wavelet transform, and autoencoder with feature penalty for stock prediction. Expert Syst Appl 207:118006
Wang GC, Zhang Q, Band SS, Dehghani M, Chau KW, Tho QT, Zhu S, Samadianfard S, Mosavi A (2022) Monthly and seasonal hydrological drought forecasting using multiple extreme learning machine models. Eng Appl Comput Fluid Mech 16(1):1364–1381
Wang L, Khishe M, Mohammadi M, Mahmoodzadeh A (2022) Extreme learning machine evolved by fuzzified hunger games search for energy and individual thermal comfort optimization. J Build Eng 60:105187
Al-Yaseen WL, Idrees AK, Almasoudy FH (2022) Wrapper feature selection method based differential evolution and extreme learning machine for intrusion detection system. Pattern Recogn 132:108912
Ren Z, Yang L (2018) Correntropy-based robust extreme learning machine for classification. Neurocomputing 313:74–84
Wang Y, Yang L, Yuan C (2019) A robust outlier control framework for classification designed with family of homotopy loss function. Neural Netw 112:41–53
Ren Z, Yang L (2019) Robust extreme learning machines with different loss functions. Neural Process Lett 49(3):1543–1565
Shen J, Ma J (2019) Sparse twin extreme learning machine with epsilon-insensitive zone pinball loss. IEEE Access 7:112067–112078
Huang Z, Li J (2022) Discernibility measures for fuzzy \(\beta \) covering and their application. IEEE Trans Cybernet 52(9):9722–9735
Huang Z, Li J, Qian Y (2022) Noise-tolerant fuzzy-\(\beta \)-covering-based multigranulation rough sets and feature subset selection. IEEE Trans Fuzzy Syst 30(7):2721–2735
Huang Z, Li J. Noise-tolerant discrimination indexes for fuzzy \(\gamma \) covering and feature subset selection. IEEE Trans Neural Netw Learn Syst (Early Access)
Lin C, Wang S (2002) Fuzzy support vector machines. IEEE Trans Neural Netw 13(2):464–471
Zhang W, Ji H (2013) Fuzzy extreme learning machine for classification. Electron Lett 49(7):448–449
Shen X, Niu L, Qi Z, Tian Y (2017) Support vector machine classifier with truncated pinball loss. Pattern Recogn 68:199–210
Wang H, Xu Y, Zhou Z (2021) Twin-parametric margin support vector machine with truncated pinball loss. Neural Comput Appl 33(8):3781–3798
Yuille A, Rangarajan A (2003) The concave-convex procedure. Neural Comput 15(4):915–936
Lipp T, Boyd S (2016) Variations and extension of the convex-concave procedure. Optim Eng 17(2):263–287
Huang G, Zhu Q, Siew C-K (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1–3):489–501
Rezvani S, Wang X, Pourpanah F (2019) Intuitionistic fuzzy twin support vector machines. IEEE Trans Fuzzy Syst 27(11):2140–2151
Liang Z, Zhang L (2022) Intuitionistic fuzzy twin support vector machines with the insensitive pinball loss. Appl Soft Comput 115:108231
Laxmi S, Gupta SK (2022) Multi-category intuitionistic fuzzy twin support vector machines with an application to plant leaf recognition. Eng Appl Artif Intell 110:104687
Wong CM, Vong CM, Wong PK, Cao J (2018) Kernel-based multilayer extreme learning machines for representation learning. IEEE Trans Neural Netw Learn Syst 29(3):757–762
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Acknowledgements
This work was supported in part by the Natural Science Foundation of Liaoning Province in China (2020-MS-281). We thank all anonymous reviewers for their helpful comments, which improved the quality of this paper.
Author information
Contributions
Conceptualization: QG, QA; Methodology: QG, QA; Writing-original draft preparation: QG; Writing-review and editing: QG, QA; Funding acquisition: QA; Supervision: QA, WW.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
All authors contributed to the conception and design of the study. All authors read and approved the final manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Process of Obtaining the Dual Problem of (20)
In this section, we focus on solving the Problem (20). For clarity, the iteration superscript \(k-1\) is removed. By introducing the Lagrangian multipliers \(\alpha \) and \(\theta \), the Lagrangian function of the original problem (20) is
where \(S=\left[ s_1,\ldots ,s_N\right] ^{\textrm{T}}\), \(\delta =\left[ \delta _1,\ldots ,\delta _N\right] ^{\textrm{T}}\), \(T=\left[ \begin{array}{ccc} t_1&{}&{}\\ &{}\ddots &{}\\ &{}&{}t_N\\ \end{array}\right] \), \(H=\left[ h\left( x_1\right) ,\ldots ,h\left( x_N\right) \right] ^{\textrm{T}}\in \mathfrak {R}^{N\times L}\), \(e=\left[ 1,\ldots ,1\right] ^{\textrm{T}}\in \mathfrak {R}^N\), \(\alpha =\left[ \alpha _1,\ldots ,\alpha _N\right] ^{\textrm{T}}\) and \(\theta =\left[ \theta _1,\ldots ,\theta _N\right] ^{\textrm{T}}\) are the Lagrangian multiplier vectors.
According to the KKT conditions, we can obtain
From (A2), we can obtain
According to Eq. (A3) and Eq. (A6), we can derive
By substituting Eq. (A3) and Eq. (A7) into Eq. (A1), we can obtain the following dual problem.
where \(Q=TH{H}^{\textrm{T}}T\).
Appendix B Additional Experiments
We present the experimental results in the noise-free environment and 50% label noise environment. The best results for each dataset are shown in bold.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.