Abstract
Fuzzy extreme learning machine (FELM) is an effective algorithm for dealing with classification problems with noise, as it uses a membership function to suppress the influence of noisy data. However, FELM has the following drawbacks: (a) the membership degree of samples in FELM is constructed by considering only the distance between the samples and the class center, not the local information of samples, so some boundary samples are easily mistaken for noise; (b) FELM uses the least squares loss function, which makes it sensitive to feature noise and unstable under re-sampling. To address these drawbacks, we propose an intuitionistic fuzzy extreme learning machine with the truncated pinball loss (TPin-IFELM). Firstly, we use the K-nearest neighbor (KNN) method to obtain local information of the samples and then construct membership and non-membership degrees for each sample in the random mapping feature space based on this valuable local information. Secondly, we calculate the score value of samples based on the membership and non-membership degrees, which can effectively identify whether boundary samples are noise or not. Thirdly, to maintain the sparsity and robustness of the model and enhance its re-sampling stability, we introduce the truncated pinball loss function into the model. Finally, we employ the concave-convex procedure (CCCP) to solve the resulting non-convex optimization problem efficiently. Extensive comparative experiments on benchmark datasets verify the superior performance of TPin-IFELM.
1 Introduction
Extreme learning machine (ELM) [1] is a simple and efficient single-hidden layer feedforward neural network (SLFN) that is much faster than other feedforward neural networks. The input weights and hidden layer biases of ELM are randomly generated, and the network output weights are obtained by using Moore-Penrose generalized inverse. The objective of ELM is to achieve the minimal norm of output weights and the minimum training error. In recent years, to improve the classification performance of ELM, some improvements have been proposed [2,3,4,5,6,7,8,9,10,11,12]. Due to their excellent classification performance, these algorithms have been applied to a wide range of fields [13,14,15,16,17,18,19,20,21,22,23].
In practical classification problems, there is a large amount of noise in the data. Noise can interfere with the construction of classifiers and reduce the classification performance of algorithms, yet traditional ELMs cannot effectively suppress its negative impact. To enhance the noise robustness of ELM, Ren et al. [24] proposed the correntropy-based hinge loss robust extreme learning machine (CHELM). Wang et al. [25] proposed the extreme learning machine with the homotopy loss (\(l_1\)-HELM), which introduces the homotopy loss into ELM. To enhance noise robustness and re-sampling stability, Ren et al. [26] proposed the extreme learning machine with the pinball loss function (PELM), which maximizes the quantile distance between two classes of samples. To further achieve robustness and sparsity, Shen et al. [27] introduced the \(\varepsilon \)-insensitive zone pinball loss into ELM. In practical classification problems, the data also contain a large number of redundant or irrelevant features, which negatively affect these algorithms and reduce their performance. To remove redundant or irrelevant features, Huang et al. [28] proposed a method that employs a new fuzzy \(\beta \) neighborhood-related discernibility measure and fuzzy \(\beta \) covering (FBC) decision tables. To enhance the robustness of FBC in feature learning, Huang et al. [29] proposed a noise-tolerant fuzzy-\(\beta \)-covering-based multigranulation rough set model. To deal with noisy data, the VPDI method [30] uses noise-tolerant discrimination indexes and a heuristic feature selection algorithm to reduce redundant or irrelevant features. Although these FBC-based feature selection algorithms can eliminate redundant or irrelevant features, they do not consider the importance of individual samples in the classification process. In recent years, determining the importance of samples has become a research hotspot. Inspired by the fuzzy support vector machine (FSVM) [31], Zhang et al. [32] proposed the fuzzy extreme learning machine (FELM), which assigns a membership degree to each training sample to reduce the influence of outliers and noise.
FELM is an effective algorithm for dealing with classification problems with noise. However, it has two drawbacks: (1) FELM considers only the membership degree of samples and not their non-membership degree, so boundary samples are easily mistaken for noise. (2) FELM uses the least squares loss function, which makes it sensitive to feature noise and unstable under re-sampling. To address these drawbacks, we propose an improved ELM model combining intuitionistic fuzzy sets (IFSs) and the truncated pinball loss function, called the intuitionistic fuzzy extreme learning machine with the truncated pinball loss (TPin-IFELM). First, TPin-IFELM constructs the membership and non-membership degrees based on the local information of samples obtained by the KNN method. The membership degree is calculated from the distance between the sample and the class center, and the non-membership degree is calculated from the correlation between all heterogeneous samples and all samples in its neighborhood. Further, we obtain the score value of each sample from its membership and non-membership degrees; this score can be used to effectively identify whether boundary samples are noise. Finally, to further reduce the negative effects of noise, we introduce the truncated pinball loss function [33, 34] into TPin-IFELM, which makes the model more robust and sparse. Since TPin-IFELM is a non-convex problem, we use the CCCP [35, 36] to solve it. Extensive experimental results show that the proposed TPin-IFELM is superior to several state-of-the-art comparison algorithms in dealing with classification problems with noise.
The rest of this paper is organized as follows. In Sect. 2, we briefly review ELM and its loss functions, and FELM and its improvement. In Sect. 3, we discuss the optimization model for the linear and nonlinear TPin-IFELM in detail. In Sect. 4, we investigate some properties of TPin-IFELM. In Sect. 5, the TPin-IFELM is evaluated via a series of experiments. Section 6 summarizes this paper and puts forward the future research direction.
2 Related Works
2.1 Notations
We define the binary dataset \(D=\left\{ \left( x_i,t_i\right) \mid 1\le i\le N\right\} \), where \(t_i\in \left\{ +1,-1\right\} \). Let \(\mathcal {X}^+=\left\{ x_i\mid \left( x_i,t_i\right) \in D,t_i=+1\right\} \) denote the positive samples, \(\mathcal {X}^-=\{x_j\mid (x_j,t_j)\in D,t_j=-1\}\) denote the negative samples, \(N^+=\vert \mathcal {X}^+\vert \), \(N^-=\vert \mathcal {X}^-\vert \), \(\mathcal {X}=\mathcal {X}^+\cup \mathcal {X}^-\), and \(N=N^++N^-\).
2.2 ELM and its Loss Functions
ELM is an effective single-hidden layer feedforward neural network [2, 37]. First, the input weights and hidden layer biases are randomly assigned; then the hidden layer output matrix H is obtained by applying the activation function \(G\left( \cdot \right) \); finally, the output weights \(\beta \) are obtained via the Moore-Penrose generalized inverse.
The output of ELM is defined as follows:
where \(\theta _i=[\theta _{i1},\ldots ,\theta _{in}]^{\textrm{T}}\in \mathfrak {R}^n\) and \(\vartheta _i\in \mathfrak {R}\) are the input weight vector and bias of the corresponding hidden node, respectively, \(h\left( x\right) =\left[ G(\theta _1^{\textrm{T}}x+\vartheta _1),\ldots ,G(\theta _L^{\textrm{T}}x+\vartheta _L)\right] ^{\textrm{T}}\in \mathfrak {R}^L\) is the random feature mapping output of the hidden layer, and \(\beta =\left[ \beta _1,\beta _2,\ldots ,\beta _L\right] ^{\textrm{T}}\in \mathfrak {R}^L\) is the output weight vector.
\(\beta \) can be obtained by solving
The optimal solution of (2) can be calculated by
where \(H=\left[ h\left( x_1\right) ,\ldots ,h\left( x_N\right) \right] ^{\textrm{T}}\in \mathfrak {R}^{N\times L}\) and \(T={\ \left[ t_1,t_2,\ldots ,t_N\right] }^{\textrm{T}}\in \mathfrak {R}^N\) is the output vector. \(H^\dag \) is the Moore-Penrose generalized inverse of matrix H.
The decision function of ELM is
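For concreteness, the following is a minimal sketch of the ELM training and prediction steps described above, assuming a sigmoid activation and the usual decision rule \(\textrm{sign}(h(x)^{\textrm{T}}\beta )\); the variable names are illustrative only.

```python
import numpy as np

def elm_train(X, t, L=100, rng=np.random.default_rng(0)):
    """Train a basic ELM: random hidden layer + Moore-Penrose pseudo-inverse."""
    n = X.shape[1]
    Theta = rng.standard_normal((L, n))                   # random input weights
    vartheta = rng.standard_normal(L)                     # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ Theta.T + vartheta)))   # sigmoid feature map, N x L
    beta = np.linalg.pinv(H) @ t                          # output weights via H^dagger T
    return Theta, vartheta, beta

def elm_predict(X, Theta, vartheta, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ Theta.T + vartheta)))
    return np.sign(H @ beta)                              # assumed decision rule sign(h(x)^T beta)
```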
In order to improve the classification performance of ELM, Huang et al. [3] proposed the optimization method-based ELM (OELM), which introduces the hinge loss function into ELM. To speed up the solution, Huang et al. [2] proposed the regularized ELM (RELM), which introduces the least squares loss function into ELM. However, OELM and RELM are sensitive to noise. To enhance the noise robustness of ELM, Ren et al. [26] proposed the extreme learning machine with the pinball loss function (PELM), which maximizes the quantile distance between two classes of samples.
For convenience, we unify the optimization problems of these algorithms as follows:
where \(L\left( \textrm{U}\right) \) is the loss function. When \(L\left( \textrm{U}\right) \) is the hinge loss or pinball loss function, \(\textrm{U}=1-t_ih\left( x_i\right) \cdot \beta \) (see Fig. 1a, b); when \(L\left( \textrm{U}\right) \) is the least squares loss function, \(\textrm{U}=t_i-h\left( x_i\right) \cdot \beta \) (see Fig. 1c).
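For reference, the standard forms of the three losses referred to above are (our notation, since Eq. (5) and Fig. 1 are not reproduced here; the pinball form follows [26]):

\[
L_{\textrm{hinge}}(\textrm{U})=\max (0,\textrm{U}),\qquad
L_{\tau }^{\textrm{pin}}(\textrm{U})=\begin{cases} \textrm{U}, & \textrm{U}\ge 0\\ -\tau \textrm{U}, & \textrm{U}<0 \end{cases},\qquad
L_{\textrm{ls}}(\textrm{U})=\textrm{U}^2.
\]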
2.3 FELM and Its Improvements
FELM [32] employs a membership degree to each training sample to reduce the influence of outliers and noise. The optimization problem of FELM can be formulated as follows:
where \(\xi _i\) is the training error, c is the penalty parameter, \(s_i=\begin{cases} 1-\frac{\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^+ \right\| }{r^++\delta }, & t_i=+1\\ 1-\frac{\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^- \right\| }{r^-+\delta }, & t_i=-1 \end{cases}\) is the membership degree of \(x_i\) in the random mapping feature space, \(\delta >0\) is a small positive constant, \({\widetilde{\mathcal {C}}}^+=\frac{1}{N^+}\sum _{x_i\in \mathcal {X}^+} h\left( x_i\right) \) and \({\widetilde{\mathcal {C}}}^-=\frac{1}{N^-}\sum _{x_i\in \mathcal {X}^-} h\left( x_i\right) \) are the centers of the positive and negative classes, respectively, and \(r^+=\max _{x_i\in \mathcal {X}^+}(\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^+ \right\| )\) and \(r^-=\max _{x_i\in \mathcal {X}^-}(\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^- \right\| )\) are the radii of the positive and negative classes, respectively.
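A minimal sketch of the FELM membership computation in the random feature space, directly following the formula for \(s_i\) above (\(\delta \) is the small constant in the denominator; the names are illustrative):

```python
import numpy as np

def felm_membership(H, t, delta=1e-7):
    """Membership degrees s_i = 1 - ||h(x_i) - C_class|| / (r_class + delta)."""
    s = np.empty(len(t))
    for label in (+1, -1):
        idx = (t == label)
        center = H[idx].mean(axis=0)                      # class center in feature space
        dists = np.linalg.norm(H[idx] - center, axis=1)
        radius = dists.max()                              # class radius
        s[idx] = 1.0 - dists / (radius + delta)
    return s
```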
However, FELM only considers the membership degree of samples, which can easily mistake some boundary samples for noise. To identify the noise among the support vectors, Rezvani et al. proposed intuitionistic fuzzy twin support vector machines (IFTSVM), which use intuitionistic fuzzy sets (IFSs) to construct the score values of samples [38]. In order to enhance the robustness and re-sampling stability of IFTSVM, Liang et al. proposed an intuitionistic fuzzy twin support vector machine with the \(\varepsilon \)-insensitive pinball loss (PIFTSVM) [39], which defines a score function named SFA:
where \(\mu (x)\) is the membership function and \(\nu (x)\) is the non-membership function. Laxmi et al. proposed multi-category intuitionistic fuzzy twin support vector machines to solve multi-class classification problems [40]. To effectively address class imbalance, Rezvani et al. proposed class imbalance learning using fuzzy ART and intuitionistic fuzzy twin support vector machines.
3 Intuitionistic Fuzzy Extreme Learning Machines with the Truncated Pinball Loss
In this section, we propose TPin-IFELM to address the drawbacks of FELM. The algorithm framework of TPin-IFELM is shown in Fig. 2.
3.1 Intuitionistic Fuzzy Settings
FELM easily mistakes some boundary samples for noise because it uses only the membership degree. To address this issue, in this subsection we construct an IFS for each sample to reduce the negative impact of noise.
Define an intuitionistic fuzzy set \(\bar{A}=\ \left\{ \left( x,\mu _{\bar{A}}\left( x\right) ,\nu _{\bar{A}}\left( x\right) \right) |\ x\in \mathcal {X}\right\} \), where \(\mu _{\bar{A}}\): \(\mathcal {X}\) \(\rightarrow \left[ 0,1\right] \) is the membership degree of x in \(\mathcal {X}\), \(\nu _{\bar{A}}: \mathcal {X}\rightarrow \left[ 0,1\right] \) is the non-membership degree of x in \(\mathcal {X}\), and \(0\le \mu _{\bar{A}}\left( x\right) +\nu _{\bar{A}}\left( x\right) \le 1\). We illustrate the acquisition of membership and non-membership degrees through the following examples.
3.1.1 Intuitionistic Fuzzy Membership Degree
In the random mapping feature space, the membership degree of samples is determined by the distance between samples and the class center, i.e.,
where \(1 \le i\le N\) and \(\varrho >0\) is an adjustable parameter in the random mapping feature space.
Example 1
Let \(h(x_*) = (0.91, 0.27, 0.21, 0.22, 0.23)\), \(t_*=+1\), \({\widetilde{\mathcal {C}}}^+ = (0.80, 0.52, 0.40, 0.57, 0.43)\) be the center of the positive class, and \(r^+ = 0.87\) be the radius of the positive class. The distance \(\left\| h(x_*)-{\widetilde{\mathcal {C}}}^+\right\| \approx 0.5227\), so, according to Eq. (8) with \(\varrho ={10}^{-7}\), \(\mu _*=1-\frac{0.5227}{0.87+{10}^{-7}}=0.3992\).
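The arithmetic of Example 1 can be reproduced with a small verification script; the membership formula used below is the one implied by Eq. (8) and the example values:

```python
import numpy as np

h_x = np.array([0.91, 0.27, 0.21, 0.22, 0.23])
center_pos = np.array([0.80, 0.52, 0.40, 0.57, 0.43])
r_pos, varrho = 0.87, 1e-7

dist = np.linalg.norm(h_x - center_pos)   # ~0.5227
mu = 1.0 - dist / (r_pos + varrho)        # ~0.3992
print(round(dist, 4), round(mu, 4))
```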
3.1.2 Intuitionistic Fuzzy Non-membership Degree
We can effectively capture the correlation between \(x_i\) and all heterogeneous samples in its neighborhood by using the KNN method, i.e.,
where \(KNN\left( h\left( x_i\right) \right) \) is used to represent the K nearest neighbors of \(x_i\) in the random mapping feature space.
The non-membership degree \(\nu _i\) is defined as:
and \(0\le \mu _i+\nu _i\le 1\).
Example 2
Let \(K = 5\) and \(\rho \left( x_*\right) = \frac{4}{5}\). According to Eq. (10), \(\nu _*=\left( 1-0.3992\right) \times \frac{4}{5}=0.4806\).
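A sketch of how \(\rho (x_i)\) and the non-membership degree can be computed with KNN. It assumes, as Example 2 suggests, that \(\rho (x_i)\) is the fraction of heterogeneous (opposite-label) samples among the K nearest neighbors and that Eq. (10) takes the form \(\nu _i=(1-\mu _i)\rho (x_i)\); the exact equations are not reproduced above, so this is an inferred reading:

```python
import numpy as np

def non_membership(H, t, mu, K=5):
    """nu_i = (1 - mu_i) * rho(x_i), with rho = fraction of heterogeneous K nearest neighbors."""
    N = len(t)
    nu = np.empty(N)
    for i in range(N):
        d = np.linalg.norm(H - H[i], axis=1)
        d[i] = np.inf                          # exclude the sample itself
        knn = np.argsort(d)[:K]                # indices of the K nearest neighbors
        rho = np.mean(t[knn] != t[i])          # proportion of opposite-label neighbors
        nu[i] = (1.0 - mu[i]) * rho
    return nu
```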
3.1.3 The Score Function
We construct an IFS \(\breve{S}=\left\{ \left( x_1,t_1,\mu _1,\nu _1\right) ,\left( x_2,t_2,\mu _2,\nu _2\right) ,\ldots ,\left( x_N,t_N,\mu _N,\nu _N\right) \right\} \). According to \(\breve{S}\), we construct the score value (SV) as follows:
where \(s_i=\mu _i\) indicates that \(x_i\) is a correctly classified sample; \(s_i=0\) indicates that \(x_i\) is noise; and \(s_i=\frac{1-\nu _i}{2-\mu _i-\nu _i}\) indicates that \(x_i\) is a support vector of the corresponding class, not noise.
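A sketch of the score value computation follows. The three branches match the description above; the case conditions (\(\nu _i=0\), \(\mu _i\le \nu _i\), otherwise) follow the usual intuitionistic-fuzzy scoring convention [38] and are an assumption here, since Eq. (11) is not reproduced:

```python
def score_value(mu, nu):
    """SV per Eq. (11): mu_i (clean sample), 0 (noise), or (1-nu)/(2-mu-nu) (boundary SV)."""
    s = []
    for m, v in zip(mu, nu):
        if v == 0.0:            # no heterogeneous neighbors: correctly classified sample
            s.append(m)
        elif m <= v:            # assumed condition: dominated by non-membership -> noise
            s.append(0.0)
        else:                   # boundary sample kept as a (down-weighted) support vector
            s.append((1.0 - v) / (2.0 - m - v))
    return s
```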
3.2 Linear Case
Unlike FELM [32] which uses the least squares loss function, TPin-IFELM employs the truncated pinball loss function, which not only makes the model robust to the noises but also preserves the sparsity. The truncated pinball loss function (see Fig. 3) is as follows:
where \(0\le \tau \le 1\), \(\varsigma >0\) is a preset value, and t is the label of x.
As shown in Fig. 3, the truncated pinball loss function takes into account the advantages of the hinge loss function and pinball loss function, so it has noise robustness and sparsity.
Equation (12) can be decomposed as follows:
where \(H_{1+\tau }\left( 1-tf\left( x\right) \right) = \left( 1+\tau \right) \max \left( 0,1-tf(x)\right) \) and \(H_\tau \left( 1-tf\left( x\right) +\varsigma \right) = \tau \max \left( 0,1-tf\left( x\right) +\varsigma \right) \).
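A sketch of the truncated pinball loss evaluated through the decomposition in Eq. (13); up to an additive constant this matches the piecewise shape sketched in Fig. 3:

```python
def truncated_pinball(u, tau=0.5, varsigma=0.5):
    """P_{tau,varsigma}(u) = (1+tau)*max(0, u) - tau*max(0, u + varsigma), with u = 1 - t*f(x)."""
    return (1.0 + tau) * max(0.0, u) - tau * max(0.0, u + varsigma)

# The loss is zero for u < -varsigma (sparsity region), has slope -tau on [-varsigma, 0)
# (pinball-like behavior), and slope 1 for u >= 0 (hinge-like penalty).
```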
We replace the least squares loss function in Eq. (6) with the truncated pinball loss and employ the score value for each sample as follows:
where c is the penalty parameter.
The gradient \(\nabla _\beta \left( J\left( \beta \right) \right) \) of \(J\left( \beta \right) \) is as follows:
It can be proved that the minimum of (14) with respect to \(\beta \) should satisfy the following condition
The function \(J\left( \beta \right) \) in (14) can be decomposed into the sum of the convex function \(J_{vex}(\beta )\) and the concave function \(J_{cav}(\beta )\), i.e.,
Obviously, (17) is a non-differentiable non-convex optimization problem, which can be solved by the CCCP. The detailed procedure of the CCCP is shown in Algorithm 1.
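A generic sketch of the CCCP iteration used here: the concave part is replaced by its linearization at the current iterate, and the resulting convex subproblem is solved repeatedly until the iterates stabilize. The solver for the convex subproblem is passed in as a placeholder (in our setting it corresponds to the box-constrained QP derived below):

```python
import numpy as np

def cccp(beta0, solve_convex_subproblem, grad_concave, max_iter=50, tol=1e-5):
    """Concave-convex procedure for J(beta) = J_vex(beta) + J_cav(beta).

    At iteration k the concave part is linearized at beta_k, and the convex
    subproblem  min_beta  J_vex(beta) + grad_J_cav(beta_k)^T beta  is solved.
    """
    beta = beta0
    for _ in range(max_iter):
        g = grad_concave(beta)                  # gradient of the concave part at beta_k
        beta_new = solve_convex_subproblem(g)   # e.g. solve the subproblem (18) via (22)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```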
Using the CCCP to solve (17), the subproblem of the kth iteration can be expressed as:
where
By introducing the slack variables \(\xi =\left[ \xi _1,\ldots ,\xi _N\right] ^{\textrm{T}}\), (18) is equivalent to the following form:
According to the Lagrange method, we can obtain the following dual problem. The detailed solution process is shown in Appendix A.
where \(Q=TH{H}^{\textrm{T}}T\).
Set \(\lambda =\ \alpha -\delta \), and the lower and upper bounds of the box constraint are defined as \(\mathfrak {L}=-\delta \in \mathfrak {R}^N\) and \(\mathfrak {U}=\left( 1+\tau \right) cS-\ \delta \in \mathfrak {R}^N\). Then, (21) is equivalent to
The label t of the unknown sample x is determined by the following decision function.
The complete process of linear TPin-IFELM is shown in Algorithm 2.
3.3 Nonlinear Case
In the ELM kernel space [2, 3, 41], the membership degree of the sample is defined by
and the non-membership degree of the sample is defined as
where \(\mathcal {K}_{ELM}\left( x_i,x_i\right) = h\left( x_i\right) \cdot h\left( x_i\right) \), \(\mathcal {K}_{ELM}(x_i,x_j)= h\left( x_i\right) \cdot h(x_j)\), \(1\le i\le N\),
and
According to Eq. (24) and Eq. (25), the score function is defined as follows:
The original problem of nonlinear TPin-IFELM can be expressed as
Similar to linear TPin-IFELM, (27) can be solved by CCCP. In the kth iteration, the subproblem of (27) can be expressed as
where \(\varpi \) is the output weight vector in the ELM kernel space, and
The dual problem of (28) is as follows:
where \(\widetilde{Q}=T\Omega _{ELM}T\) and \(\Omega _{ELM}=H{H}^{\textrm{T}}\in \mathfrak {R}^{N\times N}\) whose element \({\Omega _{ELM}}_{ij}=\mathcal {K}_{ELM}(x_i,x_j)\).
Similar to linear TPin-IFELM, Eq. (30) is equivalent to
where \(\mathfrak {L}^\Phi =-\delta ^\Phi \in \mathfrak {R}^N\) and \(\mathfrak {U}^\Phi =\left( 1+\tau \right) cS^\Phi -\ \delta ^\Phi \in \mathfrak {R}^N\).
For the unknown sample x, the decision function of nonlinear TPin-IFELM is
The complete process of nonlinear TPin-IFELM is shown in Algorithm 3.
3.4 The Discussion
In this subsection, we discuss the relationship between TPin-IFELM and FELM. Similar to ELM, both TPin-IFELM and FELM randomly assign the input weights and the biases of the hidden layer, and then obtain the hidden layer output matrix through the activation function.
In order to suppress the negative effects of noises, FELM only uses the membership degree for each sample, while TPin-IFELM employs the membership and non-membership degrees based on the local information of samples. To further reduce the interference of noises, TPin-IFELM uses the truncated pinball loss function to not only maintain sparsity and robustness but also to enhance the re-sampling stability.
4 Properties of the TPin-IFELM
In this section, we analyze the theoretical properties of TPin-IFELM, including noise insensitivity, sparsity, weight scatter minimization, and misclassification error minimization.
4.1 Noise Insensitivity and Sparsity
In this subsection, we discuss the noise insensitivity and sparsity of TPin-IFELM. The sub-gradient function of (12) is
Equation (16) can be rewritten as:
where \(\textbf{0}\in \mathfrak {R}^N\) is a column vector whose elements are all zero.
For given \(\beta \), the index set can be partitioned into five sets,
Since \(\partial P_{\tau ,\varsigma }\left( 1-t_if\left( x_i\right) \right) =0\) when a sample is located in \(\mathcal {S}_0^\beta \), the samples in \(\mathcal {S}_0^\beta \) make no contribution to the calculation of \(\beta \). Therefore, \(\mathcal {S}_0^\beta \) is closely related to the sparsity of (14); in other words, the parameter \(\varsigma \) controls the number of samples in \(\mathcal {S}_0^\beta \). The smaller the value of \(\varsigma \), the more samples fall into \(\mathcal {S}_0^\beta \) and the sparser (14) becomes. In particular, when \(\varsigma \rightarrow 0\), the truncated pinball loss function reduces to the hinge loss function, which is very sensitive to noise. On the contrary, the larger the value of \(\varsigma \), the fewer samples fall into \(\mathcal {S}_0^\beta \), and (14) becomes robust to noise but gradually loses its sparsity. In particular, when \(\varsigma \rightarrow +\infty \), the truncated pinball loss function degenerates into the pinball loss, and the sparsity is completely lost.
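One consistent reading of the partition (35), under the assumption that the five sets are determined by the value of \(u_i = 1-t_if(x_i)\) relative to \(0\) and \(-\varsigma \) (the exact definition in (35) is not reproduced here, so this is a sketch only):

```python
def partition_samples(u, varsigma):
    """Split indices by u_i = 1 - t_i f(x_i): an assumed reading of the sets in (35)."""
    S0 = [i for i, ui in enumerate(u) if ui < -varsigma]       # zero loss: sparsity region
    S1 = [i for i, ui in enumerate(u) if ui == -varsigma]      # kink at -varsigma
    S2 = [i for i, ui in enumerate(u) if -varsigma < ui < 0]   # pinball (slope -tau) region
    S3 = [i for i, ui in enumerate(u) if ui == 0]              # on the positive/negative hyperplanes
    S4 = [i for i, ui in enumerate(u) if ui > 0]               # hinge (slope 1) region
    return S0, S1, S2, S3, S4
```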
According to the five sets \(\mathcal {S}_0^\beta \), \(\mathcal {S}_1^\beta \), \(\mathcal {S}_2^\beta \), \(\mathcal {S}_3^\beta \) and \(\mathcal {S}_4^\beta \) of (35), the optimality condition can be written as the existence of \(\psi _i\in \left[ -\tau ,0\right] \) and \(\zeta _i\in \left[ -\tau ,1\right] \) such that
The number of samples in \(\mathcal {S}_1^\beta \) and \(\mathcal {S}_3^\beta \) is much smaller than that in \(\mathcal {S}_0^\beta \), \(\mathcal {S}_2^\beta \) and \(\mathcal {S}_4^\beta \), and the samples in \(\mathcal {S}_1^\beta \) and \(\mathcal {S}_3^\beta \) make little contribution to Eq. (36). Therefore, the main concern here is the sets \(\mathcal {S}_0^\beta \), \(\mathcal {S}_2^\beta \) and \(\mathcal {S}_4^\beta \). When the parameter \(\varsigma \) is fixed to a suitable value, the parameter \(\tau \) controls the number of samples in \(\mathcal {S}_0^\beta \), \(\mathcal {S}_2^\beta \) and \(\mathcal {S}_4^\beta \), and thus affects the sparsity of (14). When \(\tau \) is large, such as \(\tau =1\), these three sets contain many samples, so (14) is robust to feature noise. When \(\tau \) is very small, such as \(\tau =0.1\), there are few samples in \(\mathcal {S}_4^\beta \), and (14) is more sensitive to noise. In particular, when \(\tau =0\), there are no samples or only a few samples in \(\mathcal {S}_4^\beta \); in this case, feature noise around the decision boundary has a significant negative effect on the constructed model. Since the total number of samples is fixed, the smaller \(\tau \) is, the fewer samples lie in \(\mathcal {S}_4^\beta \), the more samples lie in \(\mathcal {S}_0^\beta \), and the better the sparsity of (14).
In summary, the appropriate \(\tau \) and \(\varsigma \) are chosen to enable TPin-IFELM to better balance noise insensitivity and sparsity.
4.2 Weight Scatter and Misclassification Error Minimization
The mechanism of TPin-IFELM can also be explained by the weight scatter and misclassification error minimization. The positive hyperplane \(f_+\left( x\right) :{\beta }^{\textrm{T}} h\left( x\right) =1\) and the negative hyperplane \(f_-\left( x\right) :{\beta }^{\textrm{T}} h\left( x\right) =-1\) are constructed by the samples in \(\mathcal {S}_3^\beta \). The distance between positive and negative hyperplanes is \(\frac{2}{\left\| {\beta }\right\| }\). We can measure the weight scatter in terms of the sum of the distances of a given point from similar samples. In the random mapping feature space related to \(\beta \), the weight scatter of \(x_{i_0}\) can be defined as
If \(x_{i_0}\in \mathcal {S}_3^\beta \cap \mathcal {X}^+\), i.e., \({{\beta }^{\textrm{T}}}{h\left( x_{i_0}\right) }=1\) and \(t_{i_0}=1\), then
If \(x_{i_0}\in \mathcal {S}_3^\beta \cap \mathcal {X}^-\), i.e., \({{\beta }^{\textrm{T}}}{h\left( x_{i_0}\right) }=-1\) and \(t_{i_0}=-1\), then
Therefore,
can be interpreted as maximizing the distance between hyperplanes \(f_+\left( x\right) \) and \(f_-\left( x\right) \) and meanwhile minimizing weight scatter.
In (14), (40) is extended to \(P_{\tau ,\varsigma }\). The misclassification term
is introduced into Eq. (40), i.e.,
We obtain TPin-IFELM with \(C_1=c\left( 1+\tau \right) \) and \(C_2=c\tau \). Thus, TPin-IFELM can minimize both the weight scatter and misclassification errors, simultaneously.
5 Experiments
In this section, we verify the effectiveness of TPin-IFELM through a series of experiments on an artificial dataset and benchmark datasets (see the Notes section for dataset sources).
5.1 Experimental Configuration
In order to evaluate the effectiveness of TPin-IFELM, we compare it with eight other state-of-the-art algorithms. TPin-IFELM with SFA, which replaces the score function SV in TPin-IFELM with the score function SFA of PIFTSVM, contains four parameters c, L, \(\tau \), and \(\varsigma \), while TPin-IFELM contains five parameters c, L, \(\tau \), \(\varsigma \) and K. To ensure the objectivity of the experiments, for datasets with fewer than 2000 samples, the penalty parameter c of TPin-IFELM and TPin-IFELM with SFA, the penalty parameter C of OELM, RELM, and FELM, and the penalty parameters \(C_1\) and \(C_2\) of TELM, SPTELM, and PIFTSVM are searched from the set \(\left\{ 2^i|i=-10,-8,\ldots ,8,10\right\} \), and the number of hidden layer nodes L for these algorithms is searched from \(\left\{ 50,100,200,500\right\} \). For datasets with 2000 or more samples, the penalty parameters c, C, \(C_1\) and \(C_2\) are searched from \(\left\{ 2^i|i=-10,-6,\ldots ,6,10\right\} \), and the number of hidden layer nodes L is searched from \(\left\{ 50,100,200\right\} \). \(\tau \) and \(\varsigma \) are searched from \(\left\{ 0.25,0.5,0.75\right\} \), \(\varepsilon \) is searched from \(\{0,0.2,0.5\}\), and for TPin-IFELM, the number of nearest neighbors K is searched from \(\left\{ 1,3,\ldots ,20\right\} \).
We implement all algorithms in MATLAB (R2018a). The experimental environment is a workstation with an 11th Gen Intel Core i5-11400H (2.70 GHz) processor and 16 GB of RAM. We use quadprog to solve the quadratic programming problems and use three evaluation metrics to assess the classification performance: accuracy (ACC), the area under the ROC curve (AUC), and the \(F_1\)-measure \((F_1)\).
where FN denotes the number of false negatives, FP denotes the number of false positives, TN denotes the number of true negatives and TP denotes the number of true positives.
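The standard definitions of these metrics, which we take to be the ones intended here (AUC is computed as the area under the ROC curve rather than from a closed-form expression), are:

\[
\mathrm{ACC}=\frac{TP+TN}{TP+TN+FP+FN},\qquad
F_1=\frac{2TP}{2TP+FP+FN}=\frac{2\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},
\]

with \(\mathrm{Precision}=\frac{TP}{TP+FP}\) and \(\mathrm{Recall}=\frac{TP}{TP+FN}\).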
5.2 Experiments on the Artificial Dataset
To verify the robustness and sparsity of TPin-IFELM, we conduct comparative experiments on an artificial dataset with two-dimensional features. The training and test sets consist of 200 and 50 samples, respectively. The positive and negative samples are generated from the Gaussian distributions \(\mathcal {X}^+\sim \mathcal {N}\left( \mathcal {V}_1,\Sigma _1\right) \) and \(\mathcal {X}^-\sim \mathcal {N}\left( \mathcal {V}_2,\Sigma _2\right) \), respectively, where \(\mathcal {V}_1=\left[ 1,1\right] ^{\textrm{T}}\), \(\mathcal {V}_2=\left[ -1,-1\right] ^{\textrm{T}}\) and \(\Sigma _1=\Sigma _2=\begin{bmatrix}1&0\\0&1\end{bmatrix}\).
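A sketch of the data generation described above; a fixed seed and an even split per class are assumptions added for reproducibility, and the 200/50 split is taken from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
mean_pos, mean_neg = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
cov = np.eye(2)                                   # identity covariance for both classes

def make_split(n):
    n_pos = n // 2
    X = np.vstack([rng.multivariate_normal(mean_pos, cov, n_pos),
                   rng.multivariate_normal(mean_neg, cov, n - n_pos)])
    t = np.hstack([np.ones(n_pos), -np.ones(n - n_pos)])
    return X, t

X_train, t_train = make_split(200)
X_test, t_test = make_split(50)
```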
As shown in Fig. 4, the red “+” and blue “\(\times \)” denote the positive training samples and the negative training samples, respectively. The pink “+” and green “\(\times \)” denote the positive test samples and the negative test samples, respectively. The support vectors are circled by “\(\circ \)”, and the noises identified by the algorithm are framed by black “\(\diamond \)”. We can see that compared with FELM, TPin-IFELM with SFA and TPin-IFELM use both the membership and non-membership degrees and the truncated pinball loss function, so they can more effectively reduce the negative effect of noises. The number of support vectors of TPin-IFELM with SFA and TPin-IFELM is 33% and 29% of the total number of samples, respectively. Thus, compared with FELM, TPin-IFELM with SFA and TPin-IFELM are sparse. Table 1 shows the experimental results of FELM, TPin-IFELM with SFA, and TPin-IFELM on the artificial dataset, and the best results of each evaluation indicator are shown in bold. As shown in Table 1, TPin-IFELM is superior to FELM and TPin-IFELM with SFA in terms of ACC and AUC and is second only to TPin-IFELM with SFA in terms of \(F_1\).
5.3 Experiments on the Benchmark Datasets
To evaluate the effectiveness and robustness of TPin-IFELM, we conduct comparative experiments on 15 benchmark datasets. The detailed characteristics of the datasets are shown in Table 2, where #Samples, #Positive samples, #Negative samples, and #Features denote the number of samples, the number of positive samples, the number of negative samples and the number of features, respectively.
In order to verify the classification performance of TPin-IFELM against the eight comparison algorithms, we conduct extensive experiments on fifteen benchmark datasets; Appendix B provides additional experimental results. Unlike the other seven comparison algorithms, TPin-IFELM with SFA and TPin-IFELM employ the membership and non-membership degrees to effectively identify the role of each sample in the classification process. As shown in Tables 7, 8 and 9, in terms of the average rank, each evaluation metric of TPin-IFELM is superior to that of the other eight algorithms, and the ACC and AUC of TPin-IFELM with SFA are second only to those of TPin-IFELM.
Noise is commonly present in real datasets and can reduce the classification performance of algorithms. To demonstrate the robustness of TPin-IFELM, we conduct label-noise experiments on the 15 benchmark datasets: we randomly select 50% of the training samples and add label noise to them. The experimental results are shown in Tables 10, 11 and 12. We can observe that all algorithms are negatively affected by the samples with label noise. However, TPin-IFELM is less disturbed by label noise than the other eight comparison algorithms and is superior to them on most datasets. For classification problems with both label noise and feature noise, we add Gaussian noise [39] following the normal distribution \(N\left( 0,\sigma ^2\right) \) with \(\sigma =0.5\) to the training set to form a training set with feature noise, and then randomly select 50% of the training samples as samples with label noise. Tables 3, 4 and 5 show the experimental results, with the best results for each dataset shown in bold. TPin-IFELM with SFA and TPin-IFELM are less disturbed by label and feature noise than the other seven algorithms and are superior to them on most datasets.
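A sketch of the two noise settings used in these experiments: corrupting the labels of a randomly chosen fraction of the training samples (for the binary labels here we take this to mean flipping the sign, an assumption) and adding zero-mean Gaussian feature noise with \(\sigma =0.5\); applying both reproduces the label-plus-feature-noise setting:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_label_noise(t, ratio=0.5):
    t_noisy = t.copy()
    idx = rng.choice(len(t), size=int(ratio * len(t)), replace=False)
    t_noisy[idx] = -t_noisy[idx]                       # flip the labels of the selected samples
    return t_noisy

def add_feature_noise(X, sigma=0.5):
    return X + rng.normal(0.0, sigma, size=X.shape)    # Gaussian noise N(0, sigma^2)
```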
From the above noise experimental results, we can observe that ELM, OELM, RELM, TELM, and SPTELM do not consider the membership degree of the samples to reduce the negative impact of the noise, resulting in a significant decrease in their classification performance. Different from FELM, TPin-IFELM with SFA and TPin-IFELM employ the membership and non-membership degrees to effectively identify the role of the samples and the noise in the classification process. At the same time, they introduce the truncated pinball loss function to enhance the robustness of the model. Compared to TPin-IFELM with SFA, TPin-IFELM uses the local information of the samples to construct the more appropriate membership and non-membership degrees of the samples. Therefore, TPin-IFELM can better solve the classification problems with noise than the other eight comparison algorithms.
5.4 Statistical Analysis
From Tables 7, 8, 9, 10, 11 and 12 and Tables 3, 4 and 5, we can observe that no algorithm outperforms all the others on all datasets. In this subsection, we use the Friedman test [42] to analyze these algorithms statistically. Given \(\mathfrak {K}\) comparison algorithms and \(\mathcal {N}\) datasets, let \(r_i^j\) denote the rank of the j-th algorithm on the i-th dataset, and let \(R_j=\frac{1}{\mathcal {N}}\sum _{i=1}^{\mathcal {N}}r_i^j\) denote the average rank of the j-th algorithm. The Friedman statistic is \(F_F=\frac{\left( \mathcal {N}-1\right) \chi _F^2}{\mathcal {N} \left( \mathfrak {K}-1\right) -\chi _F^2}\sim F\left( \mathfrak {K}-1,\left( \mathfrak {K}-1\right) \left( \mathcal {N}-1\right) \right) \), where \(\chi _F^2=\frac{12\mathcal {N}}{\mathfrak {K} \left( \mathfrak {K}+1\right) }\left[ \sum _{j=1}^{\mathfrak {K}}{R_j^2 -\frac{\mathfrak {K}\left( \mathfrak {K}+1\right) ^2}{4}}\right] \). Table 6 shows the Friedman test results on the datasets without noise and with noise. We observe that the Friedman statistics are much larger than the critical value, so the null hypothesis that all algorithms have the same classification performance is rejected, i.e., there is a significant difference in classification performance among the algorithms.
The difference between TPin-IFELM and the other eight algorithms is further compared by using the Nemenyi test [42]. The average rank difference between pairs of algorithms is compared with the critical difference (CD), where \(\textrm{CD}=q_\alpha \sqrt{\frac{\mathfrak {K}\left( \mathfrak {K}+1\right) }{6\mathcal {N}}}\). For the Nemenyi test with nine algorithms, \(q_\alpha =3.102\) at the significance level \(\alpha =0.05\); thus, for the experiments without noise, \(\textrm{CD}=3.102\ \left( \mathfrak {K}=9,\mathcal {N}=15\right) \), and for the experiments with noise, \(\textrm{CD}=1.5510\ (\mathfrak {K}=9, \mathcal {N}=60)\). The CD diagrams of all evaluation metrics with and without noise are shown in Fig. 5. We observe that TPin-IFELM is superior to the other eight algorithms on each evaluation metric.
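A sketch that reproduces these statistics from the formulas above, given a rank matrix r (rows: datasets, columns: algorithms); the CD values quoted in the text follow from \(q_{0.05}=3.102\) for nine algorithms:

```python
import numpy as np

def friedman_nemenyi(r, q_alpha=3.102):
    """r: N x K matrix of ranks (N datasets, K algorithms)."""
    N, K = r.shape
    R = r.mean(axis=0)                                            # average rank per algorithm
    chi2_F = 12.0 * N / (K * (K + 1)) * (np.sum(R**2) - K * (K + 1)**2 / 4.0)
    F_F = (N - 1) * chi2_F / (N * (K - 1) - chi2_F)               # Friedman statistic
    CD = q_alpha * np.sqrt(K * (K + 1) / (6.0 * N))               # Nemenyi critical difference
    return F_F, CD

# Example: K = 9 algorithms with N = 15 datasets gives CD = 3.102; N = 60 gives CD = 1.551.
```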
5.5 Sensitivity Analysis
To analyze the parameter sensitivity of TPin-IFELM and the performance of methods for obtaining sample structure information, we conduct experiments on the benchmark datasets. The main parameters of TPin-IFELM include the penalty parameter c, the number L of hidden layer nodes, the parameter \(\tau \), the parameter K, and the parameter \(\varsigma \). The methods for obtaining sample structure information include KNN, K-Means, and Ward Linkage.
5.5.1 Methods for Obtaining Sample Structure Information
In order to investigate the impact of different methods of obtaining sample structure information on TPin-IFELM, we use KNN, K-Means, and Ward Linkage to extract the local information of samples and conduct experiments on the Sonar and Colon-cancer datasets. The comparative results are shown in Fig. 6, where TPin-IFELM using KNN achieves the best performance. Compared with K-Means and Ward Linkage, KNN can more effectively capture the correlation between a sample and the heterogeneous samples in its neighborhood, thus obtaining valuable local information.
5.5.2 Parameters c and L
To analyze the sensitivity of TPin-IFELM to c and L, we perform parameter sensitivity experiments on the Heart and Ionosphere datasets. The parameter c is searched from \(\{2^i\mid i=-10,-8,\ldots ,8,10\}\), the parameter L is searched from \(\{50,100,200,500\}\), and the other parameters are fixed. From Fig. 7, we can observe that the ACC, AUC, and \(F_1\) of TPin-IFELM are higher when the values of c and L are larger. In general, TPin-IFELM is sensitive to the parameter c and is less affected by changes in L.
5.5.3 Parameters \(\tau \), \(\varsigma \) and K
To analyze the effects of the parameters \(\tau \), \(\varsigma \), and K on the classification performance of TPin-IFELM, we conduct experiments on the Colon-cancer, Sonar, Heart and Ionosphere datasets without noise and with noise. Two types of noise are considered: 30% label noise, and 50% label noise combined with feature noise of \(\sigma = 0.5\). As shown in Figs. 8, 9 and 10, for samples without noise, TPin-IFELM is minimally affected by the parameter \(\tau \), except on the Colon-cancer dataset; however, for samples with noise, TPin-IFELM is strongly affected by \(\tau \). As shown in Figs. 11, 12 and 13, for samples without noise, TPin-IFELM is minimally affected by the parameter \(\varsigma \); however, for samples with noise, TPin-IFELM is sensitive to \(\varsigma \). As shown in Figs. 14, 15 and 16, TPin-IFELM is sensitive to the parameter K.
6 Conclusion
Inspired by intuitionistic fuzzy theory and the truncated pinball loss, we propose a novel classification model in this paper. TPin-IFELM employs the KNN method to obtain the local information of samples, from which more suitable membership and non-membership degrees are constructed. TPin-IFELM exploits the membership and non-membership degrees to effectively identify whether boundary samples are noise and uses the truncated pinball loss function, which makes it more robust and sparse. Extensive experiments verify the effectiveness of TPin-IFELM: compared with state-of-the-art comparison algorithms, it achieves superior classification performance. In future work, we will extend the proposed model to multi-view classification problems.
Data availability
The datasets generated during and/or analyzed during the current study are available in the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets.php, and the LIBSVM data repository, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Notes
The datasets are available at https://archive.ics.uci.edu/ml/datasets.php and https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
References
Huang G, Huang G, Song S, You K (2015) Trends in extreme learning machines: a review. Neural Netw 61:32–48
Huang G, Zhou H, Ding X, Zhang R (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans Cybernet 42:513–529
Huang G, Ding X, Zhou H (2010) Optimization method based extreme learning machine for classification. Neurocomputing 74(1–3):155–163
Sun P, Yang L (2022) Generalized eigenvalue extreme learning machine for classification. Appl Intell 52(6):6662–6691
Ahuja B, Vishwakarma VP (2021) Deterministic multi-kernel based extreme learning machine for pattern classification. Expert Syst Appl 183:115308
Ren L, Liu J, Gao Y, Kong X, Zheng C (2021) Kernel risk-sensitive loss based hyper-graph regularized robust extreme learning machine and its semi-supervised extension for classification. Knowl-Based Syst 227:107226
Wong H, Leung H, Leung C, Wong E (2022) Noise/fault aware regularization for incremental learning in extreme learning machines. Neurocomputing 486:200–214
Luo J, Wong C, Vong C (2021) Multinomial bayesian extreme learning machine for sparse and accurate classification model. Neurocomputing 423:24–33
Liu Z, Jin W, Mu Y (2020) Variances-constrained weighted extreme learning machine for imbalanced classification. Neurocomputing 403:45–52
Zong W, Huang G, Chen Y (2013) Weighted extreme learning machine for imbalance learning. Neurocomputing 101:229–242
Li Y, Zhang J, Zhang S, Xiao W, Zhang Z (2022) Multi-objective optimization-based adaptive class-specific cost extreme learning machine for imbalanced classification. Neurocomputing 496:107–120
Xiao W, Zhang J, Li Y, Zhang S, Yang W (2017) Class-specific cost regulation extreme learning machine for imbalanced classification. Neurocomputing 261:70–82
Dutta AK, Qureshi B, Albagory Y, Alsanea M, Al Faraj M, Sait ARW (2023) Optimal weighted extreme learning machine for cybersecurity fake news classification. Comput Syst Sci Eng 44(3):2395–2409
Tummalapalli S, Kumar L, Neti LBM, Krishna A (2022) Detection of web service anti-patterns using weighted extreme learning machine. Comput Stand Interfaces 82:103621
El Bourakadi D, Yahyaouy A, Boumhidi J (2022) Improved extreme learning machine with autoencoder and particle swarm optimization for short-term wind power prediction. Neural Comput Appl 34(6):4643–4659
Xia J, Yang D, Zhou H, Chen Y, Zhang H, Liu T, Heidari AA, Chen H, Pan Z (2022) Evolving kernel extreme learning machine for medical diagnosis via a disperse foraging sine cosine algorithm. Comput Biol Med 141:105137
Lin Z, Gao Z, Ji H, Zhai R, Shen X, Mei T (2022) Classification of cervical cells leveraging simultaneous super-resolution and ordinal regression. Appl Soft Comput 115:108208
Gao Z, Hu Q, Xu X (2022) Condition monitoring and life prediction of the turning tool based on extreme learning machine and transfer learning. Neural Comput Appl 34(5):3399–3410
Wang Y, Li R, Chen Y (2021) Accurate elemental analysis of alloy samples with high repetition rate laser-ablation spark-induced breakdown spectroscopy coupled with particle swarm optimization-extreme learning machine. Spectrochim Acta Part B-Atomic Spectrosc 177:106077
Wu D, Wang X, Wu S (2022) A hybrid framework based on extreme learning machine, discrete wavelet transform, and autoencoder with feature penalty for stock prediction. Expert Syst Appl 207:118006
Wang GC, Zhang Q, Band SS, Dehghani M, Chau KW, Tho QT, Zhu S, Samadianfard S, Mosavi A (2022) Monthly and seasonal hydrological drought forecasting using multiple extreme learning machine models. Eng Appl Comput Fluid Mech 16(1):1364–1381
Wang L, Khishe M, Mohammadi M, Mahmoodzadeh A (2022) Extreme learning machine evolved by fuzzified hunger games search for energy and individual thermal comfort optimization. J Build Eng 60:105187
Al-Yaseen WL, Idrees AK, Almasoudy FH (2022) Wrapper feature selection method based differential evolution and extreme learning machine for intrusion detection system. Pattern Recogn 132:108912
Ren Z, Yang L (2018) Correntropy-based robust extreme learning machine for classification. Neurocomputing 313:74–84
Wang Y, Yang L, Yuan C (2019) A robust outlier control framework for classification designed with family of homotopy loss function. Neural Netw 112:41–53
Ren Z, Yang L (2019) Robust extreme learning machines with different loss functions. Neural Process Lett 49(3):1543–1565
Shen J, Ma J (2019) Sparse twin extreme learning machine with epsilon-insensitive zone pinball loss. IEEE Access 7:112067–112078
Huang Z, Li J (2022) Discernibility measures for fuzzy \(\beta \) covering and their application. IEEE Trans Cybernet 52(9):9722–9735
Huang Z, Li J, Qian Y (2022) Noise-tolerant fuzzy-\(\beta \)-covering-based multigranulation rough sets and feature subset selection. IEEE Trans Fuzzy Syst 30(7):2721–2735
Huang Z, Li J. Noise-tolerant discrimination indexes for fuzzy \(\gamma \) covering and feature subset selection. IEEE Trans Neural Netw Learn Syst (Early Access)
Lin C, Wang S (2002) Fuzzy support vector machines. IEEE Trans Neural Netw 13(2):464–471
Zhang W, Ji H (2013) Fuzzy extreme learning machine for classification. Electron Lett 49(7):448–449
Shen X, Niu L, Qi Z, Tian Y (2017) Support vector machine classifier with truncated pinball loss. Pattern Recogn 68:199–210
Wang H, Xu Y, Zhou Z (2021) Twin-parametric margin support vector machine with truncated pinball loss. Neural Comput Appl 33(8):3781–3798
Yuille A, Rangarajan A (2003) The concave-convex procedure. Neural Comput 15(4):915–936
Lipp T, Boyd S (2016) Variations and extension of the convex-concave procedure. Optim Eng 17(2):263–287
Huang G, Zhu Q, Siew C-K (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1–3):489–501
Rezvani S, Wang X, Pourpanah F (2019) Intuitionistic fuzzy twin support vector machines. IEEE Trans Fuzzy Syst 27(11):2140–2151
Liang Z, Zhang L (2022) Intuitionistic fuzzy twin support vector machines with the insensitive pinball loss. Appl Soft Comput 115:108231
Laxmi S, Gupta SK (2022) Multi-category intuitionistic fuzzy twin support vector machines with an application to plant leaf recognition. Eng Appl Artif Intell 110:104687
Wong CM, Vong CM, Wong PK, Cao J (2018) Kernel-based multilayer extreme learning machines for representation learning. IEEE Trans Neural Netw Learn Syst 29(3):757–762
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Acknowledgements
This work was supported in part by the Natural Science Foundation of Liaoning Province in China (2020-MS-281). We thank all anonymous reviewers for their helpful comments, which improved the quality of this paper.
Author information
Contributions
Conceptualization: QG, QA; Methodology: QG, QA; Writing-original draft preparation: QG; Writing-review and editing: QG, QA; Funding acquisition: QA; Supervision: QA, WW.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
All authors contributed to the conception and design of the study. All authors read and approved the final manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Process of Obtaining the Dual Problem of (20)
In this section, we focus on solving the Problem (20). For clarity, the iteration superscript \(k-1\) is removed. By introducing the Lagrangian multipliers \(\alpha \) and \(\theta \), the Lagrangian function of the original problem (20) is
where \(S=\left[ s_1,\ldots ,s_N\right] ^{\textrm{T}}\), \(\delta =\left[ \delta _1,\ldots ,\delta _N\right] ^{\textrm{T}}\), \(T=\left[ \begin{array}{ccc} t_1&{}&{}\\ &{}\ddots &{}\\ &{}&{}t_N\\ \end{array}\right] \), \(H=\left[ h\left( x_1\right) ,\ldots ,h\left( x_N\right) \right] ^{\textrm{T}}\in \mathfrak {R}^{N\times L}\), \(e=\left[ 1,\ldots ,1\right] ^{\textrm{T}}\in \mathfrak {R}^N\), \(\alpha =\left[ \alpha _1,\ldots ,\alpha _N\right] ^{\textrm{T}}\) and \(\theta =\left[ \theta _1,\ldots ,\theta _N\right] ^{\textrm{T}}\) are the Lagrangian multiplier vectors.
According to the KKT conditions, we can obtain
From (A2), we can obtain
According to Eq. (A3) and Eq. (A6), we can derive
By substituting Eq. (A3) and Eq. (A7) into Eq. (A1), we can obtain the following dual problem.
where \(Q=TH{H}^{\textrm{T}}T\).
Appendix B Additional Experiments
We present the experimental results in the noise-free environment and 50% label noise environment. The best results for each dataset are shown in bold.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.