Exploiting Conjugate Label Information for Multi-Instance
Partial-Label Learning

Wei Tang1,2    Weijia Zhang3    Min-Ling Zhang1,2
1School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
2Key Lab. of Computer Network and Information Integration (Southeast University), MoE, China
3School of Information and Physical Sciences, The University of Newcastle, NSW 2308, Australia
Corresponding author
Abstract

Multi-instance partial-label learning (MIPL) addresses scenarios where each training sample is represented as a multi-instance bag associated with a candidate label set containing one true label and several false positives. Existing MIPL algorithms have primarily focused on mapping multi-instance bags to candidate label sets for disambiguation, disregarding the intrinsic properties of the label space and the supervised information provided by non-candidate label sets. In this paper, we propose an algorithm named EliMipl, i.e., Exploiting conjugate Label Information for Multi-Instance Partial-Label learning, which exploits the conjugate label information to improve the disambiguation performance. To achieve this, we extract the label information embedded in both candidate and non-candidate label sets, incorporating the intrinsic properties of the label space. Experimental results obtained from benchmark and real-world datasets demonstrate the superiority of the proposed EliMipl over existing MIPL algorithms and other well-established partial-label learning algorithms.

1 Introduction

Weakly supervised learning has emerged as a powerful strategy in scenarios with limited annotated data. Based on label quality and quantity, weak supervision can be broadly categorized into three types: inaccurate, inexact, and incomplete supervision Zhou (2018). Inexact supervision refers to a coarse correspondence between instances and labels. To work with inexact supervision, there are two prevalent learning paradigms, i.e., multi-instance learning (MIL) Amores (2013); Carbonneau et al. (2018); Ilse et al. (2018); Zhang et al. (2022c, b) and partial-label learning (PLL) Cour et al. (2011); Lyu et al. (2020); Zhang et al. (2022a); He et al. (2022); Gong et al. (2022); Li et al. (2023). In MIL, a sample is represented as a bag of instances and associated with a single bag-level label, while the instance-level labels are inaccessible to the learner. In PLL, a sample is represented as a single instance and linked to a candidate label set, including one true label and multiple false positives. Therefore, MIL and PLL can be perceived as two sides of the same coin: inexact supervision within MIL manifests in the instance space, whereas inexact supervision appears in the label space within PLL.

[Figure 1 image: (a) crowd-sourced candidate labels over the categories $k_{1}$ to $k_{7}$, with a legend for ground-truth labels, false positive labels, non-candidate labels, and zero entries; (b) the complete label matrix $\boldsymbol{Y}$ ($m$ rows, $k=7$ columns) $=$ $\boldsymbol{Y_{F}}$ $+$ $\boldsymbol{Y_{T}}$ (sparse) $+$ the non-candidate label matrix $\boldsymbol{\bar{S}}$, where $\boldsymbol{Y_{F}}$ and $\boldsymbol{Y_{T}}$ together form the candidate label matrix $\boldsymbol{S}$.]
Figure 1: (a) A multi-instance bag is labeled with a candidate label set $\mathcal{S}=\{k_{1},k_{2},k_{5},k_{7}\}$. (b) The decomposition of the complete label matrix, where $m$ and $k$ represent the number of multi-instance bags and categories, respectively.

However, many tasks exhibit a phenomenon of dual inexact supervision, where ambiguity arises in both instance and label spaces. To work with the dual inexact supervision, Tang et al. Tang et al. (2024) introduced a learning paradigm known as multi-instance partial-label learning (MIPL) and developed a Gaussian Processes-based algorithm (MiplGp), which derives a bag-level predictor by aggregating predictions of all instances within the same bag. To capture global representations for multi-instance bags, an algorithm named DeMipl equipped with an attention mechanism is introduced Tang et al. (2023). The existing algorithms mainly operate in the instance space and only utilize the candidate label information.

The non-candidate label set plays a crucial role in MIPL. In histopathological image classification, images are commonly segmented into patches Campanella et al. (2019), and their labels may come from crowd-sourced annotators rather than expert pathologists Grote et al. (2019). Figure 1(a) illustrates that crowd-sourced annotators treat an image as a multi-instance bag $\boldsymbol{X}_{i}=\{\boldsymbol{x}_{i,1},\boldsymbol{x}_{i,2},\cdots,\boldsymbol{x}_{i,9}\}$ and provide a candidate label set $\mathcal{S}_{i}=\{k_{1},k_{2},k_{5},k_{7}\}$, whose candidate label matrix can be written as $\boldsymbol{S}_{i}=[1,1,0,0,1,0,1]$. Similarly, the non-candidate label set $\bar{\mathcal{S}}_{i}=\{k_{3},k_{4},k_{6}\}$ corresponds to the non-candidate label matrix $\boldsymbol{\bar{S}}_{i}=[0,0,1,1,0,1,0]$, indicating that $\boldsymbol{X}_{i}$ must not belong to categories $k_{3}$, $k_{4}$, or $k_{6}$. Therefore, we can extract exact supervision from the non-candidate label set. As depicted in Figure 1(b), we decompose a complete label matrix $\boldsymbol{Y}$ into a candidate label matrix $\boldsymbol{S}$ and a non-candidate label matrix $\boldsymbol{\bar{S}}$. Subsequently, $\boldsymbol{S}$ may be further decomposed into a false positive label matrix $\boldsymbol{Y_{F}}$ and a true label matrix $\boldsymbol{Y_{T}}$, i.e., $\boldsymbol{Y}=\boldsymbol{S}+\boldsymbol{\bar{S}}=\boldsymbol{Y_{F}}+\boldsymbol{Y_{T}}+\boldsymbol{\bar{S}}$. Notably, $\boldsymbol{Y_{T}}$ is sparse, as each row must have one and only one non-zero element. However, the current MIPL algorithms have predominantly concentrated on the mappings from multi-instance bags to $\boldsymbol{S}$, neglecting the sparsity of $\boldsymbol{Y_{T}}$ and the information from $\boldsymbol{\bar{S}}$.
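The decomposition for the single bag in Figure 1(a) can be sketched in a few lines of plain Python. This is only an illustration: the helper names `label_vectors` and `split_candidates` are hypothetical, and the true label ($k_{2}$ here) is an arbitrary choice, since it is hidden from the learner in practice.

```python
def label_vectors(candidate_set, k):
    """Return (S_i, S_bar_i) as 0/1 lists of length k; labels are 1-based."""
    s = [1 if c in candidate_set else 0 for c in range(1, k + 1)]
    s_bar = [1 - v for v in s]  # non-candidates are the complement
    return s, s_bar


def split_candidates(s, true_label):
    """Split S_i into (Y_F row, Y_T row) for a given true label.

    The true label is hidden during training; it is passed here only to
    illustrate the decomposition.
    """
    if s[true_label - 1] != 1:
        raise ValueError("the true label must be a candidate label")
    y_t = [1 if c == true_label else 0 for c in range(1, len(s) + 1)]
    y_f = [sv - tv for sv, tv in zip(s, y_t)]
    return y_f, y_t


s_i, s_bar_i = label_vectors({1, 2, 5, 7}, k=7)
y_f, y_t = split_candidates(s_i, true_label=2)  # k_2 is a hypothetical choice
row_sum = [f + t + n for f, t, n in zip(y_f, y_t, s_bar_i)]

print(s_i)      # [1, 1, 0, 0, 1, 0, 1]
print(s_bar_i)  # [0, 0, 1, 1, 0, 1, 0]
print(y_t)      # [0, 1, 0, 0, 0, 0, 0]
```

Each position is covered by exactly one of the three parts, and the true-label row has a single non-zero entry, which is the sparsity property referred to above.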

Figure 2: Predicted probabilities of DeMipl (left) and EliMipl (right) on a sample in the CRC-MIPL-Row dataset.

Figure 2 illustrates the predicted probabilities on the true label, along with the average predicted probabilities on each candidate label and non-candidate label. The left side depicts the probabilities of DeMipl, revealing that the average predicted probabilities on candidate and non-candidate labels lie close together. This observation indicates that DeMipl has difficulty discerning between candidate and non-candidate labels. To address this challenge, we introduce the concept of conjugate label information (CLI), encapsulating information from both candidate and non-candidate label sets, along with the sparsity of the true label matrix. The right side of Figure 2 shows the predicted probabilities when exploiting the CLI. It is evident that (a) the predicted probabilities on the true label exhibit a noticeable increase, (b) the average predicted probabilities on the non-candidate labels are reduced, and (c) the average probabilities on each candidate label and non-candidate label are distinctly separated. This suggests that the CLI helps train a more discriminative MIPL classifier.

In this paper, we present an algorithm named EliMipl, i.e., Exploiting conjugate Label Information for Multi-Instance Partial-Label learning. Firstly, we introduce a scaled additive attention mechanism to aggregate each multi-instance bag into a bag-level feature representation. Secondly, to enhance the utilization of candidate label information, we leverage the mappings from the bag-level features to the candidate label sets, coupled with the sparsity of the candidate label matrix. Lastly, to incorporate the non-candidate label information, we propose an inhibition loss to diminish the model’s predictions on the non-candidate labels. To the best of our knowledge, we are the first to introduce the scaled additive attention mechanism and the CLI in MIPL. Extensive experimental results demonstrate that EliMipl outperforms the state-of-the-art MIPL algorithms and the PLL algorithms.

The remainder is organized as follows. Firstly, we review related work in Section 2. Secondly, we present the proposed EliMipl in Section 3 and report the experimental results in Section 4. Lastly, we conclude this paper in Section 5.

[Figure 3 image: a multi-instance bag $\boldsymbol{X}_{i}$ passes through the feature extractor $\psi(\cdot)$ to give instance-level features $\boldsymbol{H}_{i}$, which the scaled additive attention aggregates into a bag-level feature $\boldsymbol{z}_{i}$; the classifier outputs probabilities $\boldsymbol{P}_{i}$, shown as probabilities on the candidate label set $\mathcal{S}_{i}$ and on the non-candidate label set $\bar{\mathcal{S}}_{i}$, together with the losses $\mathcal{L}_{\text{ma}}$, $\mathcal{L}_{\text{sp}}$, and $\mathcal{L}_{\text{in}}$.]
Figure 3: The pipeline of EliMipl, where $\mathcal{L}_{\text{ma}}$, $\mathcal{L}_{\text{sp}}$, and $\mathcal{L}_{\text{in}}$ refer to the mapping loss, sparsity loss, and inhibition loss, respectively.

2 Related Work

2.1 Multi-Instance Learning

Originating from drug activity prediction Dietterich et al. (1997), MIL has found extensive adoption in diverse applications, including text classification Zhou et al. (2009); Zhang (2021) and image annotation Wang et al. (2018). Contemporary deep MIL approaches predominantly rely on attention mechanisms Wang et al. (2022b); Chen et al. (2022); Tan et al. (2023). Ilse et al. Ilse et al. (2018) introduced attention mechanisms to aggregate each multi-instance bag into a feature vector. For multi-classification tasks, Shi et al. Shi et al. (2020) proposed a loss-based attention mechanism to learn instance-level weights, predictions, and bag-level predictions. Furthermore, researchers have explored the intrinsic attributes of attention mechanisms to improve performance Cui et al. (2023); Xiang et al. (2023). While these approaches achieve promising results in cases with exact bag-level labels, they face challenges in learning from ambiguous bag-level labels.

2.2 Partial-Label Learning

Recent PLL approaches heavily rely on deep learning techniques. Yao et al. Yao et al. (2020) employed deep convolutional neural networks for feature extraction and utilized the exponential moving average technique to uncover latent true labels. Building on the empirical risk minimization principle, Lv et al. Lv et al. (2020) devised a classifier-consistent risk estimator that progressively identifies true labels. Similarly, Feng et al. Feng et al. (2020) delved into the generation process of partial-labeled data, proposing both a risk-consistent approach and a classifier-consistent approach. Taking a more generalized stance, Wen et al. Wen et al. (2021) presented a weighted loss function capable of accommodating various methods through distinct weight assignments. Furthermore, Wu et al. Wu et al. (2022) proposed a supervised loss to constrain outputs on non-candidate labels, coupled with consistency regularization on candidate labels. While the supervised loss bears resemblance to our inhibition loss, our proposed CLI loss incorporates additional components, namely the mapping loss and the sparsity loss. Although these methods effectively learn from partial-labeled data, they lack the capability to manage multi-instance bags.

2.3 Multi-Instance Partial-Label Learning

In contrast to MIL and PLL, which each address only unilateral inexact supervision, MIPL can work with dual inexact supervision. To the best of our knowledge, there are only two viable MIPL algorithms. Tang et al. Tang et al. (2024) were the first to introduce the framework of MIPL along with a Gaussian processes-based algorithm (MiplGp), which follows an instance-space paradigm. MiplGp begins by augmenting a negative class for each candidate label set, subsequently treating the candidate label set of each multi-instance bag as that of each instance within the bag. Finally, it employs the Dirichlet disambiguation strategy and the Gaussian processes regression model for disambiguation. Differing from MiplGp, DeMipl follows the embedded-space paradigm: it aggregates each multi-instance bag into a feature representation and employs a momentum-based disambiguation strategy to find true labels from candidate label sets Tang et al. (2023). However, both methods primarily depend on mapping from instances or multi-instance bags to candidate label sets for disambiguation, without considering the CLI proposed in this paper.

3 Methodology

3.1 Preliminaries

In this study, we define a MIPL training dataset as $\mathcal{D}=\{(\boldsymbol{X}_{i},\mathcal{S}_{i})\mid 1\leq i\leq m\}$, comprising $m$ multi-instance bags and their corresponding candidate label sets. Specifically, a candidate label set $\mathcal{S}_{i}$ consists of one true label and multiple false positive labels, but the true label is unknown. It is crucial to note that a bag contains at least one instance pertaining to the true label, while excluding any instances corresponding to false positive labels. The instance space is denoted as $\mathcal{X}=\mathbb{R}^{d}$, while the label space $\mathcal{Y}=\{1,2,\cdots,k\}$ encompasses $k$ class labels. The $i$-th bag $\boldsymbol{X}_{i}=\{\boldsymbol{x}_{i,1},\boldsymbol{x}_{i,2},\cdots,\boldsymbol{x}_{i,n_{i}}\}$ comprises $n_{i}$ instances of dimension $d$. Both the candidate label set $\mathcal{S}_{i}$ and the non-candidate label set $\bar{\mathcal{S}}_{i}$ are proper subsets of the label space $\mathcal{Y}$, satisfying the condition $|\mathcal{S}_{i}|+|\bar{\mathcal{S}}_{i}|=|\mathcal{Y}|=k$, where $|\cdot|$ denotes the cardinality of a set.
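As a concrete reading of these definitions, one training pair $(\boldsymbol{X}_{i},\mathcal{S}_{i})$ could be held in a small container such as the one below. The names `MiplSample` and `check_sample` are hypothetical and not part of any released implementation; the checks simply restate the conditions given above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class MiplSample:
    """One training pair (X_i, S_i); the class name is hypothetical."""
    instances: List[List[float]]  # n_i instances, each of dimension d
    candidates: FrozenSet[int]    # S_i, labels drawn from {1, ..., k}

    def non_candidates(self, k: int) -> FrozenSet[int]:
        """Return the complement of S_i in the label space {1, ..., k}."""
        return frozenset(range(1, k + 1)) - self.candidates


def check_sample(sample: MiplSample, k: int, d: int) -> None:
    """Raise ValueError if the sample violates the stated conditions."""
    if not sample.instances:
        raise ValueError("a bag needs at least one instance")
    if any(len(x) != d for x in sample.instances):
        raise ValueError("every instance must have dimension d")
    if not sample.candidates <= frozenset(range(1, k + 1)):
        raise ValueError("candidate labels must lie in {1, ..., k}")
    # Proper subsets: neither S_i nor its complement may be the whole space.
    if not 1 <= len(sample.candidates) <= k - 1:
        raise ValueError("S_i and its complement must both be non-empty")


bag = MiplSample(
    instances=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    candidates=frozenset({1, 2, 5, 7}),
)
check_sample(bag, k=7, d=2)
s_bar = bag.non_candidates(k=7)
print(sorted(s_bar))  # [3, 4, 6]
```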

The pipeline of the proposed EliMipl is depicted in Figure 3, which contains three main components: an instance-level feature extractor, a scaled additive attention mechanism, and a classifier. When presented with a multi-instance bag $\boldsymbol{X}_{i}$ along with its associated candidate label set $\mathcal{S}_{i}$ and non-candidate label set $\bar{\mathcal{S}}_{i}$, we initially employ a feature extractor to procure instance-level feature representations. Subsequently, the scaled additive attention mechanism is applied to aggregate a bag of instances into a unified bag-level feature representation. Finally, the classifier is invoked to estimate the class probabilities based on the bag-level features. To utilize the CLI, we introduce a mapping loss $\mathcal{L}_{\text{ma}}$ and a sparsity loss $\mathcal{L}_{\text{sp}}$ to disambiguate the candidate label sets, along with an inhibition loss $\mathcal{L}_{\text{in}}$ to suppress the model’s prediction over the non-candidate label sets.

3.2 Instance-Level Feature Extractor

For a given multi-instance bag $\boldsymbol{X}_{i}=\{\boldsymbol{x}_{i,1},\boldsymbol{x}_{i,2},\cdots,\boldsymbol{x}_{i,n_{i}}\}$ with $n_{i}$ instances, instance-level feature representations $\boldsymbol{H}_{i}$ are learned using a feature extractor $\psi(\cdot)$ as follows:

$$\boldsymbol{H}_{i}=\psi(\boldsymbol{X}_{i})=\{\boldsymbol{h}_{i,1},\boldsymbol{h}_{i,2},\cdots,\boldsymbol{h}_{i,n_{i}}\}, \tag{1}$$

where $\boldsymbol{h}_{i,j}\in\mathbb{R}^{l}$ indicates the feature representation of the $j$-th instance within the $i$-th multi-instance bag, and $\psi(\cdot)$ is a neural network comprised of two components, i.e., $\psi(\boldsymbol{X}_{i})=\psi_{2}(\psi_{1}(\boldsymbol{X}_{i}))$. Here, $\psi_{1}(\cdot)$ is a feature extractor that can be tailored to the specific characteristics of the datasets, and $\psi_{2}(\cdot)$ is composed of fully connected layers that map instance-level features to an embedded space of dimension $l$.
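The composition $\psi_{2}(\psi_{1}(\cdot))$ applied instance by instance can be sketched as follows. This is a toy illustration under explicit assumptions: $\psi_{1}$ is taken to be the identity, $\psi_{2}$ is a single fully connected layer with a ReLU activation, and the weights are fixed by hand rather than learned. None of these choices is taken from the text, which leaves both components open.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(W, b, x):
    """Compute W x + b, with W stored as a list of l rows of length d."""
    return [sum(w * xv for w, xv in zip(row, x)) + bv for row, bv in zip(W, b)]


def psi1(x):
    """Dataset-specific extractor; the identity here is only a placeholder."""
    return list(x)


def psi2(x, W, b):
    """One fully connected layer mapping a d-dim input to l dimensions."""
    return relu(linear(W, b, x))


def extract_features(bag, W, b):
    """Eq. (1): apply psi = psi2(psi1(.)) to every instance of the bag."""
    return [psi2(psi1(x), W, b) for x in bag]


# Toy bag with n_i = 3 instances of dimension d = 2, embedded into l = 3.
bag = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hand-picked, not learned
b = [0.0, 0.0, -1.0]
H = extract_features(bag, W, b)
print(H[0])  # [1.0, 2.0, 2.0]
```

The output keeps one feature vector per instance, so a bag of $n_{i}$ instances yields $n_{i}$ vectors of dimension $l$.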

3.3 Scaled Additive Attention Mechanism

To aggregate instance-level features into bag-level representations, we introduce a scaled additive attention mechanism specifically designed for MIPL. The existing attention mechanism for MIPL utilizes the sigmoid function for calculating attention scores, followed by normalization Tang et al. (2023). The attention scores derived through the sigmoid function are constrained within the range $(0,1)$, leading to a limited distinction between instances. Therefore, we introduce an additive attention mechanism calculating attention scores by the softmax function to distinguish instances, equipped with a scaling factor to prevent vanishing gradients Vaswani et al. (2017). Specifically, we first denote the output of the additive attention mechanism as $\xi(\boldsymbol{h}_{i,j})$, quantifying the impact of the $j$-th instance on the $i$-th bag as follows:

$$\xi(\boldsymbol{h}_{i,j})=\boldsymbol{W}^{\top}\left(\text{tanh}(\boldsymbol{W}_{t}^{\top}\boldsymbol{h}_{i,j}+\boldsymbol{b}_{t})\odot\text{sigm}(\boldsymbol{W}_{s}^{\top}\boldsymbol{h}_{i,j}+\boldsymbol{b}_{s})\right), \tag{2}$$

where $\boldsymbol{W}^{\top}$, $\boldsymbol{W}_{t}^{\top}$, $\boldsymbol{W}_{s}^{\top}$, $\boldsymbol{b}_{t}$, and $\boldsymbol{b}_{s}$ are learnable parameters. $\text{tanh}(\cdot)$ and $\text{sigm}(\cdot)$ are the hyperbolic tangent and sigmoid functions, respectively. The operator $\odot$ denotes element-wise multiplication. Then, we normalize $\xi(\boldsymbol{h}_{i,j})$ using softmax with a scaling factor $1/\sqrt{l}$ to derive the attention score:

$$a_{i,j}=\frac{\exp\left(\xi(\boldsymbol{h}_{i,j})/\sqrt{l}\right)}{\sum_{j^{\prime}=1}^{n_{i}}\exp\left(\xi(\boldsymbol{h}_{i,j^{\prime}})/\sqrt{l}\right)}, \tag{3}$$

where $a_{i,j}$ represents the attention score of the $j$-th instance in the $i$-th bag. Finally, we consolidate the instance-level features into a bag-level representation, as demonstrated below:

$$\boldsymbol{z}_{i}=\sum_{j=1}^{n_{i}}a_{i,j}\boldsymbol{h}_{i,j}, \tag{4}$$

where $\boldsymbol{z}_{i}$ represents the bag-level representation of the $i$-th multi-instance bag. The bag-level representations of all multi-instance bags in the training dataset are denoted by $\mathcal{Z}$.
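Equations (2) to (4) can be traced on a toy bag with the plain-Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the parameter values are arbitrary rather than learned, the hidden width of the two gates (two units here) is a hypothetical choice that the text does not specify, and the matrices are stored already transposed, as lists of rows.

```python
import math


def affine(W, b, h):
    """Compute W h + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + bv for row, bv in zip(W, b)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def xi(h, W, Wt, bt, Ws, bs):
    """Eq. (2): gated additive score of one instance feature h."""
    t = [math.tanh(v) for v in affine(Wt, bt, h)]
    s = [sigmoid(v) for v in affine(Ws, bs, h)]
    return sum(w * tv * sv for w, tv, sv in zip(W, t, s))


def attention_pool(H, W, Wt, bt, Ws, bs):
    """Eqs. (3)-(4): scaled softmax scores a_i and bag feature z_i."""
    l = len(H[0])
    scaled = [xi(h, W, Wt, bt, Ws, bs) / math.sqrt(l) for h in H]
    top = max(scaled)  # subtracting the max leaves the softmax unchanged
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    a = [e / total for e in exps]
    z = [sum(a_j * h[p] for a_j, h in zip(a, H)) for p in range(l)]
    return a, z


# Toy instance features (n_i = 3, l = 2) and arbitrary, unlearned parameters.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wt, bt = [[0.5, -0.5], [0.3, 0.8]], [0.0, 0.1]
Ws, bs = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W = [1.0, -1.0]
a, z = attention_pool(H, W, Wt, bt, Ws, bs)
print(a, z)
```

Because the scores pass through a softmax, they are positive and sum to one, so $\boldsymbol{z}_{i}$ is a convex combination of the instance features. Subtracting the maximum before exponentiating is a standard numerical safeguard and does not change the result.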

3.4 Conjugate Label Information

Candidate Label Information.

Once the bag-level feature representations are acquired, the subsequent task is to disambiguate the candidate label set. The disambiguation entails establishing the mapping relationship from the bag-level features to their corresponding candidate label set. The goal of precise mapping is to guide the classifier to assign higher class probabilities to true labels and lower probabilities to false positive labels. To attain this objective, we employ a weighted mapping loss function:

ma(𝒵,𝒮)=1mi=1mc𝒮iwi,c(t)log(fc(𝒛i)),subscriptma𝒵𝒮1𝑚superscriptsubscript𝑖1𝑚subscript𝑐subscript𝒮𝑖superscriptsubscript𝑤𝑖𝑐𝑡subscript𝑓𝑐subscript𝒛𝑖\mathcal{L}_{\text{ma}}(\mathcal{Z},\mathcal{S})=-\frac{1}{m}\sum_{i=1}^{m}% \sum_{c\in\mathcal{S}_{i}}w_{i,c}^{(t)}\log(f_{c}(\boldsymbol{z}_{i})),caligraphic_L start_POSTSUBSCRIPT ma end_POSTSUBSCRIPT ( caligraphic_Z , caligraphic_S ) = - divide start_ARG 1 end_ARG start_ARG italic_m end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_c ∈ caligraphic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i , italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT roman_log ( italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) , (5)

where $f$ is the classifier, and $f_{c}(\cdot)$ represents the classifier’s prediction probability for the candidate label $c$. $w_{i,c}^{(t)}$ denotes the weight assigned to the prediction of the $c$-th class at the $t$-th epoch, using the features of the $i$-th bag as input for the classifier. For candidate labels, we initialize $w_{i,c}^{(0)}=\frac{1}{|\mathcal{S}_{i}|}$ through an averaging approach. During training, we update $w_{i,c}^{(t)}$ as a weighted sum of the weight from the previous epoch and the classifier’s output at the current epoch, normalized over the candidate label set, as follows:

$$w_{i,c}^{(t)}=\rho^{(t)}w_{i,c}^{(t-1)}+(1-\rho^{(t)})\frac{f_{c}(\boldsymbol{z}_{i})}{\sum_{c^{\prime}\in\mathcal{S}_{i}}f_{c^{\prime}}(\boldsymbol{z}_{i})}, \tag{6}$$

where $\rho^{(t)}=(T-t)/T$ is dynamically adjusted across epochs, and $T$ is the maximum number of training epochs.
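To make the update in Equation (6) concrete, the following is a minimal sketch for a single bag in plain Python. The function names (`rho`, `update_weights`), the dictionary layout of the weights, and the example probabilities are illustrative assumptions and are not taken from the released EliMipl code.

```python
def rho(t, T):
    """Coefficient rho^(t) = (T - t) / T from Equation (6)."""
    return (T - t) / T


def update_weights(prev_w, probs, candidates, t, T):
    """Apply Equation (6) once for a single bag.

    prev_w:     dict mapping each candidate label to w^(t-1)
    probs:      class probabilities f_c(z_i) over the whole label space
    candidates: candidate label indices S_i
    """
    # Normalize the classifier output over the candidate labels only.
    norm = sum(probs[c] for c in candidates)
    r = rho(t, T)
    return {c: r * prev_w[c] + (1.0 - r) * probs[c] / norm for c in candidates}


candidates = [0, 2]                                   # hypothetical S_i
w0 = {c: 1.0 / len(candidates) for c in candidates}   # uniform initialization
probs = [0.6, 0.2, 0.2]                               # hypothetical classifier output
w1 = update_weights(w0, probs, candidates, t=1, T=100)
print(w1)
```

Both terms on the right-hand side of Equation (6) sum to one over $\mathcal{S}_{i}$, so the updated weights remain a distribution over the candidate labels. Early epochs stay close to the uniform initialization, and at $t=T$ the weights coincide with the normalized classifier output.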

While the mapping loss can assess the relative labeling probabilities of candidate labels, it fails to capture the mutually exclusive relationships among the candidate labels. To address this issue in PLL, Feng and An (2019) introduced the maximum infinity norm on the predicted probabilities of all classes and alternately optimized it by solving $k$ independent quadratic programming problems. However, as depicted in Figure 1(b), we observe that each row of the true label matrix exhibits sparsity. Although the true labels remain inaccessible during the training process, we encourage the classifier to generate sparse prediction probabilities for the candidate labels. Specifically, the goal is to push the prediction probability of the unknown true label toward $1$ while simultaneously driving the prediction probabilities of other candidate labels toward $0$. Therefore, we directly capture the mutually exclusive relationships among the candidate labels by implementing the sparsity loss, as detailed below:

$$\mathcal{L}_{\text{sp}}(\mathcal{S})=\frac{1}{m}\sum_{i=1}^{m}\|\boldsymbol{P}_{i}\odot\boldsymbol{S}_{i}\|_{0}, \tag{7}$$

where $\boldsymbol{P}_{i}$ and $\boldsymbol{S}_{i}$ are the prediction probabilities and the candidate label set matrix of the $i$-th bag, respectively, and $\odot$ denotes element-wise multiplication. Since minimizing the $\ell_{0}$ norm is NP-hard, we employ the $\ell_{1}$ norm as a surrogate for the $\ell_{0}$ norm, promoting sparsity while allowing for efficient optimization Wright and Ma (2022).
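The relation between the $\ell_{0}$ objective in Equation (7) and its $\ell_{1}$ surrogate can be sketched as follows. The sketch assumes that $\boldsymbol{S}_{i}$ is a binary indicator vector over the label space (1 for candidate labels, 0 otherwise); the function names and toy values are hypothetical.

```python
def masked(probs, cand_mask):
    """Element-wise product P_i (.) S_i: keep only candidate-label probabilities."""
    return [p * s for p, s in zip(probs, cand_mask)]


def l0_norm(v):
    """Number of non-zero entries (the quantity in Equation (7))."""
    return sum(1 for x in v if x != 0)


def l1_norm(v):
    """Sum of absolute values (the surrogate used for optimization)."""
    return sum(abs(x) for x in v)


def sparsity_loss(P, S, norm):
    """Average the chosen norm of the masked predictions over m bags."""
    return sum(norm(masked(p, s)) for p, s in zip(P, S)) / len(P)


# Two toy bags over a label space of size 3.
P = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]   # hypothetical prediction probabilities
S = [[1, 1, 0], [0, 1, 1]]               # assumed binary candidate indicators

loss_l0 = sparsity_loss(P, S, l0_norm)
loss_l1 = sparsity_loss(P, S, l1_norm)
print(loss_l0, loss_l1)
```

The $\ell_{0}$ version only counts non-zero candidate probabilities and is piecewise constant, whereas the $\ell_{1}$ version varies continuously with the predictions, which is what makes it usable with gradient-based training.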

Algorithm 1 Training Procedure of EliMipl

Inputs:
$\mathcal{D}$: MIPL training set $\{(\boldsymbol{X}_{i},\mathcal{S}_{i})\mid 1\leq i\leq m\}$
$\mu$, $\gamma$: Weights for sparsity loss and inhibition loss
$T$: Maximum number of epochs
Process:

1:  Initialize uniform weights $w_{i,c}^{(0)}$ ($c\in\mathcal{S}_{i}$)
2:  for $t=1$ to $T$ do
3:     Fetch a mini-batch $\mathcal{B}$ from $\mathcal{D}$
4:     for $\boldsymbol{X}\in\mathcal{B}$ do
5:        Extract instance-level features using Equation (1)
6:        Calculate attention scores using Equations (2, 3)
7:        Aggregate instance-level features into bag-level feature representations via Equation (4)
8:        Update weights $w_{i,c}^{(t)}$ based on Equation (6)
9:        Calculate $\mathcal{L}_{\text{ma}}$, $\mathcal{L}_{\text{sp}}$, and $\mathcal{L}_{\text{in}}$ via Equations (5, 7, 8)
10:        Calculate total loss $\mathcal{L}$ as in Equation (9)
11:        Set gradient $-\nabla_{\Phi}\mathcal{L}$
12:        Update $\Phi$ using optimizer
13:     end for
14:  end for

| Dataset | #bag | #ins | max. #ins | min. #ins | avg. #ins | #dim | #class | avg. #CLs | domain |
|---|---|---|---|---|---|---|---|---|---|
| MNIST-MIPL (MNIST) | 500 | 20664 | 48 | 35 | 41.33 | 784 | 5 | 2, 3, 4 | image |
| FMNIST-MIPL (FMNIST) | 500 | 20810 | 48 | 36 | 41.62 | 784 | 5 | 2, 3, 4 | image |
| Birdsong-MIPL (Birdsong) | 1300 | 48425 | 76 | 25 | 37.25 | 38 | 13 | 2, 3, 4 | biology |
| SIVAL-MIPL (SIVAL) | 1500 | 47414 | 32 | 31 | 31.61 | 30 | 25 | 2, 3, 4 | image |
| CRC-MIPL-Row (C-Row) | 7000 | 56000 | 8 | 8 | 8 | 9 | 7 | 2.08 | image |
| CRC-MIPL-SBN (C-SBN) | 7000 | 63000 | 9 | 9 | 9 | 15 | 7 | 2.08 | image |
| CRC-MIPL-KMeansSeg (C-KMeans) | 7000 | 30178 | 6 | 3 | 4.311 | 6 | 7 | 2.08 | image |
| CRC-MIPL-SIFT (C-SIFT) | 7000 | 175000 | 25 | 25 | 25 | 128 | 7 | 2.08 | image |

Table 1: Characteristics of the benchmark and real-world MIPL datasets.

Non-candidate Label Information.

For a multi-instance bag $\boldsymbol{X}_{i}$ linked to a candidate label set $\mathcal{S}_{i}$, the non-candidate label set $\bar{\mathcal{S}}_{i}$ complements the candidate label set $\mathcal{S}_{i}$ within the label space $\mathcal{Y}$. As the label space has a fixed size, an antagonistic relationship arises between the non-candidate and candidate label sets. To enhance the classifier’s prediction probabilities for the candidate label set, a natural strategy is to diminish the classifier’s prediction probabilities for the non-candidate label set. Motivated by this insight, we introduce an inhibition loss as follows:

$$\mathcal{L}_{\text{in}}(\mathcal{Z},\bar{\mathcal{S}})=-\frac{1}{m}\sum_{i=1}^{m}\sum_{\bar{c}\in\bar{\mathcal{S}}_{i}}\log(1-f_{\bar{c}}(\boldsymbol{z}_{i})), \tag{8}$$

where $f_{\bar{c}}(\cdot)$ denotes the classifier’s prediction probability over the non-candidate label $\bar{c}$.
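A minimal sketch of Equation (8) in plain Python is given below. The binary mask encoding of the candidate sets, the clamping constant `eps` (added only to avoid `log(0)`), and the function name are assumptions made for illustration.

```python
import math


def inhibition_loss(P, S, eps=1e-12):
    """Equation (8): penalize probability mass on non-candidate labels.

    P: per-bag class probabilities f_c(z_i)
    S: per-bag binary masks, 1 for candidate labels and 0 for non-candidates
    """
    total = 0.0
    for probs, mask in zip(P, S):
        for p, s in zip(probs, mask):
            if s == 0:  # label belongs to the non-candidate set
                total += math.log(max(1.0 - p, eps))
    return -total / len(P)


S = [[1, 1, 0]]                               # third label is a non-candidate
high = inhibition_loss([[0.5, 0.3, 0.2]], S)  # 0.2 of the mass on the non-candidate
low = inhibition_loss([[0.6, 0.39, 0.01]], S) # almost no mass on the non-candidate
print(high, low)
```

The loss is zero when no probability is assigned to non-candidate labels and grows as that probability approaches one, which matches the stated goal of suppressing predictions outside the candidate set.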

CLI Loss.

During training, CLI is exploited through a loss function, named the CLI loss, which is a weighted sum of the mapping loss, sparsity loss, and inhibition loss, as shown below:

$$\mathcal{L}=\mathcal{L}_{\text{ma}}(\mathcal{Z},\mathcal{S})+\mu\mathcal{L}_{\text{sp}}(\mathcal{S})+\gamma\mathcal{L}_{\text{in}}(\mathcal{Z},\bar{\mathcal{S}}), \tag{9}$$

where $\mu$ and $\gamma$ represent the weighting coefficients for the sparsity loss and the inhibition loss, respectively.
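The combination in Equation (9) can be sketched as below. Equation (5) is not reproduced in this part of the paper, so the mapping loss is assumed here to be a weighted cross-entropy over the candidate labels; the sparsity term uses the $\ell_{1}$ surrogate, and `eps` and all names are hypothetical choices for the example.

```python
import math


def cli_loss(P, S, W, mu, gamma, eps=1e-12):
    """Return (total, (l_ma, l_sp, l_in)) for a batch of m bags.

    P: per-bag class probabilities
    S: per-bag binary candidate masks
    W: per-bag mapping-loss weights (zero on non-candidate labels)
    """
    m = len(P)
    l_ma = l_sp = l_in = 0.0
    for probs, mask, weights in zip(P, S, W):
        for p, s, w in zip(probs, mask, weights):
            if s == 1:
                l_ma -= w * math.log(max(p, eps))    # assumed form of Equation (5)
                l_sp += abs(p)                       # l1 surrogate of Equation (7)
            else:
                l_in -= math.log(max(1.0 - p, eps))  # Equation (8)
    l_ma, l_sp, l_in = l_ma / m, l_sp / m, l_in / m
    return l_ma + mu * l_sp + gamma * l_in, (l_ma, l_sp, l_in)


P = [[0.5, 0.3, 0.2]]   # hypothetical prediction probabilities
S = [[1, 1, 0]]         # candidate labels: 0 and 1
W = [[0.5, 0.5, 0.0]]   # uniform weights over the candidates
total, (l_ma, l_sp, l_in) = cli_loss(P, S, W, mu=1.0, gamma=0.1)
print(total, l_ma, l_sp, l_in)
```

Setting `mu` or `gamma` to zero switches off the corresponding term, so the same function also covers a mapping-only objective.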

Algorithm 1 summarizes the training procedure of EliMipl. First, the algorithm initializes the weights for the mapping loss uniformly (Step 1). Subsequently, within each mini-batch, instance-level features are extracted and aggregated into bag-level features (Steps 5-7). The algorithm then updates the weights for the mapping loss and calculates the total loss function (Steps 8-10). Finally, the model is optimized using gradient descent (Steps 11 and 12).

4 Experiments

In this section, we begin by introducing the experimental configurations, including the datasets, comparative algorithms, and the parameters used in the experiments. Subsequently, we present the experimental results on both benchmark and real-world datasets. Finally, we conduct further analysis to gain deeper insights into the impact of CLI.

4.1 Experimental Configurations

Datasets.

We employ four benchmark MIPL datasets Tang et al. (2024, 2023): MNIST-MIPL, FMNIST-MIPL, Birdsong-MIPL, and SIVAL-MIPL, spanning diverse domains such as image analysis and biology LeCun et al. (1998); Xiao et al. (2017); Briggs et al. (2012); Settles et al. (2007). The characteristics of the datasets are presented in Table 1, where the names within parentheses in the first column are the abbreviated names of the MIPL datasets. The numbers of multi-instance bags and total instances are denoted by #bag and #ins, respectively. Additionally, we use max. #ins, min. #ins, and avg. #ins to indicate the maximum, minimum, and average instance count within all bags. The dimensionality of the instance-level feature is represented by #dim. Labeling details are given by #class and avg. #CLs, signifying the size of the label space and the average size of the candidate label sets, respectively. For a comprehensive performance assessment, we vary the number of false positive labels, denoted as $r$ ($|\mathcal{S}_{i}|=r+1$).
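The relation $|\mathcal{S}_{i}|=r+1$ can be illustrated with a small sketch that builds a candidate label set from one true label and $r$ false positives. Drawing the false positives uniformly at random from the remaining classes is an assumption made only for this example, since this section does not specify the sampling scheme; the function name `make_candidate_set` is hypothetical.

```python
import random


def make_candidate_set(true_label, num_classes, r, rng):
    """Return a candidate label set holding the true label and r false positives."""
    if not 0 <= r < num_classes:
        raise ValueError("r must satisfy 0 <= r < num_classes")
    others = [c for c in range(num_classes) if c != true_label]
    false_positives = rng.sample(others, r)  # assumed: uniform, without replacement
    return {true_label, *false_positives}


rng = random.Random(0)
sets = {r: make_candidate_set(true_label=4, num_classes=13, r=r, rng=rng)
        for r in (1, 2, 3)}
for r, s in sets.items():
    print(r, sorted(s))
```

With 13 classes, as in Birdsong-MIPL, the three settings produce candidate sets of size 2, 3, and 4, matching the avg. #CLs column of Table 1.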

The CRC-MIPL dataset is a real-world MIPL dataset for colorectal cancer classification. We utilize multi-instance features generated by four image bag generators Wei and Zhou (2016): Row Maron and Ratan (1998), single blob with neighbors (SBN) Maron and Ratan (1998), k-means segmentation (KMeansSeg) Zhang et al. (2002), and scale-invariant feature transform (SIFT) Lowe (2004).

The appendix contains detailed information about the datasets and the four image bag generators.

Comparative Algorithms.

We compare EliMipl with two established MIPL algorithms: MiplGp Tang et al. (2024) and DeMipl Tang et al. (2023). These represent the entirety of available MIPL methods. Furthermore, we include four PLL algorithms: Proden Lv et al. (2020), Rc Feng et al. (2020), Lws Wen et al. (2021), and Pl-aggd Wang et al. (2022a). The first three algorithms can be equipped with diverse backbone networks, such as linear models and MLP. Due to space constraints, we present the results obtained from the linear models in the main body, while the results with MLP are shown in the appendix. Parameters for all algorithms are selected based on recommendations from the original literature or refined through our own search for better results.

Since PLL algorithms are not directly tailored for MIPL data, two common strategies, known as the Mean strategy and the MaxMin strategy, are employed to adapt MIPL data for PLL algorithms Tang et al. (2024). The Mean strategy involves calculating average feature values across all instances within a bag, resulting in a bag-level feature representation. In contrast, the MaxMin strategy identifies both the maximum and minimum feature values for each dimension among instances within a bag, and then concatenates these values to form a bag-level feature representation.
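The two bag-to-vector strategies described above can be sketched in a few lines of plain Python. A bag is represented as a list of equal-length instance feature lists; placing the maxima before the minima in the concatenation is an assumption, as are the function names.

```python
def mean_strategy(bag):
    """Average each feature dimension over all instances in the bag."""
    n = len(bag)
    return [sum(column) / n for column in zip(*bag)]


def maxmin_strategy(bag):
    """Concatenate per-dimension maxima and minima (assumed order: max, then min)."""
    columns = list(zip(*bag))
    return [max(c) for c in columns] + [min(c) for c in columns]


# A toy bag with three instances and two feature dimensions.
bag = [[1.0, 4.0],
       [3.0, 0.0],
       [2.0, 2.0]]

mean_feat = mean_strategy(bag)
maxmin_feat = maxmin_strategy(bag)
print(mean_feat, maxmin_feat)
```

The Mean strategy keeps the original dimensionality, while the MaxMin strategy doubles it; both discard the instance structure of the bag before the PLL algorithm sees the data.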

Implementation.

We implement EliMipl using PyTorch and execute it on a single NVIDIA Tesla V100 GPU. We utilize the stochastic gradient descent (SGD) optimizer with a momentum value of $0.9$ and a weight decay of $0.0001$. The initial learning rate is selected from the set $\{0.01,0.05\}$ and accompanied by a cosine annealing technique. We set the number of epochs uniformly to $100$ for all datasets. For the MNIST-MIPL and FMNIST-MIPL datasets, $\mu$ is set to $1$ or $0.1$, $\gamma$ is chosen from $\{0.1,0.5\}$, and the feature extraction network $\psi_{1}(\cdot)$ is a two-layer convolutional neural network. For the remaining datasets, we set both $\mu$ and $\gamma$ to $10$, and $\psi_{1}(\cdot)$ is an identity transformation. The feature transformation network $\psi_{2}(\cdot)$ is implemented by a fully connected network, with the dimension $l$ set to $512$ for the CRC-MIPL dataset and $128$ for the other datasets. The dataset partitioning is consistent with that of DeMipl. We conduct ten random train/test splits with a ratio of $7:3$. We report the mean accuracies and standard deviations obtained from the ten runs, with the highest accuracy highlighted in bold. The code of EliMipl can be found at https://github.com/tangw-seu/ELIMIPL.
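For reference, cosine annealing is commonly written in the closed form sketched below. The sketch assumes one scheduler step per epoch, a minimum learning rate of zero, and a period equal to the 100 training epochs; none of these details are stated in the text, and the helper name `cosine_annealed_lr` is hypothetical.

```python
import math


def cosine_annealed_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Closed-form cosine annealing from base_lr down to min_lr."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Learning rates at the start, middle, and end of a 100-epoch run.
schedule = [cosine_annealed_lr(0.01, e, 100) for e in (0, 50, 100)]
print(schedule)
```

The rate starts at the initial value, passes through half of it at the midpoint, and decays smoothly to the minimum at the final epoch.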

| Algorithm | $r$ | MNIST | FMNIST | Birdsong | SIVAL |
|---|---|---|---|---|---|
| EliMipl | 1 | .992±.007 | .903±.018 | .771±.018 | .675±.022 |
| | 2 | .987±.010 | .845±.026 | .745±.015 | .616±.025 |
| | 3 | .748±.144 | .702±.055 | .717±.017 | .600±.029 |
| DeMipl | 1 | .976±.008 | .881±.021 | .744±.016 | .635±.041 |
| | 2 | .943±.027 | .823±.028 | .701±.024 | .554±.051 |
| | 3 | .709±.088 | .657±.025 | .696±.024 | .503±.018 |
| MiplGp | 1 | .949±.016 | .847±.030 | .716±.026 | .669±.019 |
| | 2 | .817±.030 | .791±.027 | .672±.015 | .613±.026 |
| | 3 | .621±.064 | .670±.052 | .625±.015 | .569±.032 |
| Mean | | | | | |
| Proden | 1 | .605±.023 | .697±.042 | .296±.014 | .219±.014 |
| | 2 | .481±.036 | .573±.026 | .272±.019 | .184±.014 |
| | 3 | .283±.028 | .345±.027 | .211±.013 | .166±.017 |
| Rc | 1 | .658±.031 | .753±.042 | .362±.015 | .279±.011 |
| | 2 | .598±.033 | .649±.028 | .335±.011 | .258±.017 |
| | 3 | .392±.033 | .401±.063 | .298±.009 | .237±.020 |
| Lws | 1 | .463±.048 | .726±.031 | .265±.010 | .240±.014 |
| | 2 | .209±.028 | .720±.025 | .254±.010 | .223±.008 |
| | 3 | .205±.013 | .579±.041 | .237±.005 | .194±.026 |
| Pl-aggd | 1 | .671±.027 | .743±.026 | .353±.019 | .355±.015 |
| | 2 | .595±.036 | .677±.028 | .314±.018 | .315±.019 |
| | 3 | .380±.032 | .474±.057 | .296±.015 | .286±.018 |
| MaxMin | | | | | |
| Proden | 1 | .508±.024 | .424±.045 | .387±.014 | .316±.019 |
| | 2 | .400±.037 | .377±.040 | .357±.012 | .287±.024 |
| | 3 | .345±.048 | .309±.058 | .336±.012 | .250±.018 |
| Rc | 1 | .519±.028 | .731±.027 | .390±.014 | .306±.023 |
| | 2 | .469±.035 | .666±.027 | .371±.013 | .288±.021 |
| | 3 | .380±.048 | .524±.034 | .363±.010 | .267±.020 |
| Lws | 1 | .242±.042 | .435±.049 | .225±.038 | .289±.017 |
| | 2 | .239±.048 | .406±.040 | .207±.034 | .271±.014 |
| | 3 | .218±.017 | .318±.064 | .216±.029 | .244±.023 |
| Pl-aggd | 1 | .527±.035 | .391±.040 | .383±.014 | .397±.028 |
| | 2 | .439±.020 | .371±.037 | .372±.020 | .360±.029 |
| | 3 | .321±.043 | .327±.028 | .344±.011 | .328±.023 |

Table 2: The classification accuracies (mean±std) of EliMipl and comparative algorithms on the benchmark datasets with varying numbers of false positive candidate labels ($r\in\{1,2,3\}$).

4.2 Results on the Benchmark Datasets

Table 2 presents the results of EliMipl and the comparative algorithms on benchmark datasets, considering varying numbers of false positive labels ($r\in\{1,2,3\}$). Compared to MIPL algorithms, EliMipl consistently achieves higher average accuracy than DeMipl and MiplGp. Furthermore, in contrast to PLL algorithms, EliMipl significantly outperforms them in all cases.

For the MNIST-MIPL and FMNIST-MIPL datasets, each with $5$ class labels, EliMipl achieves an average accuracy at least $0.016$ higher than DeMipl and between $0.032$ and $0.17$ higher than MiplGp. In the case of the Birdsong-MIPL dataset, which comprises $13$ class labels, EliMipl’s average accuracy surpasses DeMipl by at least $0.021$ and MiplGp by at least $0.055$. The SIVAL-MIPL dataset spans $25$ class labels, encompassing diverse categories such as fruits and commodities. EliMipl’s average accuracy surpasses DeMipl by $0.04$ to $0.097$ and MiplGp by an average of $0.013$. Notably, DeMipl demonstrates relatively superior performance with fewer class labels, while MiplGp excels in scenarios with more class labels. In contrast, EliMipl consistently maintains the highest average accuracy with both fewer and more class labels. This indicates that EliMipl exhibits superior capabilities compared to existing MIPL algorithms. PLL algorithms exhibit decent results on the MNIST-MIPL and FMNIST-MIPL datasets when $r=1$ or $r=2$. However, their performance significantly deteriorates when $r=3$ or on the Birdsong-MIPL and SIVAL-MIPL datasets. This observation underscores the intrinsic complexity of MIPL problems, highlighting that they cannot be reduced to PLL problems.
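The accuracy gaps quoted above follow directly from the mean accuracies in Table 2. The short script below recomputes them; the values are copied from the table (ordered as $r=1,2,3$), while the dictionary layout and the helper name `gaps` are choices made for this example.

```python
# Mean accuracies copied from Table 2, ordered as r = 1, 2, 3.
ACC = {
    "EliMipl": {"MNIST": [.992, .987, .748], "FMNIST": [.903, .845, .702],
                "Birdsong": [.771, .745, .717], "SIVAL": [.675, .616, .600]},
    "DeMipl": {"MNIST": [.976, .943, .709], "FMNIST": [.881, .823, .657],
               "Birdsong": [.744, .701, .696], "SIVAL": [.635, .554, .503]},
    "MiplGp": {"MNIST": [.949, .817, .621], "FMNIST": [.847, .791, .670],
               "Birdsong": [.716, .672, .625], "SIVAL": [.669, .613, .569]},
}


def gaps(baseline, datasets):
    """Gaps (EliMipl minus baseline) over the given datasets, rounded to 3 decimals."""
    return [round(e - b, 3)
            for d in datasets
            for e, b in zip(ACC["EliMipl"][d], ACC[baseline][d])]


small = ["MNIST", "FMNIST"]
de_small, gp_small = gaps("DeMipl", small), gaps("MiplGp", small)
de_bird, gp_bird = gaps("DeMipl", ["Birdsong"]), gaps("MiplGp", ["Birdsong"])
de_sival, gp_sival = gaps("DeMipl", ["SIVAL"]), gaps("MiplGp", ["SIVAL"])
gp_sival_mean = round(sum(gp_sival) / len(gp_sival), 3)

print(min(de_small), min(gp_small), max(gp_small))
print(min(de_bird), min(gp_bird))
print(min(de_sival), max(de_sival), gp_sival_mean)
```

The computed minima, maxima, and mean agree with the figures stated in the paragraph above.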

The above analysis not only highlights the robustness of EliMipl across label spaces of different sizes but also emphasizes the limitations of addressing MIPL problems using PLL algorithms. The results underscore the importance of algorithmic designs specifically tailored to MIPL tasks.

4.3 Results on the Real-World Datasets

| Algorithm | C-Row | C-SBN | C-KMeans | C-SIFT |
|---|---|---|---|---|
| EliMipl | .433±.008 | .509±.007 | .546±.012 | .540±.010 |
| DeMipl | .408±.010 | .486±.014 | .521±.012 | .532±.013 |
| MiplGp | .432±.005 | .335±.006 | .329±.012 | – |
| Mean | | | | |
| Proden | .365±.009 | .392±.008 | .233±.018 | .334±.029 |
| Rc | .214±.011 | .242±.012 | .226±.009 | .209±.007 |
| Lws | .291±.010 | .310±.006 | .237±.008 | .270±.007 |
| Pl-aggd | .412±.008 | .480±.005 | .358±.008 | .363±.012 |
| MaxMin | | | | |
| Proden | .401±.007 | .447±.011 | .265±.027 | .291±.011 |
| Rc | .227±.012 | .338±.010 | .208±.007 | .246±.008 |
| Lws | .299±.008 | .382±.009 | .247±.005 | .230±.007 |
| Pl-aggd | .460±.008 | .524±.008 | .434±.009 | .285±.009 |

Table 3: The classification accuracies (mean±std) of EliMipl and comparative algorithms on the real-world datasets.

[Figure 4 plot: accuracy (y-axis) against $r=1,\ldots,10$ (x-axis) for EliMipl, DeMipl, and MiplGp.]
Figure 4: The classification accuracies of EliMipl, DeMipl, and MiplGp on the Birdsong-MIPL dataset with varying $r$.
[Figure 5 plot: attention score (y-axis) against instance index (x-axis), with insets showing the instances at indices 3, 25, and 28 (red) and 1, 23, and 31 (blue).]
Figure 5: Attention scores for a test bag. Red and blue are the attention scores of positive and negative instances, respectively.

Table 3 provides the results of EliMipl and the comparative algorithms on the CRC-MIPL dataset. The symbol – denotes cases where results could not be obtained due to memory overflow on our server. Compared to MIPL algorithms, EliMipl consistently achieves the highest average accuracies. Additionally, in comparison to PLL algorithms, EliMipl significantly outperforms them in $30$ out of $32$ cases.
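The count of 30 out of 32 cases can be reproduced from the mean accuracies in Table 3, as sketched below. The script compares means only and does not perform the significance test behind the word "significantly"; the data layout and variable names are choices made for this example.

```python
DATASETS = ["C-Row", "C-SBN", "C-KMeans", "C-SIFT"]
ELIMIPL = [.433, .509, .546, .540]

# Mean accuracies of the PLL algorithms copied from Table 3.
PLL = {
    ("Mean", "Proden"): [.365, .392, .233, .334],
    ("Mean", "Rc"): [.214, .242, .226, .209],
    ("Mean", "Lws"): [.291, .310, .237, .270],
    ("Mean", "Pl-aggd"): [.412, .480, .358, .363],
    ("MaxMin", "Proden"): [.401, .447, .265, .291],
    ("MaxMin", "Rc"): [.227, .338, .208, .246],
    ("MaxMin", "Lws"): [.299, .382, .247, .230],
    ("MaxMin", "Pl-aggd"): [.460, .524, .434, .285],
}

# Collect every (strategy, algorithm, dataset) case where EliMipl is not ahead.
exceptions = [(strategy, algo, DATASETS[j])
              for (strategy, algo), row in PLL.items()
              for j, acc in enumerate(row)
              if ELIMIPL[j] <= acc]
total = sum(len(row) for row in PLL.values())
wins = total - len(exceptions)
print(wins, total, exceptions)
```

The two remaining cases are both Pl-aggd with the MaxMin strategy, on C-Row and C-SBN.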

For the CRC-MIPL dataset, both EliMipl and DeMipl exhibit improved performance as the complexity of the image bag generator increases. This observation aligns with the intuition that, while avoiding overfitting, intricate feature extractors tend to produce higher classification accuracies. However, this phenomenon is not consistently observed for MiplGp and PLL algorithms. For example, these algorithms do not consistently achieve superior results on the C-KMeans and C-SIFT datasets compared to the results on the C-Row or C-SBN dataset. We posit that the intricate features exceed the capability limits of these algorithms. Thus, the development of effective MIPL algorithms becomes imperative.

In most cases, the MaxMin strategy yields better results than the Mean strategy. We postulate that this difference arises from the significant distinction between tissue cells and the background in the CRC-MIPL dataset. Applying the Mean strategy to features generated by simple image bag generators (i.e., Row and SBN) diminishes the distinction between tissue cells and the background, making it challenging to learn discriminative features. Conversely, for features generated by more complex image bag generators (i.e., KMeansSeg and SIFT), both the Mean and MaxMin strategies demonstrate their respective merits. Therefore, both strategies are worthy of consideration and application.

| Dataset | $r$ | EliMipl | Ma+Sp | Ma+In | Ma |
|---|---|---|---|---|---|
| Birdsong | 1 | .771±.018 | .742±.014 | .746±.015 | .733±.011 |
| | 2 | .745±.015 | .665±.024 | .689±.020 | .677±.017 |
| | 3 | .717±.017 | .592±.031 | .674±.023 | .652±.016 |
| SIVAL | 1 | .675±.022 | .618±.021 | .626±.019 | .620±.022 |
| | 2 | .616±.025 | .532±.041 | .550±.040 | .540±.038 |
| | 3 | .600±.029 | .545±.027 | .521±.025 | .521±.032 |

Table 4: The classification accuracies of the variants on the Birdsong-MIPL and SIVAL-MIPL datasets.

4.4 Further Analyses

Effectiveness of CLI. To evaluate the impact of CLI, we modify the loss function in Equation (9) and propose three variants: Ma+Sp, Ma+In, and Ma. These variants respectively represent the removal of the inhibition loss, the removal of the sparsity loss, and the simultaneous removal of both. Table 4 presents the experimental results on the Birdsong-MIPL and SIVAL-MIPL datasets. With Ma as the baseline, adding the sparsity loss or the inhibition loss alone tends to yield marginal performance improvements in most cases, while in some cases performance degradation may occur. In contrast, EliMipl, which uses the full CLI, demonstrates a substantial boost in classification accuracy.

Challenging Disambiguation Scenarios. We vary the number of false positive labels from $1$ to $10$ on the Birdsong-MIPL dataset. Figure 4 presents the experimental results of EliMipl, DeMipl, and MiplGp with $r\in\{1,2,\cdots,10\}$. Particularly, EliMipl and DeMipl adhere to the embedded-space paradigm, while MiplGp follows the instance-space paradigm. Three distinct phenomena are observed: (a) EliMipl consistently exhibits higher average accuracy compared to DeMipl and MiplGp. (b) For $r<7$, DeMipl outperforms MiplGp. However, when $r\geq 7$, MiplGp surpasses DeMipl. (c) The gaps between EliMipl and DeMipl are greater when $r\in\{6,7,8,9,10\}$ than when $r\in\{1,2,3,4,5\}$. The widening gaps signify the growing significance of the scaled additive attention mechanism and CLI. Therefore, Figure 4 clearly demonstrates that EliMipl outperforms both MiplGp and DeMipl in disambiguation, even when confronted with challenging scenarios.

| Algorithm | $r$ | MNIST | FMNIST | Birdsong | SIVAL |
|---|---|---|---|---|---|
| CLI loss | 1 | .992±.007 | .903±.018 | .771±.018 | .675±.022 |
| | 2 | .987±.010 | .845±.026 | .745±.015 | .616±.025 |
| | 3 | .748±.144 | .702±.055 | .717±.017 | .600±.029 |
| Ce-Sp-In | 1 | .899±.037 | .825±.035 | .740±.013 | .639±.030 |
| | 2 | .847±.027 | .679±.037 | .687±.024 | .587±.022 |
| | 3 | .636±.112 | .610±.037 | .592±.036 | .578±.022 |
| Ce | 1 | .919±.017 | .709±.257 | .704±.019 | .587±.028 |
| | 2 | .833±.016 | .645±.044 | .616±.032 | .534±.025 |
| | 3 | .628±.096 | .551±.032 | .459±.045 | .514±.025 |

Table 5: The comparison between the CLI loss and the CE loss.

Comparison of CLI and Cross-Entropy Loss. For a comparative analysis between the CLI loss and the cross-entropy (CE) loss, we substitute the mapping loss and the CLI loss with the CE loss, resulting in variants Ce-Sp-In (which utilizes CE loss, sparsity loss, and inhibition loss) and Ce (which only utilizes CE loss), respectively. Table 5 illustrates that, in all cases, accuracies obtained with the CLI loss surpass those achieved with Ce-Sp-In and Ce. Notably, the incorporation of inhibition loss and sparsity loss enhances the performance of the CE loss in 11 out of 12 cases, underscoring the importance of considering the intrinsic properties of the label space and the information from non-candidate label sets.

Interpretability of the Attention Mechanism. Figure 5 displays the attention scores of a test multi-instance bag in the MNIST-MIPL dataset ($r=1$). The bag contains positive instances represented by the digit $6$, while negative instances are drawn from digits $\{1,3,5,7,9\}$. Additionally, we visualize the attention scores of all three positive instances and the negative instances. Figure 5 illustrates that EliMipl can accurately identify all positive instances by assigning significantly higher attention scores to them, and the attention scores can be directly utilized for interpretability.

5 Conclusion

This paper investigates a multi-instance partial-label learning algorithm that introduces a scaled additive attention mechanism and exploits conjugate label information. This information includes both candidate label information and non-candidate label information, along with the sparsity of the true label matrix. Experimental results demonstrate the superiority of our proposed EliMipl algorithm. The utilization of CLI proves significantly more effective than relying on incomplete label information, especially in challenging disambiguation scenarios. In the future, we will explore instance-dependent MIPL algorithms and conduct theoretical analyses to develop more effective algorithms.

References

  • Amores [2013] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.
  • Briggs et al. [2012] Forrest Briggs, Xiaoli Z. Fern, and Raviv Raich. Rank-loss support instance machines for MIML instance annotation. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, pages 534–542, 2012.
  • Campanella et al. [2019] Gabriele Campanella, Matthew G Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and Thomas J Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine, 25(8):1301–1309, 2019.
  • Carbonneau et al. [2018] Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353, 2018.
  • Chen et al. [2022] Wei-Chen Chen, Xin-Yi Yu, and Linlin Ou. Pedestrian attribute recognition in video surveillance scenarios based on view-attribute attention localization. Machine Intelligence Research, 19(2):153–168, 2022.
  • Cour et al. [2011] Timothee Cour, Ben Sapp, and Ben Taskar. Learning from partial labels. The Journal of Machine Learning Research, 12:1501–1536, 2011.
  • Cui et al. [2023] Yufei Cui, Ziquan Liu, Xiangyu Liu, Xue Liu, Cong Wang, Tei-Wei Kuo, Chun Jason Xue, and Antoni B. Chan. Bayes-MIL: A new probabilistic perspective on attention-based multiple instance learning for whole slide images. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, pages 1–17, 2023.
  • Dietterich et al. [1997] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.
  • Feng and An [2019] Lei Feng and Bo An. Partial label learning with self-guided retraining. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, pages 3542–3549, 2019.
  • Feng et al. [2020] Lei Feng, Jiaqi Lv, Bo Han, Miao Xu, Gang Niu, Xin Geng, Bo An, and Masashi Sugiyama. Provably consistent partial-label learning. In Advances in Neural Information Processing Systems 33, Virtual Event, pages 10948–10960, 2020.
  • Gong et al. [2022] Xiuwen Gong, Dong Yuan, and Wei Bao. Partial label learning via label influence function. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, pages 7665–7678, 2022.
  • Grote et al. [2019] Anne Grote, Nadine S. Schaadt, Germain Forestier, Cédric Wemmert, and Friedrich Feuerhake. Crowdsourcing of histological image labeling and object delineation by medical students. IEEE Transactions on Medical Imaging, 38(5):1284–1294, 2019.
  • He et al. [2022] Shuo He, Lei Feng, Fengmao Lv, Wen Li, and Guowu Yang. Partial label learning with semantic label representations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, pages 545–553, 2022.
  • Ilse et al. [2018] Maximilian Ilse, Jakub M. Tomczak, and Max Welling. Attention-based deep multiple instance learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholmsmässan, Stockholm, Sweden, pages 2132–2141, 2018.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. [2023] Ximing Li, Yuanzhi Jiang, Changchun Li, Yiyuan Wang, and Jihong Ouyang. Learning with partial labels from semi-supervised perspective. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, pages 8666–8674, 2023.
  • Lowe [2004] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • Lv et al. [2020] Jiaqi Lv, Miao Xu, Lei Feng, Gang Niu, Xin Geng, and Masashi Sugiyama. Progressive identification of true labels for partial-label learning. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, pages 6500–6510, 2020.
  • Lyu et al. [2020] Gengyu Lyu, Songhe Feng, Yidong Li, Yi Jin, Guojun Dai, and Congyan Lang. HERA: Partial label learning by combining heterogeneous loss with sparse and low-rank regularization. ACM Transactions on Intelligent Systems and Technology, 11(3):1–19, 2020.
  • Maron and Ratan [1998] Oded Maron and Aparna Lakshmi Ratan. Multiple-instance learning for natural scene classification. In Proceedings of the 15th International Conference on Machine Learning, Madison, Wisconsin, USA, pages 341–349, 1998.
  • Settles et al. [2007] Burr Settles, Mark Craven, and Soumya Ray. Multiple-instance active learning. In Advances in Neural Information Processing Systems 20, Vancouver, British Columbia, Canada, pages 1289–1296, 2007.
  • Shi et al. [2020] Xiaoshuang Shi, Fuyong Xing, Yuanpu Xie, Zizhao Zhang, Lei Cui, and Lin Yang. Loss-based attention for deep multiple instance learning. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, pages 5742–5749, 2020.
  • Tan et al. [2023] Shuo Tan, Lei Zhang, Xin Shu, and Zizhou Wang. A feature-wise attention module based on the difference with surrounding features for convolutional neural networks. Frontiers of Computer Science, 17(6):176338:1–10, 2023.
  • Tang et al. [2023] Wei Tang, Weijia Zhang, and Min-Ling Zhang. Disambiguated attention embedding for multi-instance partial-label learning. In Advances in Neural Information Processing Systems 36, New Orleans, LA, USA, pages 56756–56771, 2023.
  • Tang et al. [2024] Wei Tang, Weijia Zhang, and Min-Ling Zhang. Multi-instance partial-label learning: Towards exploiting dual inexact supervision. Science China Information Sciences, 67(3):132103:1–14, 2024.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, pages 5998–6008, 2017.
  • Wang et al. [2018] Xinggang Wang, Yongluan Yan, Peng Tang, Xiang Bai, and Wenyu Liu. Revisiting multiple instance neural networks. Pattern Recognition, 74:15–24, 2018.
  • Wang et al. [2022a] Deng-Bao Wang, Min-Ling Zhang, and Li Li. Adaptive graph guided disambiguation for partial label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):8796–8811, 2022.
  • Wang et al. [2022b] Yang Wang, Jinjia Peng, Huibing Wang, and Meng Wang. Progressive learning with multi-scale attention network for cross-domain vehicle re-identification. Science China Information Sciences, 65(6):160103:1–15, 2022.
  • Wei and Zhou [2016] Xiu-Shen Wei and Zhi-Hua Zhou. An empirical study on image bag generators for multi-instance learning. Machine Learning, 105:155–198, 2016.
  • Wen et al. [2021] Hongwei Wen, Jingyi Cui, Hanyuan Hang, Jiabin Liu, Yisen Wang, and Zhouchen Lin. Leveraged weighted loss for partial label learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, pages 11091–11100, 2021.
  • Wright and Ma [2022] John Wright and Yi Ma. High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. Cambridge University Press, 2022.
  • Wu et al. [2022] Dong-Dong Wu, Deng-Bao Wang, and Min-Ling Zhang. Revisiting consistency regularization for deep partial label learning. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, pages 24212–24225, 2022.
  • Xiang et al. [2023] Jinxi Xiang, Xiyue Wang, Jun Zhang, Sen Yang, Xiao Han, and Wei Yang. Exploring low-rank property in multiple instance learning for whole slide image classification. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, pages 1–18, 2023.
  • Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
  • Yao et al. [2020] Yao Yao, Jiehui Deng, Xiuhua Chen, Chen Gong, Jianxin Wu, and Jian Yang. Deep discriminative CNN with temporal ensembling for ambiguously-labeled image classification. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, pages 12669–12676, 2020.
  • Zhang et al. [2002] Qi Zhang, Sally A. Goldman, Wei Yu, and Jason E. Fritts. Content-based image retrieval using multiple-instance learning. In Proceedings of the 19th International Conference on Machine Learning, Sydney, Australia, pages 682–689, 2002.
  • Zhang et al. [2022a] Fei Zhang, Lei Feng, Bo Han, Tongliang Liu, Gang Niu, Tao Qin, and Masashi Sugiyama. Exploiting class activation value for partial-label learning. In Proceedings of the 10th International Conference on Learning Representations, Virtual Event, pages 1–17, 2022.
  • Zhang et al. [2022b] Hongrun Zhang, Yanda Meng, Yitian Zhao, Yihong Qiao, Xiaoyun Yang, Sarah E Coupland, and Yalin Zheng. DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification. In Proceedings of the 35th IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pages 18802–18812, 2022.
  • Zhang et al. [2022c] Weijia Zhang, Xuanhui Zhang, Han-Wen Deng, and Min-Ling Zhang. Multi-instance causal representation learning for instance label prediction and out-of-distribution generalization. In Advances in Neural Information Processing Systems 35, New Orleans, LA, USA, pages 34940–34953, 2022.
  • Zhang [2021] Weijia Zhang. Non-i.i.d. multi-instance learning for predicting instance and bag labels with variational auto-encoder. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, Virtual Event / Montreal, Canada, pages 3377–3383, 2021.
  • Zhou et al. [2009] Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. Multi-instance learning by treating instances as non-i.i.d. samples. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Quebec, Canada, pages 1249–1256, 2009.
  • Zhou [2018] Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2018.