Multitask Active Learning for Graph Anomaly Detection

Wenjing Chang CNIC, CAS and UCASBeijingChina [email protected] , Kay Liu University of Illinois Chicago801 S Morgan StChicagoIllinoisUSA60607 [email protected] , Kaize Ding Northwestern University2006 Sheridan RoadEvanstonIllinoisUSA60208 [email protected] , Philip S. Yu University of Illinois Chicago801 S Morgan StChicagoIllinoisUSA60607 [email protected] and Jianjun Yu CNIC, CASBeijingChina [email protected]

(2024)

Abstract.

In the web era, graph machine learning has been widely used on ubiquitous graph-structured data. As a pivotal component for bolstering web security and enhancing the robustness of graph-based applications, the significance of graph anomaly detection is continually increasing. While Graph Neural Networks (GNNs) have demonstrated efficacy in supervised and semi-supervised graph anomaly detection, their performance is contingent upon the availability of sufficient ground truth labels. The labor-intensive nature of identifying anomalies from complex graph structures poses a significant challenge in real-world applications. Despite that, the indirect supervision signals from other tasks (e.g., node classification) are relatively abundant. In this paper, we propose a novel MultItask acTIve Graph Anomaly deTEction framework, namely MITIGATE. Firstly, by coupling node classification tasks, MITIGATE obtains the capability to detect out-of-distribution nodes without known anomalies. Secondly, MITIGATE quantifies the informativeness of nodes by the confidence difference across tasks, allowing samples with conflicting predictions to provide informative yet not excessively challenging information for subsequent training. Finally, to enhance the likelihood of selecting representative nodes that are distant from known patterns, MITIGATE adopts a masked aggregation mechanism for distance measurement, considering both inherent features of nodes and current labeled status. Empirical studies on four datasets demonstrate that MITIGATE significantly outperforms the state-of-the-art methods for anomaly detection. Our code is publicly available at: https://github.com/AhaChang/MITIGATE.

Anomaly Detection, Active Learning, Graph Neural Networks

^†^†copyright: acmcopyright^†^†journalyear: 2024^†^†doi: XXXXXXX.XXXXXXX^†^†price: 15.00^†^†isbn: 978-1-4503-XXXX-X/18/06^†^†ccs: Computing methodologies Active learning settings^†^†ccs: Mathematics of computing Graph algorithms^†^†ccs: Security and privacy Software and application security

1. Introduction

In light of the proliferation of the World Wide Web, graph-structured data has become increasingly pervasive. Concurrently, graph machine learning techniques have been extensively employed in various web mining tasks, such as recommendation systems (yang2021consisrec, ), community detection (jin2021survey, ), traffic forecasting (zhang2021graph, ), etc. To ensure the robustness and security of such graph learning-based applications in web environments, graph anomaly detection serves as an indispensable component. Graph anomaly detection aims to identify abnormal substructures (e.g., nodes) in graphs that exhibit significant deviations from established norms. It finds extensive applications in capturing high-risk entities and behaviors, including but not limited to spam detection (noekhah2020opinion, ), financial fraud detection (wang2019semi, ), and fake news detection (dou2021user, ).

In accordance with the insights presented in (liu2022bond, ; ding2021few, ), unsupervised methods heavily rely on the underlying data distribution to derive outlier patterns. Consequently, these methods may exhibit unstable performance when faced with data that contains specific domain knowledge or deviates from the assumed distribution. However, the intricate nature of graph structures, along with the high cost of manual annotation for both normal and anomalous nodes, prevents the collection of abundant ground truth labels, thereby limiting the feasibility of applying fully supervised learning approaches. This contradiction necessitates the exploration of alternative learning paradigms that can efficiently leverage limited supervision signals while also accommodating the complexities inherent in graph data.

Given the considerable expense of acquiring ground-truth labels for anomaly detection, it is a judicious choice to leverage the existing relatively abundant availability of labels for other graph learning tasks. In the application of graph learning, anomaly detection can enhance the stability of various tasks (e.g., node classification) by filtering out anomalies. Conversely, these tasks can serve as auxiliary tasks for anomaly detection and reciprocally provide external information (e.g., classification uncertainty) for augmenting the efficacy of anomaly detection. Besides, these auxiliary tasks inherently contain general information that can be mutually leveraged, providing an indirect supervision signals when the anomaly detection task is deficient in annotation (collobert2008unified, ; zhang2021survey, ).

Furthermore, to effectively leverage limited direct supervision signals for anomaly detection, some semi-supervised methods have been proposed (wang2019semi, ; dou2020enhancing, ; tang2022rethinking, ). These methods aim to enhance the learning of anomalous patterns based on the known anomalous nodes. The underlying assumption of these methods is that a subset of nodes including both normal and anomalous nodes have been annotated for training, and the labeled data is overall balanced. Nevertheless, it is non-trivial to acquire such an ideal training set from an unlabeled graph, one that contains sufficient knowledge for distinguishing normal and anomalous nodes, especially when constrained by a limited labeling budget. Active Learning (AL) paves a promising way for addressing the labeling problem, as it enables models to enhance their learning efficiency by actively requesting the labels in training data (campbell2000query, ; settles2009active, ; aggarwal2014active, ; zhang2021alg, ). It has also been applied in anomaly detection tasks (gornitz2013toward, ; ghasemi2011active, ; das2016incorporating, ), aiming at discovering more anomalies based on heuristic query strategies, such as uncertainty-based, diversity-based, and anomaly score-based strategies. In this way, a selection of samples that is more likely to contain a relatively higher proportion of anomalies can be used to fine-tune the model iteratively. However, existing query strategies for anomaly detection are primarily designed for independent and identically distributed data, which hardly consider relationships among samples and may not be well-suited for graph-structured data. Therefore, we urgently need a query strategy specifically for graph data to provide powerful direct supervision signals for anomaly detection.

To leverage the direct and indirect supervision signals efficiently and effectively, we propose a MultItask acTIve Graph Anomaly deTEction framework (MITIGATE). It incorporates external supervision signals from auxiliary tasks and productively exploits direct supervision signals by actively labeling nodes for anomaly detection. Specifically, we first consider a node classification task together with the anomaly detection task. We initialize the multitask framework by node classification and detect out-of-distribution samples (i.e., anomalies) with classification uncertainty. To query more valuable nodes, we then introduce a dynamic informativeness metric that relies on confidence difference and a representativeness metric based on the masked aggregation mechanism. Initially, in the absence of anomalies, we prioritize the informativeness metric to focus more on the classification uncertainty for picking out-of-distribution nodes to label. Afterward, in order to mitigate the variation in predictions for the same node across the two tasks, we consider nodes with greater confidence differences as being more informative. Note that these nodes are not particularly challenging samples, given that at least one of the tasks can correctly identify them. To enhance the diversity of selected nodes and handle the influence of neighbors that have already been labeled for selection, we re-aggregate the intermediate embedding by masking the feature of labeled neighbors and keeping a count of them. The former filters the previously labeled representative features, while the latter reduces the overall neighborhood information based on labeled status, thereby emphasizing the feature of the central nodes. The main contributions of this paper are summarized as follows:

•

We propose MITIGATE, a novel multitask active graph anomaly detection framework to detect anomalies within a limited labeling budget, which actively queries nodes with the guidance of external supervision signals.
•

To query more valuable nodes, we devise a dynamic strategy to measure the informativeness and representativeness of nodes according to the training and labeling status.
•

We conduct comprehensive experiments on four datasets to verify the effectiveness of the proposed method.

2. Problem Definition

In this section, we formulate the problem of active learning for graph anomaly detection.

Let $\mathcal{G}=(\mathcal{V},\mathbf{A},\mathbf{X})$ denotes an attributed graph, where $\mathcal{V}=\left\{v_{1},v_{2},\cdots,v_{n}\right\}$ is the set of nodes, $\mathbf{A}\in{\{0,1\}}^{n\times n}$ is the adjacency matrix and $\mathbf{X}\in\mathbb{R}^{n\times k}$ is the node attribute matrix. Note that anomaly labels are rare in the real world, but a portion of class labels are readily accessible. We denote the set of nodes labeled for classification as $\mathcal{V}_{L}^{N}$ , and their corresponding labels are denoted by $\mathbf{Y}^{N}_{L}\in\mathbb{R}^{n\times C}$ , with $C$ representing the number of classes. $\mathbf{y}_{i}^{N}$ is the one-hot label of $v_{i}$ . $\mathbf{Y}_{L}^{t}\in\{0,1\}$ is the labels for anomaly detection at the $t$ -th iteration, i.e., normal or anomalous. $y_{i}^{A}$ is the labels of $v_{i}$ for anomaly detection. We initialize a set of nodes with classification labels, i.e., $\mathcal{V}_{L}^{0}=\mathcal{V}_{L}^{N}$ , and regard them as normal nodes, $\mathbf{Y}_{L}^{0}=\{0\}$ . The key notations are summarized in Appendix A.

Given an attributed graph $\mathcal{G}$ , a query strategy $\mathcal{Q}$ , labeling budget $\mathcal{B}$ , the goal of an AL-based anomaly detection algorithm is to select a subset of nodes denoted as $\mathcal{S}^{t}$ from the unlabeled node set $\mathcal{V}_{U}^{t-1}$ , and label them in a way that minimizes the loss of the model $\mathcal{M}$ :

(1)

\underset{\mathcal{V}_{L}^{t}}{\mathrm{min}}\ \mathcal{L}(\mathcal{M},\mathcal% {Q}|\mathcal{G},\mathbf{Y}^{t}_{L},\mathbf{Y}^{N}_{L}),

where $\mathcal{V}_{L}^{t}=\mathcal{V}_{L}^{t-1}\cup\mathcal{S}^{t}$ and $\mathcal{V}_{U}^{t}=\mathcal{V}_{U}^{t-1}\setminus\mathcal{S}^{t}$ are the labeled and unlabeled sets after the $t$ -th selection. $t\in\{1,2,...,\mathcal{B}/b\}$ and $b$ is the budget in each iteration. Then, $\mathcal{G}$ , $\mathbf{Y}^{N}_{L}$ , and labels ${y}_{i}^{A}$ for $v_{i}\in\mathcal{V}_{L}^{t}$ are used to train the model $\mathcal{M}$ . For convenience, we define the labeling budget as the maximum number of nodes allowed to be labeled.

Refer to caption — Figure 1. An illustration of the proposed MITIGATE. For a graph $\mathcal{G}$ with partial classification labels, MITIGATE employs a GNN encoder to generate node representations $\mathbf{H}$ , and then employs a node classifier and an anomaly score predictor. In each selection iteration, MITIGATE assesses the representativeness and informativeness of nodes using a distance-based clustering and confidence difference across tasks for anomaly detection, respectively. Then, it picks $b$ nodes from clustering centers with high informative scores and queries an oracle to identify whether they are anomalies or not. Finally, the queried set will be incorporated into the labeled set, and continue training of the model. (a) A 2-dimensional confidence difference space. (b) An example of candidates’ confidence difference. We select the candidates with high confidence difference (i.e., in the upper left corner and lower right corner).

3. Method

In this section, we introduce the proposed MITIGATE framework. Firstly, we give an overview of the whole framework. Then, we elaborate on the selection strategy including the calculation of the distance features for clustering and confidence difference across tasks in Section 3.2. Finally, we introduce the training process of MITIGATE in detail in Section 3.3.

3.1. Overview

Figure 1 illustrates the pipeline of MITIGATE. It utilizes a shared encoder for node representation learning and two decoders for node classification and anomaly score prediction, respectively. Considering the multitask structure, we devise a node informativeness metric based on the confidence difference across tasks. To reduce the initial performance gap, we incorporate classification uncertainty into the informativeness score measurement. To promote diversity in node selection at each step, we employ the K-Medoids algorithm, which treats cluster centers as representative samples, with a novel distance measurement based on masked aggregation. From these centers, we select $b$ nodes with the highest informativeness scores. We then provide the set of selected nodes to an oracle and obtain their labels (e.g., normal or anomalous) for the anomaly detection task. Finally, we combine the selected set with the training set and continue training the model. The details of the encoder, node classifier, and anomaly score predictor are as follows:

3.1.1. Encoder

Due to both the node classifier and anomaly score predictor needing an encoder to reflect the graph topology and node attributes into a latent space, we adopt graph convolutional networks (GCNs) (kipf2016semi, ) to learn expressive node representations, and the layer-wise propagation is defined as:

(2)

\mathbf{H}^{(l+1)}=\sigma(\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}% \tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{H}^{(l)}\mathbf{W}^{(l)}),

where $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ and $\tilde{\mathbf{D}}_{ii}=\sum_{j}\tilde{\mathbf{A}}_{ij}$ , $\mathbf{I}$ is the identity matrix and $\mathbf{W}^{(l)}$ is the weight matrix at the $l$ -layer.

3.1.2. Node classifier

We use a graph convolutional layer as the node classifier to further preserve the structural information of intermediate node representations as follows:

(3)

\mathbf{Z}=\sigma(\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{% \mathbf{D}}^{-\frac{1}{2}}\mathbf{H}\mathbf{W}^{N}),

where $\mathbf{H}$ is the final output of the encoder and $\mathbf{W}^{N}$ is the weight matrix for the node classifier. Due to anomalies being more likely to be uncertain in classification, we use entropy as the measure of anomaly probability, where a higher entropy score indicates a greater probability of being anomalous. The anomaly score of $v_{i}$ from the node classifier is

(4)

e_{i}=-\sum_{j}^{C}\mathbf{z}_{ij}\cdot{\log\mathbf{z}_{ij}}.

3.1.3. Anomaly score predictor

The anomaly score predictor is built with a linear transformation along with a sigmoid function based on shared node representations:

(5)

\mathbf{p}=\mathrm{Sigmoid}(\mathbf{W}^{A}\mathbf{H}+b^{A}),

where $\mathbf{p}\in\mathbb{R}^{n}$ is the predicted anomaly scores, $W^{A}\in\mathbb{R}^{1\times n}$ is the weight matrix, and $b^{A}\in\mathbb{R}$ is corresponding bias term.

3.1.4. Hybrid anomaly score

Given that both the node classifier and anomaly score predictor possess the ability to detect anomalies, we adopt a weighted score function to combine the standard scores of two predictions as follows:

(6)

\mathbf{s}=Norm(\mathbf{e})+\phi\cdot Norm(\mathbf{p}),

where $\mathbf{s}\in\mathbb{R}^{n}$ is the overall anomaly score, $Norm(\mathbf{e})=\frac{\mathbf{e}-{\mu}_{e}}{{\sigma}_{e}}$ , ${\mu}_{e}$ is the mean, ${\sigma}_{e}$ is the standard deviation, and $\phi$ is a hyperparameter to balance the importance of two predictions.

3.2. Node Selection

To benefit the overall performance for anomaly detection of the unified framework, we measure the value of nodes in terms of representativeness with distance-based clustering and informativeness with confidence difference.

3.2.1. Distance-based Clustering

To discover representativeness samples from the huge unlabeled data pool, we devise a masked aggregation mechanism for generating distance features that consider representations in latent space and features in the previously labeled set. Several methods (cai2017active, ; gao2018active, ) adopt the Euclidean distance to measure the distance between representations $\mathbf{H}$ when performing clustering. This approach treats information equally among neighbors due to the mean aggregation mechanism in the GCN layer. However, in the distance-based node selection, the distance features should be impacted by the current labeled status in the neighborhood. Specifically, the chosen nodes exhibit representational features. This is one key reason for their selection during previous iterations. Thus, directly aggregating these features may affect the representativeness of central nodes. Therefore, we derive distance features through a masked aggregation mechanism, which considers labeled status in the neighborhood. Initially, in order to mitigate the impact of features pertaining to labeled neighbors, their representations will be masked during the summation of neighborhood information. Furthermore, to accentuate the distinctive features inherent to unlabeled nodes, we calculate the mean of neighborhood information according to the number of neighbors rather than the number of unlabeled neighbors. It suggests that in cases where more neighbors have been annotated, the influence of neighborhood information on the central node features will be diminished. In the $t$ -th selection, the distance features can be formulated as follows:

(7)

\hat{\mathbf{h}}^{t}_{i}=\frac{\mathrm{SUM}(\mathbf{h}_{j},\forall v_{j}\in% \mathcal{N}(v_{i})\cap\mathcal{V}_{U}^{t-1})}{|\mathcal{N}(v_{i})|}+\mathbf{h}% _{i},

where $\mathcal{N}(v_{i})$ is the neighborhood of $v_{i}$ . Therefore, the distance between $v_{i}$ and $v_{j}$ can be calculated as follows:

(8)

dist(v_{i},v_{j})={||\hat{\mathbf{h}}^{t}_{i}-\hat{\mathbf{h}}^{t}_{j}||}_{2}.

To this end, we combine the node features and labeled status of neighbors in the distance function, which can decrease the selection probability due to the high representative of neighbors rather than itself. After calculating the pairwise distance, we adopt K-Medoids clustering as previous studies (wu2019active, ; liu2022lscale, ), in which centers chosen for the candidate set must be real nodes within the graph. Additionally, the number of clusters is set to $m$ . We do not focus on querying more anomalies but on learning a predictive model to effectively distinguish normal and anomalous nodes from the aspects of classification uncertainty and anomaly score.

3.2.2. Confidence Difference

As both the node classifier and anomaly score predictor can identify anomalies, we give a definition for the model confidence in anomaly detection. For the node classifier, the entropy of predictions is used to justify whether a sample is anomalous as Eq. (4). The higher entropy score indicates a higher confidence for a node to be classified as an anomaly. Also, for the anomaly score predictor, the anomaly score is the indicator to describe the level of confidence. The confidence of the node classifier, denoted as $\mathbf{c}^{N}\in\mathbb{R}^{n}$ , and the anomaly score predictor, denoted as $\mathbf{c}^{A}\in\mathbb{R}^{n}$ , are described as follows:

(9)

\mathbf{c}^{N}\propto\mathbf{e},\ \mathbf{c}^{A}\propto\mathbf{p}.

Note that the node classifier and anomaly detector do not always perform equally on the same nodes for anomaly detection at the same stage. For example, a node may receive conflicting predictive discrimination from the node classifier and anomaly detector. Specifically, it has a lower value on classification entropy, which indicates it is likely to be a normal node, while it has a higher anomaly score, which means it is more likely to be anomalous. To eliminate the influence of scale across different tasks, we normalize the entropy of classification and the anomaly scores using Z-scores. The confidence of the node classifier and anomaly score predictor can be reformulated as:

(10)

\mathbf{c}^{N}=Norm(\mathbf{e}),\ \mathbf{c}^{A}=Norm(\mathbf{p}),

where a higher value of $c_{i}^{N}$ and $c_{i}^{A}$ indicates a lower confidence of $v_{i}$ being normal and a higher confidence of being anomalous. Based on this, we can employ the Manhattan distance to quantify the confidence difference as follows:

(11)

\mathbf{d}={|\mathbf{c}^{A}-\mathbf{c}^{N}|}.

The high confidence difference $\mathbf{d}$ indicates the controversy between two decoders. It’s important to note that these samples with conflicting predictions in binary classification are not challenging for model training since one of the tasks can effectively handle them. To align the decoders and facilitate consistent prediction, human experts can provide coherent information by labeling the nodes with high confidence differences. The visualization of confidence differences is shown in Figure 1(a).

Algorithm 1 MITIGATE

0: Graph

\mathcal{G}=(\mathcal{V},\mathbf{A},\mathbf{X})

, query batch size

b

, total budget

\mathcal{B}

, labeled set for node classification

\mathcal{V}^{L}_{N}

, number of clusters

m

0: Anomaly scores

\mathbf{s}

\mathcal{V}^{0}_{L}=\mathcal{V}^{N}_{L}

\mathcal{V}_{U}^{0}=\mathcal{V}\setminus\mathcal{V}^{N}_{L}

;

2: for

t=1,2,...,\mathcal{B}/b

\mathcal{M}=\mathrm{train}(\mathcal{G},\mathcal{V}^{t-1}_{L},\mathcal{V}^{L}_{% N})

;

4: Calculate distance features

\hat{\mathbf{H}}^{t}

with masked agg. by Eq. (7);

5: Calculate

dist(v_{i},v_{j})

for each node in unlabeled set

\mathcal{V}^{t-1}_{U}

;

6: Cluster

\mathcal{V}^{t-1}_{U}

by K-Medoids algorithm with

m

clusters;

7: Calculate confidence difference

\mathbf{d}

and informativeness scores

Info

by Eq. (11) and Eq. (12);

8: Select the top

b

clustering centers as

\mathcal{S}^{t}

Info

;

9: Query an Oracle to obtain labels for

\mathcal{S}^{t}

;

10:

\mathcal{V}^{t}_{L}=\mathcal{V}^{t-1}_{L}\cup\mathcal{S}^{t}

\mathcal{V}^{t}_{U}=\mathcal{V}^{t-1}_{U}\setminus\mathcal{S}^{t}

;

11: end for

12:

\mathcal{M}=\mathrm{train}(\mathcal{G},\mathcal{V}^{\mathcal{B}/b}_{L},% \mathcal{V}^{L}_{N})

;

13: Calculate the overall anomaly scores

\mathbf{s}

by Eq. (6);

3.2.3. Selection

To judiciously select samples suitable for model training and mitigate the influence of the absence of positive samples, we introduce a time-sensitive informativeness measurement. Firstly, as each node can exclusively belong to either the normal or anomalous category, the nodes predicted to have conflicting labels are more likely to contain crucial information essential for one of the tasks. Moreover, due to the scarcity of positive class samples during the initial training stages, anomalous patterns serve as more informative elements for the unified framework. Importantly, even with a subset of class labels, the node classifier can provide an initial prediction for anomalies, which is essential for the anomaly score predictor. We adopt an exponentially decaying parameter $\tau$ to dynamically combine the entropy scores of node classification and confidence difference across tasks during selection (liu2022lscale, ). The informativeness score $Info\in\mathbb{R}^{n}$ is defined as follows:

(12)

Info=\tau^{|\mathcal{V}_{L}^{t}|}{Norm}(\mathbf{e})+(1-\tau^{|\mathcal{V}_{L}^% {t}|}){\mathbf{d}},

where $|\mathcal{V}_{L}^{t}|$ is the number of selected nodes in the $t$ -th iteration, and $\tau$ can be set as a number close to 1.0, e.g., 0.99. In all, the informativeness score is initially influenced more by anomalies, and as training progresses, it shifts the emphasis toward identifying nodes with prediction conflicts from a model-centric perspective.

Recognizing that both the informativeness and representativeness indicate the value of a node, during each iteration, MITIGATE first utilizes the aforementioned distance-based clustering algorithm to choose a subset of high representative nodes, denoted as the candidate set $\mathcal{V}_{C}^{t}$ . Subsequently, a set of $b$ nodes $\mathcal{S}^{t}$ are chosen from $\mathcal{V}_{C}^{t}$ according to their informativeness score. Algorithm 1 describes the selection process together with the model training.

3.3. Model Training

After each iteration of querying, MITIGATE is trained continually by optimizing from three aspects. First, we calculate the cross-entropy loss on the pre-labeled nodes $\mathcal{V}_{L}^{N}$ for node classification, which will not be affected by selection.

(13)

\mathcal{L}_{nc}=-\frac{1}{|\mathcal{V}^{N}_{L}|}\sum_{v_{i}\in\mathcal{V}^{N}% _{L}}\sum_{j=0}^{C}y^{N}_{ij}\log z_{ij}.

where $\mathbf{y}^{N}_{i}$ is the one-hot label of node $v_{i}$ . Next, we employ a weighted binary cross-entropy loss, the widely used supervised loss for imbalanced data, on the labeled set $\mathcal{V}_{L}^{t}$ for anomaly detection at each iteration.

(14)

\mathcal{L}_{ad}=-\frac{1}{|\mathcal{V}^{t}_{L}|}\sum_{v_{i}\in\mathcal{V}^{t}% _{L}}(\gamma^{t}y^{A}_{i}\log p_{i}+(1-y^{A}_{i})\log(1-p_{i})),

where $\gamma^{t}$ is the ratio of anomaly to normal nodes in $\mathcal{V}^{t}_{L}$ .

Note that the node classifier and anomaly score predictor rely on different types of annotations, i.e., the node classifier needs class information while the anomaly score predictor only needs to know whether a node is an anomaly or not. To leverage the information for the classification task from the queried set, in which nodes are only annotated as normal nodes or anomalies, we optimize the uncertainty of classification predictions on the whole labeled set, which can be formulated as:

(15)

\mathcal{L}_{un}=-\frac{1}{|\mathcal{V}^{t}_{LN}|}\sum_{v_{i}\in\mathcal{V}^{t% }_{LN}}\sum_{k}^{C}z_{ik}\log z_{ik}+\frac{1}{|\mathcal{V}^{t}_{LA}|}\sum_{v_{% j}\in\mathcal{V}^{t}_{LA}}\sum_{k}^{C}z_{jk}\log z_{jk},

where $\mathcal{V}^{t}_{LN}$ and $\mathcal{V}^{t}_{LA}$ denotes the normal and anomalous node set at the $t$ -th iteration. In Eq. (15), we minimize the predicted uncertainty for normal nodes (the first term), while maximizing that for anomalies (the second term).

In all, the overall loss function of MITIGATE can be formulated as Eq. (16), where $\alpha$ and $\beta$ are weighting parameters.

(16)

\mathcal{L}=\alpha\cdot\mathcal{L}_{nc}+\beta\cdot\mathcal{L}_{ad}+\mathcal{L}% _{un}.

4. Experiments

In this section, we extensively compare MITIGATE with state-of-the-art methods for anomaly detection.

4.1. Experiments Settings

4.1.1. Datasets

In our experiments, we adopt four widely used datasets with ground-truth labels for node classification. As there is no ground truth of anomaly detection, we inject contextual and structural anomalies following the previous studies (ding2019deep, ; luo2022comga, ; liu2022bond, ) to evaluate the effectiveness for anomaly detection of our method. We use two citation networks, Cora and Citeseer (sen2008collective, ), and two social networks, BlogCatalog and Flickr (tang2009relational, ). The details of the datasets are described in Appendix B.1. For each dataset, we randomly sample 500 nodes as a validation set and 1000 nodes as a test set and fix them for all methods for a fair comparison.

4.1.2. Baselines

We compare MITIGATE with three types of baselines, including (1) out-of-distribution (OOD) detection methods, including GCN-ENT (kipf2016semi, ), GKDE (zhao2020uncertainty, ), OODGAT-ENT and OODGAT-ATT (song2022learning, ), (2) semi-supervised anomaly detection methods, including FdGars (wang2019fdgars, ), GeniePath (liu2019geniepath, ), BWGNN (tang2022rethinking, ), and DAGAD (liu2022dagad, ), and (3) active query strategy for anomaly detection, including most positive query (GCN-Pos), positive diverse query (GCN-PosD) and diverse query (li2023deep, ) (GCN-Div). The details of these methods and their implementation are in Appendix B.2.

Table 1. Overall performance comparison in AUC-ROC(%) and AUC-PR(%) on four datasets. The best results are highlighted in bold, and the second best is underlined.

	Cora		Citeseer		BlogCatalog		Flickr
Method	AUC-ROC	AUC-PR	AUC-ROC	AUC-PR	AUC-ROC	AUC-PR	AUC-ROC	AUC-PR
GCN-ENT	64.82 $\pm$ 1.07	10.48 $\pm$ 0.51	69.36 $\pm$ 0.92	9.30 $\pm$ 0.98	62.57 $\pm$ 0.91	11.70 $\pm$ 1.99	64.62 $\pm$ 0.77	9.51 $\pm$ 1.03
GKDE	72.25 $\pm$ 0.85	15.21 $\pm$ 1.79	74.87 $\pm$ 0.61	17.07 $\pm$ 1.55	44.56 $\pm$ 0.07	5.29 $\pm$ 0.02	44.81 $\pm$ 0.09	5.05 $\pm$ 0.03
OODGAT-ENT	69.70 $\pm$ 3.50	15.11 $\pm$ 6.65	70.01 $\pm$ 4.45	12.22 $\pm$ 9.96	56.52 $\pm$ 2.24	8.94 $\pm$ 1.55	53.52 $\pm$ 1.83	7.60 $\pm$ 0.63
OODGAT-ATT	50.76 $\pm$ 4.71	5.33 $\pm$ 0.73	55.33 $\pm$ 4.95	6.71 $\pm$ 1.63	49.51 $\pm$ 4.18	6.13 $\pm$ 1.17	51.17 $\pm$ 1.24	6.46 $\pm$ 0.93
FdGars	58.55 $\pm$ 6.22	6.51 $\pm$ 1.22	54.61 $\pm$ 9.09	5.12 $\pm$ 1.55	51.99 $\pm$ 3.38	6.01 $\pm$ 0.24	57.95 $\pm$ 6.61	13.97 $\pm$ 8.81
GeniePath	54.42 $\pm$ 6.41	6.41 $\pm$ 0.35	48.89 $\pm$ 4.94	4.94 $\pm$ 0.32	50.25 $\pm$ 1.81	5.80 $\pm$ 0.20	50.26 $\pm$ 1.21	5.97 $\pm$ 0.14
BWGNN	64.77 $\pm$ 0.20	9.44 $\pm$ 0.16	64.42 $\pm$ 0.60	6.36 $\pm$ 0.10	60.18 $\pm$ 1.25	8.94 $\pm$ 0.23	52.22 $\pm$ 1.14	5.96 $\pm$ 0.12
DAGAD	58.30 $\pm$ 0.30	19.17 $\pm$ 4.83	61.52 $\pm$ 3.21	14.10 $\pm$ 1.82	56.52 $\pm$ 2.89	9.03 $\pm$ 0.92	63.22 $\pm$ 0.33	11.66 $\pm$ 4.40
GCN-Pos	57.22 $\pm$ 3.54	9.90 $\pm$ 0.83	59.78 $\pm$ 4.78	8.09 $\pm$ 1.37	62.52 $\pm$ 2.70	9.96 $\pm$ 1.93	55.56 $\pm$ 2.78	7.03 $\pm$ 1.93
GCN-PosD	62.24 $\pm$ 2.07	12.00 $\pm$ 1.11	61.54 $\pm$ 0.68	7.54 $\pm$ 0.21	62.42 $\pm$ 1.63	9.79 $\pm$ 1.77	57.24 $\pm$ 4.85	10.39 $\pm$ 4.86
GCN-Div	61.23 $\pm$ 2.88	9.05 $\pm$ 1.00	56.54 $\pm$ 7.19	10.55 $\pm$ 3.42	65.51 $\pm$ 3.86	11.45 $\pm$ 1.50	66.89 $\pm$ 2.02	10.44 $\pm$ 1.76
MITIGATE-A	73.59 $\pm$ 3.53	17.00 $\pm$ 2.55	71.25 $\pm$ 3.00	20.21 $\pm$ 6.48	63.94 $\pm$ 3.08	12.40 $\pm$ 2.50	68.89 $\pm$ 3.37	15.69 $\pm$ 1.72
MITIGATE-E	72.81 $\pm$ 1.52	15.98 $\pm$ 1.95	74.12 $\pm$ 1.53	14.60 $\pm$ 1.97	63.00 $\pm$ 2.10	12.09 $\pm$ 2.67	68.26 $\pm$ 3.01	14.18 $\pm$ 3.83
MITIGATE	75.45 $\pm$ 2.27	18.80 $\pm$ 2.46	78.03 $\pm$ 0.94	23.32 $\pm$ 6.63	66.20 $\pm$ 3.04	13.60 $\pm$ 2.74	70.16 $\pm$ 3.00	17.33 $\pm$ 1.95

4.1.3. Implementation Detail

In our experiments, we set the max budget $\mathcal{B}=80$ , which means 80 nodes can be labeled, and we set $b=4$ at each iteration. We assume $20k$ nodes have been annotated with classification labeled, where $k$ is the number of classes, and no anomalies as initial. For MITIGATE, we adopt Adam optimizer and the learning rate is set to 0.01. As to $\alpha$ and $\beta$ in the overall loss function (Eq. (16)), $\tau$ in the informativeness score function (Eq. (12)), the number of clusters $m$ in K-Medoids algorithm, we tune the hyperparameters and select the best-performing results according to the validation set, the details are shown in Appendix B.3. We set the maximum epochs of training in each iteration to 300 and perform early-stopping when ( $AUCROC+Accuracy$ ) stops to increase for 20 epochs. $Accuracy$ is the accuracy of the node classifier on in-distribution nodes and $AUCROC$ is the performance evaluation of the anomaly score predictor. We implement two variants of MITIGATE, namely MITIGATE-A and MITIGATE-E. MITIGATE-A uses predicted anomaly scores, while MITIGATE-E employs the entropy of classification as the final score for anomaly detection.

We adopt two widely used complementary measures in previous studies, including the Area Under Receiver Operating Characteristic Curve (AUC-ROC) and Area Under Precision-Recall Curve (AUC-PR). To mitigate results randomness, we run the proposed method and baselines 5 times with different random seeds and record the average scores and standard deviation.

4.2. Evaluation Results

We evaluate the anomaly detection performance of MITIGATE and all compared methods mentioned in Section 4.1.2 on the four datasets. The results are shown in Table 1 and Figure 2.

4.2.1. Overall Comparison

We evaluate the overall performance when the labeling budget $\mathcal{B}$ is set to 80 (i.e., a maximum of 80 nodes can be labeled). The corresponding AUC-ROC and AUC-PR are reported in Table 1. We observe that MITIGATE significantly outperforms other methods under most metrics. (1) Comparison with classification methods. GKDE and OODGAT are two state-of-the-art methods for OOD detection, which only utilize a portion of labeled in-distribution nodes for training. Different from them, our MITIGATE adopts anomalies (i.e., out-of-distribution nodes) in the training process to better learn a decision boundary for anomaly detection. Experimental results demonstrate that MITIGATE outperforms these two methods, with at least 3.2% and 3.6% improvement in AUC-ROC and AUC-PR, respectively. (2) Comparison with semi-supervised anomaly detection methods. FdGars, GeniePath, BWGNN, and DAGAD are superior methods for node anomaly detection. However, these methods heavily rely on labeled data. When the labeling budget is low, which may lead to the absence of anomalies, these methods become frangible. It is evident that MITIGATE outperforms them, particularly in terms of AUC-ROC. (3) Comparison with query strategies for anomaly detection. These query strategies focus on selecting anomalies or high-uncertainty samples in anomaly detection. Though they may choose more anomalies, their performance remains suboptimal due to anomalies not always contributing significantly to the model and the potential challenges posed by uncertain nodes. More specifically, an anomalous sample, sharing similarities with known anomalous patterns, can yield a high anomaly score but may not provide substantially novel information to the model. Besides, hard samples that contain noisy information for model training also exhibit high uncertainty, and selecting such samples potentially hinders the model performance. Unlike them, MITIGATE selects samples based on the confidence difference across tasks. This implies that the chosen samples are informative for at least one task and are not excessively challenging, as one of the tasks can accurately identify them. (4) Comparison with MITIGATE varients. MITIGATE-A, MITIGATE-E, and MITIGATE differ in their final score for anomaly detection. It is observed that the results of using a single indicator do not differ significantly in identifying anomalies in most cases, whereas the weighted sum of these two components consistently offers discernible advantages. This is because MITIGATE-A and MITIGATE-E detect anomalies from a single aspect, namely the uncertainty of classification and the learnable anomaly scores, respectively, without incorporating external information from other facets. Moreover, this observation highlights the effectiveness of integrating classification tasks and anomaly detection tasks in the process of identifying anomalies.

4.2.2. Performance under Different Budget

We evaluate the performance of our proposed MITIGATE and several active query strategies for anomaly detection over the different numbers of labeled nodes for training. The results are shown in Figure 2. Compared with the other baselines, we can observe that MITIGATE achieves the best performance in AUC-ROC under each labeling budget in most of the compared settings. In particular, to achieve the AUC-ROC of 66.8% on Flickr, GCN-Div labels 80 nodes while MITIGATE only labels 16 nodes. We contribute the effectiveness of MITIGATE at a low labeling budget with the classification task, which can provide initial discrimination for anomalies and alleviate the absence of anomalies problem.

4.3. Parameter Analysis

We further analyze the majority of parameters in MITIGATE on Citeseer. The results are shown in Figure 3 and Figure 4.

4.3.1. Impacts of $\alpha$ and $\beta$

We analyze the impact of varying weights in the overall loss function as shown in Eq. (16). These weights are crucial in achieving a balance among the three components of the loss, namely the node classification loss, anomaly detection loss, and uncertainty loss. We evaluate MITIGATE for $\alpha\in\{0.4,0.5,0.8,1.0,1.25,2,2.5\}$ and $\beta\in\{0.4,0.5,0.8,1.0,1.25,2,2.5\}$ and report the average results of 5 runs in Figure 3. We can observe that lower values of $\alpha$ and higher values of $\beta$ lead to better performances. One possible reason is that the category characteristics in Citeseer are easy to differentiate. It further suggests that adjusting $\alpha$ and $\beta$ properly can bring more benefits to the overall performance.

4.3.2. Impacts of $\phi$

The weight term $\phi$ is essential in calculating the final anomaly scores by balancing the importance of classification uncertainty and learnable anomaly scores in Eq. (6). Figure 4(a) presents the AUC-ROC and AUC-PR of MITIGATE when varying $\phi\in\{1.6,1.8,2,2.2,2.4\}$ . It is evident that although both terms are capable of identifying anomalies, their contributions are not equal. The optimal weight ratio between classification uncertainty and learnable anomaly score for the final anomaly score is 1:2 (i.e., $\phi=2$ ), indicating the anomaly predictor exerts a more pronounced influence in identifying anomalies.

4.3.3. Impacts of $m$

As clustering plays a pivotal role in choosing representative nodes from the unlabeled set, we investigate the performance of MITIGATE by varying the number of clusters $m$ from 1 to 6 times the class number. As shown in Figure 4(b), we can observe that the optimal value of $m$ tends to be near $4k$ , which leads to the optimal AUC-ROC and AUC-PR.

Table 2. Ablation study on Citeseer and Flickr. The best results are highlighted in bold.

Dataset	Variants	AUC-ROC	AUC-PR
Citeseer	w/o uncertainty loss	72.33 $\pm$ 1.55	10.80 $\pm$ 1.71
	w/o entropy score	76.35 $\pm$ 3.77	15.75 $\pm$ 4.77
	w/o confidence difference	73.62 $\pm$ 1.77	14.79 $\pm$ 3.85
	w/o masked aggregation	71.95 $\pm$ 4.94	11.81 $\pm$ 2.94
	w/o clustering	69.33 $\pm$ 1.87	21.22 $\pm$ 1.69
	MITIGATE	78.03 $\pm$ 0.94	23.32 $\pm$ 6.63
Flickr	w/o uncertainty loss	66.13 $\pm$ 0.95	10.43 $\pm$ 1.45
	w/o entropy score	65.51 $\pm$ 3.12	10.06 $\pm$ 1.82
	w/o confidence difference	64.85 $\pm$ 1.62	12.76 $\pm$ 1.63
	w/o masked aggregation	67.57 $\pm$ 3.58	12.62 $\pm$ 2.44
	w/o clustering	68.48 $\pm$ 2.04	12.45 $\pm$ 3.01
	MITIGATE	70.16 $\pm$ 3.00	17.33 $\pm$ 1.95

4.4. Ablation Study

We conduct an ablation study to examine the contribution of each key component in the proposed framework on Citeseer and Flickr. The results are shown in Table 2.

•

w/o uncertainty loss: it removes the uncertainty loss on selected nodes in Eq. (16).
•

w/o entropy score: it removes the classification uncertainty score in informativeness measurement in Eq. (12).
•

w/o confidence difference: it removes the confidence difference between tasks in informativeness measurement in Eq. (12).
•

w/o masked aggregation: it replaces distance features calculated through masked aggregation as Eq. (7) with the representation in latent space obtained from Eq. (2).
•

w/o clustering: it removes the distance-based clustering, which aims to discover representative nodes.

4.4.1. Uncertainty Loss

As shown in Table 2, we see that MITIGATE notably outperforms MITIGATE w/o uncertainty loss, exhibiting improvements of 5.7% and 12.5% in terms of AUC-ROC and AUC-PR, respectively. It indicates that incorporating the uncertainty loss from the classification perspective on selected nodes, which are exclusively labeled as normal or anomalous, can substantially improve the anomaly detection performance of MITIGATE.

4.4.2. Strategy of Node Selection

To verify the effectiveness of the proposed selection strategy, we conduct ablation tests on four variants. First, compared with MITIGATE w/o entropy score and MITIGATE w/o confidence difference, which remove partial of the informativeness score respectively, MITIGATE achieves the best performance. It indicates that both the classification uncertainty and confidence difference contribute to the node informativeness, and the dynamic combination of them is also effective. This is expected since the uncertainty loss is not a direct supervision for the node classifier and the performance of the anomaly score predictor performs worse at the beginning without anomalous samples. Besides, the comparison between MITIGATE w/o masked aggregation and MITIGATE w/o clustering indicates that the clustering makes a minimal or even negative contribution without efficient distance measurement. For instance, MITIGATE w/o clustering surpasses MITIGATE w/o masked aggregation by approximately 10% in terms of AUC-PR on Citeseer. Furthermore, when comparing MITIGATE and MITIGATE w/o masked aggregation, it is obvious that the proposed masked aggregation effectively improves the discriminative capability in anomaly detection. We argue that by masking representations for previously labeled nodes within the neighborhood, more representative nodes are selected in relation to both labeled and unlabeled sets.

5. Related Work

5.1. Active Learning for Anomaly Detection

Active learning aims to interactively select samples from unlabeled data to maximize model performance within limited labeling budgets. It has been extensively studied in the field of anomaly detection (gornitz2013toward, ; ghasemi2011active, ; das2016incorporating, ; zha2020meta, ). In contrast to traditional active learning, active anomaly detection focuses on discovering more anomalous samples. Several methods incorporate active queried data with the unsupervised learning paradigm in the training process. For instance, (gornitz2013toward, ) suggests querying samples close to the decision boundary, which means more uncertainty. (ghasemi2011active, ) suggests querying samples based on density to ensure a diverse distribution of predicted anomalous data for querying. Devising an effective query strategy to discover more anomalies becomes an important problem. (das2016incorporating, ; ning2022deep, ) aim to query the most anomalous sample according to predicted scores, while (das2019active, ) incorporates the density information with anomaly score. However, these greedy strategies prioritize short-term gains and may yield suboptimal results. To address this issue, deep reinforcement learning has been introduced to the design of query strategies. For example, Meta-AAD (zha2020meta, ) trains a meta-policy using auxiliary labeled datasets so that it can be directly applied to new unlabeled datasets without further tuning. In our work, we focus on improving the predictive performance for graph anomaly detection by leveraging corresponding auxiliary tasks to help the model training and sample querying processes.

5.2. Graph Anomaly Detection

Graph anomaly detection methods (wang2019fdgars, ; wang2019semi, ; tang2022rethinking, ; chai2022can, ) have been extensively studied in recent years, which assumes that a set of nodes have been labeled. Motivated by general GNN algorithms, several graph anomaly detection methods based on redesigned message passing and aggregation mechanisms have been proposed. For example, FdGars (wang2019fdgars, ) utilizes GCN to combine the characteristics of reviewers and their relationships. Semi-GNN (wang2019semi, ) employs hierarchical attention to model the multi-view graph for fraud detection. GraphConsis (liu2020alleviating, ) proposes to filter inconsistent neighbors in aggregation to maintain the unique semantic characteristics of the target node. Moreover, the imbalance between the majority and minority classes is another essential problem in anomaly detection. To alleviate this problem, PC-GNN (liu2021pick, ) incorporates label distribution information to sample neighbors in the aggregation process, while DAGAD (liu2022dagad, ) devises a graph data augmentation module to fertilize training set with generated samples. Different from the aforementioned spatial methods, BWGNN (tang2022rethinking, ) leverages spectral distribution information to capture graph anomalies with high-frequency and AMNet (chai2022can, ) adaptively combines signals of low-frequency and high-frequency to learn the node embedding for distinguishing the anomalous nodes. However, these methods may not work well when the labeling budget is relatively low since they highly rely on the initial labeled set, i.e., at least one negative sample needs to be labeled. Whereas, we focus on actively selecting nodes for annotation during the training process, which means the selected labeled nodes can bring more benefits to the target model.

5.3. Out-of-distribution Detection

Out-of-distribution (OOD) detection aims to distinguish samples drawn from a distribution different from the labeled in-distribution samples (zhou2021step, ). Extensive studies (liang2018enhancing, ; yu2019unsupervised, ; sehwag2020ssd, ; zhou2021contrastive, ) have been proposed for OOD detection on structured data, such as text and images. However, graph data contains not only structured attributes but also non-Euclidean topology structures. Consequently, several graph-specific OOD detection methods have been proposed, including uncertainty-based methods (zhao2020uncertainty, ; stadler2021graph, ) and graph learning-based methods (song2022learning, ; huang2022end, ). Uncertainty-based methods aim to use the confidence of a well-trained model on partial ID samples. GKDE (zhao2020uncertainty, ) adopts several uncertainty estimates-based metrics for OOD detection and finds the vacuity-based model has optimal performance, while GPN (stadler2021graph, ) devises an input-independent Bayesian update rule to model uncertainty on predicted categorical distribution. Besides, graph learning-based methods argue that the OOD samples can affect ID samples under the message-passing mechanism. For example, OODGAT (song2022learning, ) explicitly models the interaction between inliners and outliers, while LMN (huang2022end, ) learns to mix neighbors to mitigate the influence from OOD nodes. However, these methods have a limitation in that they cannot effectively utilize the supervision information from OOD samples.

6. Conclusion

In this paper, we proposed MITIGATE, a novel multitask active learning framework for graph anomaly detection within a limited labeling budget. MITIGATE comprises a node classifier and an anomaly score predictor, with the primary goal of enhancing the overall performance of anomaly detection through the selection of the queried set. Besides, the masked aggregation mechanism for distance features proves instrumental in selecting representative nodes. Additionally, node informativeness is dynamically measured by the confidence difference across tasks and classification uncertainty. Experimental results on four datasets demonstrate the effectiveness of MITIGATE compared with the state-of-the-art methods.

References

[1] C. C. Aggarwal, X. Kong, Q. Gu, J. Han, and S. Y. Philip. Active learning: A survey. In Data classification, pages 599–634. Chapman and Hall/CRC, 2014.
[2] H. Cai, V. W. Zheng, and K. C.-C. Chang. Active learning for graph embedding. arXiv preprint arXiv:1705.05085, 2017.
[3] C. Campbell, N. Cristianini, A. Smola, et al. Query learning with large margin classifiers. In ICML, volume 20, page 0, 2000.
[4] Z. Chai, S. You, Y. Yang, S. Pu, J. Xu, H. Cai, and W. Jiang. Can abnormality be detected by graph neural networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), Vienna, Austria, pages 23–29, 2022.
[5] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167, 2008.
[6] S. Das, M. R. Islam, N. K. Jayakodi, and J. R. Doppa. Active anomaly detection via ensembles: Insights, algorithms, and interpretability. arXiv preprint arXiv:1901.08930, 2019.
[7] S. Das, W.-K. Wong, T. Dietterich, A. Fern, and A. Emmott. Incorporating expert feedback into active anomaly discovery. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 853–858. IEEE, 2016.
[8] K. Ding, J. Li, R. Bhanushali, and H. Liu. Deep anomaly detection on attributed networks. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 594–602. SIAM, 2019.
[9] K. Ding, Q. Zhou, H. Tong, and H. Liu. Few-shot network anomaly detection via cross-network meta-learning. In Proceedings of the Web Conference 2021, pages 2448–2456, 2021.
[10] Y. Dou, Z. Liu, L. Sun, Y. Deng, H. Peng, and P. S. Yu. Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. In Proceedings of the 29th ACM international conference on information & knowledge management, pages 315–324, 2020.
[11] Y. Dou, K. Shu, C. Xia, P. S. Yu, and L. Sun. User preference-aware fake news detection. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2051–2055, 2021.
[12] L. Gao, H. Yang, C. Zhou, J. Wu, S. Pan, and Y. Hu. Active discriminative network representation learning. In IJCAI International Joint Conference on Artificial Intelligence, 2018.
[13] A. Ghasemi, H. R. Rabiee, M. Fadaee, M. T. Manzuri, and M. H. Rohban. Active learning from positive and unlabeled data. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 244–250. IEEE, 2011.
[14] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235–262, 2013.
[15] T. Huang, D. Wang, and Y. Fang. End-to-end open-set semi-supervised node classification with out-of-distribution detection. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, 2022.
[16] D. Jin, Z. Yu, P. Jiao, S. Pan, D. He, J. Wu, S. Y. Philip, and W. Zhang. A survey of community detection approaches: From statistical modeling to deep learning. IEEE Transactions on Knowledge and Data Engineering, 35(2):1149–1170, 2021.
[17] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[18] A. Li, C. Qiu, M. Kloft, P. Smyth, S. Mandt, and M. Rudolph. Deep anomaly detection under labeling budget constraints. In International Conference on Machine Learning, pages 19882–19910. PMLR, 2023.
[19] S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
[20] F. Liu, X. Ma, J. Wu, J. Yang, S. Xue, A. Beheshti, C. Zhou, H. Peng, Q. Z. Sheng, and C. C. Aggarwal. Dagad: Data augmentation for graph anomaly detection. In 2022 IEEE International Conference on Data Mining (ICDM), pages 259–268. IEEE, 2022.
[21] J. Liu, Y. Wang, B. Hooi, R. Yang, and X. Xiao. Lscale: Latent space clustering-based active learning for node classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 55–70. Springer, 2022.
[22] K. Liu, Y. Dou, Y. Zhao, X. Ding, X. Hu, R. Zhang, K. Ding, C. Chen, H. Peng, K. Shu, et al. Bond: Benchmarking unsupervised outlier node detection on static attributed graphs. Advances in Neural Information Processing Systems, 35:27021–27035, 2022.
[23] Y. Liu, X. Ao, Z. Qin, J. Chi, J. Feng, H. Yang, and Q. He. Pick and choose: a gnn-based imbalanced learning approach for fraud detection. In Proceedings of the web conference 2021, pages 3168–3177, 2021.
[24] Y. Liu, Z. Li, S. Pan, C. Gong, C. Zhou, and G. Karypis. Anomaly detection on attributed networks via contrastive self-supervised learning. IEEE transactions on neural networks and learning systems, 33(6):2378–2392, 2021.
[25] Z. Liu, C. Chen, L. Li, J. Zhou, X. Li, L. Song, and Y. Qi. Geniepath: Graph neural networks with adaptive receptive paths. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4424–4431, 2019.
[26] Z. Liu, Y. Dou, P. S. Yu, Y. Deng, and H. Peng. Alleviating the inconsistency problem of applying graph neural network to fraud detection. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, pages 1569–1572, 2020.
[27] X. Luo, J. Wu, A. Beheshti, J. Yang, X. Zhang, Y. Wang, and S. Xue. Comga: Community-aware attributed graph anomaly detection. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pages 657–665, 2022.
[28] J. Ning, L. Chen, C. Zhou, and Y. Wen. Deep active autoencoders for outlier detection. Neural Processing Letters, pages 1–13, 2022.
[29] S. Noekhah, N. binti Salim, and N. H. Zakaria. Opinion spam detection: Using multi-iterative graph-based model. Information processing & management, 57(1):102140, 2020.
[30] V. Sehwag, M. Chiang, and P. Mittal. Ssd: A unified framework for self-supervised outlier detection. In International Conference on Learning Representations, 2020.
[31] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
[32] B. Settles. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2009.
[33] D. B. Skillicorn. Detecting anomalies in graphs. In 2007 IEEE Intelligence and Security Informatics, pages 209–216. IEEE, 2007.
[34] Y. Song and D. Wang. Learning on graphs with out-of-distribution nodes. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1635–1645, 2022.
[35] M. Stadler, B. Charpentier, S. Geisler, D. Zügner, and S. Günnemann. Graph posterior network: Bayesian predictive uncertainty for node classification. Advances in Neural Information Processing Systems, 34:18033–18048, 2021.
[36] J. Tang, J. Li, Z. Gao, and J. Li. Rethinking graph neural networks for anomaly detection. In International Conference on Machine Learning, pages 21076–21089. PMLR, 2022.
[37] L. Tang and H. Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 817–826, 2009.
[38] D. Wang, J. Lin, P. Cui, Q. Jia, Z. Wang, Y. Fang, Q. Yu, J. Zhou, S. Yang, and Y. Qi. A semi-supervised graph attentive network for financial fraud detection. In 2019 IEEE International Conference on Data Mining (ICDM), pages 598–607. IEEE, 2019.
[39] J. Wang, R. Wen, C. Wu, Y. Huang, and J. Xiong. Fdgars: Fraudster detection via graph convolutional networks in online app review system. In Companion proceedings of the 2019 World Wide Web conference, pages 310–316, 2019.
[40] Y. Wu, Y. Xu, A. Singh, Y. Yang, and A. Dubrawski. Active learning for graph neural networks via node feature propagation. arXiv preprint arXiv:1910.07567, 2019.
[41] L. Yang, Z. Liu, Y. Dou, J. Ma, and P. S. Yu. Consisrec: Enhancing gnn for social recommendation via consistent neighbor aggregation. In Proceedings of the 44th international ACM SIGIR conference on Research and development in information retrieval, pages 2141–2145, 2021.
[42] Q. Yu and K. Aizawa. Unsupervised out-of-distribution detection by maximum classifier discrepancy. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9518–9526, 2019.
[43] D. Zha, K.-H. Lai, M. Wan, and X. Hu. Meta-aad: Active anomaly detection with deep reinforcement learning. In 2020 IEEE International Conference on Data Mining (ICDM), pages 771–780. IEEE, 2020.
[44] Q. Zhang, K. Yu, Z. Guo, S. Garg, J. J. Rodrigues, M. M. Hassan, and M. Guizani. Graph neural network-driven traffic forecasting for the connected internet of vehicles. IEEE Transactions on Network Science and Engineering, 9(5):3015–3027, 2021.
[45] W. Zhang, Y. Shen, Y. Li, L. Chen, Z. Yang, and B. Cui. Alg: Fast and accurate active learning framework for graph convolutional networks. In Proceedings of the 2021 International Conference on Management of Data, pages 2366–2374, 2021.
[46] Y. Zhang and Q. Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
[47] X. Zhao, F. Chen, S. Hu, and J.-H. Cho. Uncertainty aware semi-supervised learning on graph data. Advances in Neural Information Processing Systems, 33:12827–12836, 2020.
[48] W. Zhou, F. Liu, and M. Chen. Contrastive out-of-distribution detection for pretrained transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
[49] Z. Zhou, L.-Z. Guo, Z. Cheng, Y.-F. Li, and S. Pu. Step: Out-of-distribution detection in the presence of limited in-distribution labeled data. Advances in Neural Information Processing Systems, 34:29168–29180, 2021.

Appendix A Notations

Here we list the key notations in our paper in Table 3.

Table 3. Summary of notations.

Notation	Description
$\mathcal{G},\mathbf{A},\mathbf{X},\mathcal{V}$	Graph, adjacency matrix, attribute matrix, node set.
$\mathcal{B},b$	The total budget, query batch size.
$\mathcal{V}_{L}^{N}$	The set of nodes labeled for classification.
$\mathcal{V}_{L}^{t}$	The set of labeled nodes at the $t$ -th iteration.
$\mathcal{V}_{U}^{t}$	The set of unlabeled nodes at the $t$ -th iteration.
$\mathcal{S}^{t}$	The set of selected nodes at the $t$ -th iteration.
$\mathbf{e}$	The entropy of predictions from node classifier.
$\mathbf{p}$	The anomaly score from anomaly score predictor.
$\mathbf{s}$	The overall anomaly score.
$\mathbf{h}_{i}^{t}$	The latent embedding of $v_{i}$ at the $t$ -th iteration.
$\hat{\mathbf{h}}_{i}^{t}$	The distance feature of $v_{i}$ at the $t$ -th iteration.
$dist(v_{i},v_{j})$	The masked distance between $v_{i}$ and $v_{j}$ .
$\mathbf{c}^{N},\mathbf{c}^{A}$	The confidence of node classifier and anomaly score predictor in identifying anomalies.
$\mathbf{d}$	The confidence difference.
$Info$	The informativeness score.
$\tau$	The decaying parameter in informativeness score.
$\phi$	The weight term in overall anomaly score.
$\alpha$ , $\beta$	Weight of classification and anomaly detection loss.
$m$	The number of clusters.

Appendix B DETAILS OF EXPERIMENTAL SETTINGS

B.1. Datasets

We employ four datasets to evaluate the performance of MITIGATE. Cora and Citeseer [31] are two citation networks, in which nodes represent papers while edges represent citation relationships among papers. The class labels denote corresponding academic fields. BlogCatalog and Flickr [37] are two social networks, in which nodes represent users while edges represent social relationships between users. The class labels denote the interests of users. The dataset statistics are shown in Table 4.

Following the previous studies [8, 24, 22, 20], we inject contextual and structural anomalies into four widely used datasets: Cora, CiteSeer, Flickr, and BlogCatalog. (1) Structural anomalies. The idea behind synthetic structural anomalies is that anomalies can manifest as densely connected nodes within small cliques [33]. For every clique, $p$ nodes are randomly selected and interconnected completely. Repeating this step $q$ times to generate $q$ cliques, ultimately injecting $p\times q$ structural anomalies. In our experiments, we fix $p=15$ and set $q$ to $10,15,5,5$ for BlogCatalog, Flickr, Cora, and CiteSeer, respectively. (2) Contextual anomalies. The contextual anomalies are injected by perturbing the attributes of nodes. Initially, a node $v_{i}$ is randomly selected from the vertex set $V$ . Then, $k$ additional nodes are sampled from $V\setminus v_{i}$ , forming a subset $V_{c}$ . For each node $v_{c}\in V_{c}$ , we measure the Euclidean distance to $v_{i}$ and subsequently adjust the attributes of $v_{i}$ to be the same as $v_{c}$ with the largest distance. In our experiments, we set $k=50$ and the number of contextual anomalies as $p\times q$ , following previous studies to ensure an adequately large disturbance magnitude and to balance the proportions of different anomaly types [24].

Table 4. Dataset Analysis

Dataset	#Nodes	#Edges	#Attributes	#Class	#Anomalies
Cora	2,708	5,429	1,433	7	150
CiteSeer	3,327	4,732	3,703	6	150
BlogCatalog	5,196	171,743	8,189	6	300
Flickr	7,575	239,738	12,047	9	450

B.2. Baselines

In our experiments, the compared methods include (1) OOD detection methods (GCN-ENT, GKDE, OODGAT-ENT and OODGAT-ATT), (2) semi-supervised anomaly detection methods (FdGars, GeniePath, BWGNN and DAGAD), (3) active query strategy for anomaly detection (most positive query, positive diverse query, and diverse query). The details of the baselines are as follows:

•

GCN-ENT [17] is a vanilla GCN method that identifies anomalous nodes by uncertainty measured based on the entropy of node classification.
•

GKDE [47] is a Graph-based Kernel Dirichlet GCN method for semi-supervised node classification and OOD detection.
•

OODGAT [34] is a graph learning method with OOD nodes to distinguish inliers from outliers during feature propagation. In particular, OODGAT-ENT uses the entropy of the predicted distribution as an indicator of outliers, while OODGAT-ATT relies on the score provided by a binary classifier.
•

FdGars [39] is an anomaly detection method based on GCN that expresses the characteristics of reviewers and relationships between reviewers.
•

GeniePath [25] proposes to adaptively select receptive paths for different nodes in aggregation for anomaly detection.
•

BWGNN [36] analyzes graph anomalies in the spectral domain and leverages Beta graph wavelet to better capture anomaly information on graphs.
•

DAGAD [20] devises an augmentation module that fertilizes the training set with generated samples to alleviate the scarcity of anomalous samples.
•

Most positive query selects the top-k samples ordered by their anomaly scores.
•

Positive diverse query combines anomaly scores with distance-based diversification into querying.
•

Diverse query [18] utilizes k-means++ to initialize diverse clusters. The probability of another query from the unlabeled set is proportional to its distance to the closest sample already in the query set.

We implement GKDE ¹¹1https://github.com/zxj32/uncertainty-GNN, OODGAT ²²2https://github.com/SongYYYY/KDD22-OODGAT, BWGNN ³³3https://github.com/squareRoot3/Rethinking-Anomaly-Detection and DAGAD ⁴⁴4https://github.com/FanzhenLiu/DAGAD using the code published by their authors. For FdGars and GeniePath, we use the code provided by an open-source library ⁵⁵5https://github.com/safe-graph/DGFraud for graph anomaly detection. Regarding other active query strategies, we adopt the code ⁶⁶6https://github.com/aodongli/Active-SOEL provided by [18], and use the same network structure as the encoder and anomaly score predictor in MITIGATE.

B.3. Implementation Notes

We conduct all experiments on a Linux server with 64G RAM, and 2 NVIDIA GeForce RTX 2080TI with 11GB GPU memory. We implement MITIGATE with Python 3.8.1 and PyTorch 2.0.1. We report our hyperparameter settings that are tuned on the validation set with 5 runs in Table 5.

Table 5. Hyperparameter settings of MITIGATE.

	Cora	Citeseer	BlogCatalog	Flickr
$\tau$	0.95	0.90	0.98	0.98
$\alpha$	1.25	0.50	1.25	1.25
$\beta$	0.50	2.00	1.00	0.50
$\phi$	1.25	2.00	1.00	0.50
$m$	24	24	18	27