1 Department of Information Engineering, Electronics and Telecommunications (DIET), “Sapienza” University of Rome, Via Eudossiana 18, 00184, Rome, Italy.
2 Department of Physics, “Sapienza” University of Rome, Piazzale A. Moro 5, 00185, Rome, Italy.
3 INFN Sezione di Roma, Piazzale Aldo Moro, 5, Rome, 00185, Italy, UE.
4 Department of Physics, University of Liverpool, Oxford Street Liverpool, L69 7ZE, United Kingdom.
Email: {alessio.verdone; alessio.devoto; stefano.giagu; simone.scardapane; massimo.panella}@uniroma1.it; {cristiano.sebastiani; joseph.carmignani; Monica.D’Onofrio}@cern.ch

Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques

A. Verdone1    A. Devoto1    C. Sebastiani4    J. Carmignani4    M. D’Onofrio4    S. Giagu2,3    S. Scardapane1,3 and M. Panella1
Abstract

The experiments at the Large Hadron Collider at CERN generate vast amounts of complex data from high-energy particle collisions. This data presents significant challenges due to its volume and complex reconstruction, necessitating advanced analysis techniques. Recent advancements in deep learning, particularly Graph Neural Networks, have shown promising results in addressing these challenges but remain computationally expensive. The study presented in this paper uses a simulated particle collision dataset to integrate influence analysis into the graph classification pipeline, aiming to improve the accuracy and efficiency of collision event prediction tasks. After an initial training stage with a Graph Neural Network, we apply a gradient-based data influence method to identify influential training samples and then refine the dataset by removing non-contributory elements: the model trained on this reduced dataset can achieve good performance at a reduced computational cost. The method is agnostic to the specific influence method: different influence modalities can be easily integrated into our methodology. Moreover, by analyzing the discarded elements we can provide further insights about the event classification task. The novelty of integrating data attribution techniques with Graph Neural Networks in high-energy physics tasks can offer a robust solution for managing large-scale data problems, capturing critical patterns, and maximizing accuracy across several data-intensive domains.

Keywords:
Graph Neural Networks, High-energy physics, Data Attribution method

1 Introduction

The Large Hadron Collider (LHC) at CERN provides high-energy particle beams for experiments like ATLAS [1], which generate vast amounts of data from collisions. These collisions produce a vast array of particles that are detected by sophisticated experimental apparatus. The data collected are of extreme importance for understanding the fundamental nature of matter and the universe. However, the sheer scale and complexity of the data pose significant challenges for efficient and accurate analysis [6]. For example, the output from ATLAS event reconstruction can generate a data stream of more than 3.5 terabytes per second [3]: this enormous amount of data is then processed and analyzed by teams of scientists and researchers who use a variety of techniques and algorithms to extract meaningful information from the data. The complexity of the data is further exacerbated by the presence of missing values, outliers, and noisy data points, which can lead to inaccurate and biased results [15, 7, 43]. The analysis of the data requires a deep understanding of the underlying physics and the ability to identify patterns and relationships that may not be immediately apparent, hence necessitating a high degree of expertise and specialized knowledge. In recent years, machine learning and deep learning techniques have shown very promising results in addressing the challenges posed by the LHC data [20]. These methods have been successfully applied to a range of tasks, including particle identification [46], event reconstruction [36], and background subtraction [10]. However, the complexity and scale of the data require the development of more sophisticated and scalable methods that can effectively handle the data volume generated by the LHC experiments. 
By representing the data as a graph, where particles and their interactions are denoted by nodes and edges respectively, Graph Neural Networks (GNNs) can learn high-level representations of the data that capture the complex relationships between particles and their correlations [37, 24]. This enables more accurate and efficient analyses, as well as the ability to identify patterns and relationships that may not be apparent through traditional methods. Although GNNs, like any deep learning model, are beneficial for the analysis of large datasets, when the datasets are excessively large the computational cost of these models becomes high and possibly prohibitive. Data attribution methods have emerged as crucial tools in machine learning and data analysis scenarios [12, 22, 32, 25] to address these challenges by offering insights into the inner workings of complex models and shedding light on the factors that drive the predictions at the sample level. These methods provide a means to understand the importance (or influence) of data points in driving the output of a model, thereby enhancing interpretability and trustworthiness. Data attribution methods trace a model’s behavior back to its training dataset, offering an effective approach to better understanding “black-box” neural networks. Several methods for detecting the influence of training data have been proposed in recent years, such as TRAK [33], SimFluence [21], and Datamodels [23]. One of the most important methods is TracIn [34], which uses loss gradients to generate influence scores between training and testing samples. It can also be used to generate influence scores between elements of the training set, which is useful for discovering anomalies inside the training dataset. Recently, several works have successfully applied influence analysis to large-scale generative AI tasks, such as diffusion models and large language models (LLMs) [44, 29, 40, 50, 19]. 
The scenarios considered in these works are all characterized by a huge amount of data. For example, in image classification tasks, influence methods can identify the most confounding images in the training set (e.g., multiple classes of objects inside a single image) or the presence of wrong labels in the dataset. Removing harmful or redundant elements enhances overall performance, improving classification metrics and reducing computational costs.

Figure 1: Collision event represented in a 2D plane with $\varphi$ and $\eta$ as axes. The $\varphi$-$\eta$ plane in an LHC experiment is a coordinate system used to describe the angular distribution of particles, where $\eta$ measures the particle’s angle relative to the beam axis and $\varphi$ represents the azimuthal angle around the beam axis. Edges of the fully connected graph are not shown for clarity.

To the best of our knowledge, existing literature lacks methods that effectively combine training data attribution techniques with graph data and graph neural networks. Furthermore, the domain of high-energy physics provides a robust testing ground for evaluating our approach to tackling complex real-world problems. Carefully selecting elements for the training set not only facilitates the management of large-scale data, which is a common challenge in physics and numerous other fields, but also enables the capture of crucial patterns and relationships within the data. This approach aims to maximize predictive accuracy and generalization capability across diverse domains and applications.

In this study, we integrate efficient influence analysis into the graph classification process to enhance the accuracy and efficiency of predictions regarding high-energy particle collision events. The pipeline can be summarized in three steps. First, a GNN model is trained to classify collision event types. Then, using training checkpoints, TracIn identifies the influence relationships between samples, allowing us to discover which training samples contribute positively or negatively to the collision classification task. Finally, we remove or replace the training dataset elements that do not improve the task and train a GNN on the selected, reduced dataset for the same task. Our experiments were conducted using a large dataset of simulated particle collisions. This approach also enables us to perform an explainability analysis of the problem. By understanding the characteristics of the discarded elements, as well as those deemed significant for the problem, we gain insight into both the prediction model and the downstream task. Our contributions can be summarized as follows:

  • Integration of Influence Analysis in Classification: We incorporated TracIn, a data attribution method, into the classification process of particle collision events using GNNs. This allowed us to identify and refine the training dataset by removing non-contributory samples, enhancing the efficiency and accuracy of the classification task.

  • Improved Performance and Reduced Computational Costs: By refining the dataset and focusing on significant training samples, we improved overall classification performance and reduced computational costs. This approach led to better utilization of resources and more accurate predictions.

  • Enhanced Explainability and Insights: Our method provides a detailed explainability analysis, offering insights into the characteristics of influential data elements. This not only helps in understanding the prediction model better but also aids in managing large-scale data by capturing critical patterns and relationships within the data.

Figure 2: Our proposed methodology: we initially train the GNN network on the original full-size dataset or a subset of it. Then, we employ the saved checkpoints to compute influence values on training data: values with a higher score will be filtered out. We obtain a distilled dataset on which we perform the final training.

2 Related works

2.1 Graph Neural Networks

GNNs are a type of neural network designed to work with graph-structured data, able to learn representations that capture complex relationships between nodes. They can be easily adapted to different graph tasks, such as node or graph classification and regression, edge prediction, graph generation, or node clustering. They have found several applications in real scenarios, such as community detection, social network analysis, molecular property prediction, and knowledge graph generation; moreover, they have found wide application in particle physics [37] and high-energy physics (HEP) (e.g., particle tracking and reconstruction [16, 24]). The physics tasks of the LHC present many potential applications where graph neural networks have been successfully applied [14]. [17] employed a GNN for the determination of charged particle trajectories in collisions. [31] tackled the pileup mitigation problem, i.e., the presence of parasitic low-transverse-momentum collisions, by employing a three-layer Gated Graph Neural Network with residual connections. More recently, [9] proposed a GNN that is equivariant with respect to rotations around the jet axis to extract novel phenomena in the standard model effective field theory (SMEFT) context from LHC collision data.

2.2 Data attribution

Training Data Attribution (TDA) methods aim to quantify the influence of individual training data points on the predictions made by a machine learning model. Influence estimation approaches can be divided into two main classes: retraining-based and gradient-based [22]. Retraining-based methods assess the influence of training data by repeatedly retraining the model on different subsets of the training set, while gradient-based influence estimators determine influence by analyzing the alignment of training and test instance gradients, either throughout the training process or at its conclusion. Retraining-based methods include leave-one-out (LOO) [42], the simplest but most computationally expensive approach, and downsampling [18]. Gradient-based methods are of greater interest here: they typically provide closed-form TDA scores by employing gradients in an efficient and scalable way. [28] was one of the first works in this field, approximating the true influence of a training point by means of the gradients of the loss function. TracIn [34] traces loss changes on test points during the training process, while TRAK [33] uses the neural tangent kernel with random projection to assess influence. These gradient-based methods have significantly lower computational costs than retraining-based methods. However, they typically rely on a first-order approximation of the loss, which can lead to performance degradation on neural networks [5, 4] and make them more sensitive to the randomness associated with model weight initialization and training mechanisms [25]. Recent work in TDA has demonstrated that ensembling improves the scores of gradient-based methods and mitigates these issues [13, 12]. 
Ensembling usually involves applying the TDA method to many independently trained models (e.g., averaging the final TDA scores or aggregating some intermediate terms for score calculation). Despite their effectiveness, these ensembling methods require a substantial number of models to perform well, which incurs a significant computational cost.

2.3 Data distillation

Data distillation refers to the process of carefully choosing which data points to include in the training set for a deep learning model, as the quality and distribution of the training data can significantly impact the model’s performance or computational resources needed [35]. It involves summarizing or compressing a large dataset into a smaller, more manageable subset while retaining the most essential information needed for training models. This process aims to maintain the performance of models trained on the distilled data, ensuring that they perform similarly to models trained on the full dataset. Data distillation methods can be categorized into four main types. Meta-model matching [41, 30] optimizes the transferability of models trained on distilled data to the original dataset. Gradient matching [47, 49] aligns the gradients of training and distilled datasets to ensure similar model performance. Trajectory matching [11, 8] aims to match the training trajectories of models on distilled and full datasets. Distribution matching [48, 39] directly aligns the statistical distributions of the distilled and original datasets. These methods create high-fidelity, compressed datasets that retain essential information for effective machine learning model training and inference. Influence functions have not yet been used for direct dataset distillation, but they have been employed together in some similar tasks. [45] has used a distilled dataset with a reverse gradient matching technique to approximate the computation of influence values of a smaller dataset achieving promising results. [34] shows the effectiveness of TracIn methods by identifying mislabeled data and filtering them out of the dataset.

Figure 3: Complete graphs with kinematic features as nodes.

| Particle | F1 | F2 | F3 | F4 | F5 | F6 |
| --- | --- | --- | --- | --- | --- | --- |
| jet1 | $p_{\mathrm{T}}^{\mathrm{j1}}$ | $\eta^{\mathrm{j1}}$ | $\phi^{\mathrm{j1}}$ | $\mathrm{j1_{quantile}}$ | - | - |
| jet2 | $p_{\mathrm{T}}^{\mathrm{j2}}$ | $\eta^{\mathrm{j2}}$ | $\phi^{\mathrm{j2}}$ | $\mathrm{j2_{quantile}}$ | - | - |
| jet3 | $p_{\mathrm{T}}^{\mathrm{j3}}$ | $\eta^{\mathrm{j3}}$ | $\phi^{\mathrm{j3}}$ | $\mathrm{j3_{quantile}}$ | - | - |
| b1 | $p_{\mathrm{T}}^{\mathrm{b1}}$ | $\eta^{\mathrm{b1}}$ | $\phi^{\mathrm{b1}}$ | $\mathrm{b1_{quantile}}$ | $\mathrm{b1_{m}}$ | - |
| b2 | $p_{\mathrm{T}}^{\mathrm{b2}}$ | $\eta^{\mathrm{b2}}$ | $\phi^{\mathrm{b2}}$ | $\mathrm{b2_{quantile}}$ | $\mathrm{b2_{m}}$ | - |
| lepton | $p_{\mathrm{T}}^{\mathrm{l1}}$ | $\eta^{\mathrm{l1}}$ | $\phi^{\mathrm{l1}}$ | - | - | - |
| energy | $E_{\mathrm{T}}^{\mathrm{Miss}}$ | - | $\phi^{\mathrm{ETMiss}}$ | - | - | - |

Figure 4: Features [2] exploited for each node.

3 Methodology

Our method integrates TracIn, a prominent data attribution method, with GNNs to enhance event classification tasks in ATLAS analyses, improving performance and interpretability. We developed a three-step method: first, we train a GNN model to classify collision event types; then, we use TracIn to compute influence scores for the training samples; finally, we remove the training elements that do not positively contribute to the classification task and re-train the model on the selected subset, with the aim of improving classification metrics and reducing computational costs.

3.1 Problem formulation

The ‘SUSY dataset’ [2] contains Monte Carlo simulated collision events recorded with the ATLAS experiment, representing signals over a large background with observable kinematic features. Two types of events are considered:

  • Signal: SUSY Dark Matter Monte Carlo candidate events.

  • Background: SM backgrounds from single top and top-antitop processes.

An example of a signal event is presented in Fig. 1. The main task involves recognizing rare signals over large backgrounds from Standard Model processes. To recognize them, we have kinematic features that offer discriminating power in solving the task. The particle collision event can be represented as a graph $G$: the GNN takes $G$ as input and outputs its probabilities $Y$ over classes 0 (background) or 1 (signal). The collision events can be represented as fully connected graphs with $N$ nodes (6 or 7) and a maximum of 6 features per node, i.e., $X \in \mathbb{R}^{N \times 6}$. The particle features [2] introduced are: the transverse momentum $p_T^i$, the angular variables $\varphi^i$ and $\eta^i$, the missing transverse momentum $E_T^{miss}$, the mass $bi_m$, and the jet flavor probability $j_{quantile}$. By defining the graph as fully connected, we can define its adjacency matrix $A \in \mathbb{R}^{N \times N}$. Then, $G$ can be alternatively expressed as $G = (A, X)$. 
A graph representation and a table listing the features employed for each node are presented in Fig. 3 and Fig. 4, respectively.
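As a minimal sketch of the representation $G = (A, X)$ described above, the following builds the feature matrix and the adjacency matrix of one event. The zero padding of missing features, the absence of self-loops, the helper name `build_event_graph`, and all numeric values are assumptions made for illustration; they are not taken from the SUSY dataset.

```python
# Illustrative sketch only: padding value, self-loop convention, and the
# example numbers below are assumptions, not taken from the dataset.
NUM_FEATURES = 6  # maximum number of features per node (F1..F6)


def build_event_graph(node_features, pad_value=0.0):
    """Return (A, X) for a fully connected event graph without self-loops.

    node_features: one list per node with up to 6 entries; None marks a
    feature that is not defined for that node.
    """
    n = len(node_features)
    X = []
    for feats in node_features:
        # Replace undefined entries, then pad to the common width of 6.
        row = [pad_value if v is None else float(v) for v in feats]
        row += [pad_value] * (NUM_FEATURES - len(row))
        X.append(row)
    # Fully connected: every pair of distinct nodes shares an edge.
    A = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    return A, X


# Hypothetical 7-node event (made-up values), following the feature table.
event = [
    [120.0, 0.4, 1.2, 2.0],        # jet1: pT, eta, phi, quantile
    [85.0, -1.1, -2.3, 1.0],       # jet2
    [40.0, 2.0, 0.3, 1.0],         # jet3
    [95.0, 0.1, 2.8, 5.0, 12.5],   # b1: pT, eta, phi, quantile, mass
    [60.0, -0.7, -0.9, 4.0, 9.8],  # b2
    [35.0, 1.5, -1.7],             # lepton: pT, eta, phi
    [150.0, None, 0.6],            # energy: ETmiss, (no eta), phi
]
A, X = build_event_graph(event)
```

With this convention every event yields an $N \times 6$ feature matrix regardless of which features each particle type defines.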

3.2 Framework’s workflow

Preliminary training.

The GNN baseline for our experiments consists of a sequence of two Graph Convolutional operators ($GConv$) [27], a global mean pooling operator ($GlobMeanPool$), and a final linear layer ($Lin$); we used ReLU as the non-linear activation. Formally, the model can be expressed as:

$\hat{y} = Lin(GlobMeanPool(GConv_2(GConv_1(X, A), A)))$ (1)

Once the model is defined, the first step of our workflow consists of training it on the original full training set or on a randomly selected subset of it: we call these approaches GNN-FT and GNN-RST, respectively. This step is essential since it allows us to collect the training checkpoints that are later used by the TracIn method to generate influence scores. Moreover, the experimental results obtained from both GNN-FT and GNN-RST serve as terms of comparison for our method.
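The forward pass of Eq. (1) can be sketched without any deep learning library. The sketch below assumes that $GConv$ follows the common symmetric-normalized propagation rule $\mathrm{ReLU}(D^{-1/2}(A+I)D^{-1/2} X W)$, that ReLU follows both convolutions, and that the linear layer produces a single logit passed through a sigmoid; the hidden size, the random weights, and all function names are hypothetical.

```python
import math
import random


def matmul(P, Q):
    """Plain matrix product of two lists of lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def relu(M):
    return [[max(0.0, v) for v in row] for row in M]


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (assumed rule)."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gconv(X, A_norm, W):
    """One graph convolution followed by ReLU."""
    return relu(matmul(matmul(A_norm, X), W))


def global_mean_pool(H):
    """Average node embeddings into one graph-level vector."""
    return [sum(col) / len(H) for col in zip(*H)]


def forward(X, A, W1, W2, w_lin, b_lin):
    """Eq. (1): Lin(GlobMeanPool(GConv2(GConv1(X, A), A))), then sigmoid."""
    A_norm = normalize_adjacency(A)
    h = global_mean_pool(gconv(gconv(X, A_norm, W1), A_norm, W2))
    logit = sum(w * v for w, v in zip(w_lin, h)) + b_lin
    return 1.0 / (1.0 + math.exp(-logit))


# Toy usage with random inputs and weights (hypothetical sizes).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


N, F, HID = 7, 6, 8
X = rand_matrix(N, F)
A = [[0.0 if i == j else 1.0 for j in range(N)] for i in range(N)]
p_signal = forward(X, A, rand_matrix(F, HID), rand_matrix(HID, HID),
                   [rng.uniform(-0.1, 0.1) for _ in range(HID)], 0.0)
```

In practice such a model would be written with a graph learning library and trained with automatic differentiation; the plain-Python version only makes the order of operations in Eq. (1) explicit.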

Influence-based training.

We employ the TracIn [34] method as a baseline for estimating training data influence scores. It assigns an influence score to each training sample to quantify its impact on the trained model. TracIn generates influence scores via a scalable and efficient implementation: it replaces the exact computation of the influence values with a first-order gradient approximation to reduce the computational cost, and it uses checkpoints to reproduce the training process more efficiently. In addition, we compute the loss gradients only with respect to the final layer, i.e., the last linear layer. All these characteristics make TracIn a suitable candidate for data-intensive scenarios. For each training sample, we compute its influence score on itself: these values are called Self-influence (SI). Denoting the loss function by $l$, the number of available checkpoints by $k$, the learning rate by $\eta$, the trainable weights of the $j$-th layer by $w_j$, and a training sample by $x$, the Self-influence score can be computed as follows:

$SelfInfluence(x) = \sum_{i=1}^{k} \eta_i \, \nabla l(w_{j_i}, x) \cdot \nabla l(w_{j_i}, x)$ (2)

It traces how a training point influences its own prediction: high self-influence scores correspond to the most divergent samples, such as potential outliers, mislabeled data, or, more generally, samples with contrasting behavior. Self-influence scores have previously been used for finding mislabeled or confounding images and for unsupervised anomaly detection tasks [34, 38]. The main idea of our method is that by removing samples that are harmful, superfluous, or counterproductive from the point of view of the model and the task, we can increase both model accuracy and computational efficiency. Once we have a self-influence score for each training sample, we filter out the samples with the highest values, and the remaining ones constitute the final training set. In this way, training elements that do not positively contribute to the classification task are removed, and the final dataset helps increase classification metrics and reduce computational costs. In the final step, we train the GNN baseline on the influence-based reduced training set: we name this approach GNN-IRST.
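The self-influence score of Eq. (2) and the subsequent filtering step can be sketched as follows. The sketch assumes a single-logit final linear layer trained with binary cross-entropy, for which the gradient has a closed form, and it interprets the influence percentage as the fraction of lowest-scoring samples that is kept; the function names, the toy embeddings, and the checkpoint values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def last_layer_grad_sq_norm(h, y, w, b):
    """Squared norm of the BCE gradient w.r.t. the final linear layer.

    For logit = w.h + b: dl/dw = (p - y) * h and dl/db = (p - y).
    """
    p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
    residual = p - y
    return residual ** 2 * (sum(hi * hi for hi in h) + 1.0)


def self_influence(sample_embeddings, y, checkpoints):
    """Eq. (2): sum over checkpoints of lr_i * ||grad l(w_i, x)||^2.

    sample_embeddings[i] is the pooled graph embedding of the sample at
    checkpoint i; checkpoints is a list of (learning_rate, w, b).
    """
    return sum(lr * last_layer_grad_sq_norm(h, y, w, b)
               for h, (lr, w, b) in zip(sample_embeddings, checkpoints))


def filter_by_self_influence(scores, keep_fraction=0.8):
    """Indices of the keep_fraction samples with the lowest scores."""
    n_keep = int(round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_keep])


# Toy data: 5 samples, 2 checkpoints. Embeddings are held fixed across
# checkpoints only to keep the example short.
checkpoints = [(1e-3, [0.5, -0.5], 0.0), (1e-3, [1.0, -1.0], 0.0)]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1]]
labels = [1, 1, 0, 0, 0]  # the last sample looks like class 1 but is labeled 0
scores = [self_influence([h] * len(checkpoints), y, checkpoints)
          for h, y in zip(embeddings, labels)]
kept = filter_by_self_influence(scores, keep_fraction=0.8)
```

In this toy case the sample whose label contradicts its embedding receives the largest score and is the one filtered out, which mirrors the intuition that high self-influence flags outliers or mislabeled data.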

Figure 5: AUROC score profile varying percentages of the initial randomly selected dataset, using 0.8 (a) and no (b) thresholds on influence values.
Table 1: Best performance metrics of GNN-baseline trained on full-size dataset (FT), on random-selected training subset (RST) and influence-random-selected training subset (IRST).

| Method | % Train Samples | % Influence Samples | % Total Train Samples | Accuracy | F1-score | Precision | Recall | AUROC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GNN-FT | 100 | - | 100 | 74.32 ± 0.7 | 72.96 ± 1.6 | 77.27 ± 3.4 | 69.65 ± 5.8 | 82.92 ± 0.7 |
| GNN-RST | 80 | - | 80 | 73.57 ± 1.2 | 72.58 ± 2.2 | 75.96 ± 4.8 | 70.47 ± 8.4 | 82.67 ± 0.9 |
| GNN-IRST (Ours) | 80 | 80 | 64 | 74.67 ± 1.0 | 74.18 ± 1.7 | 74.99 ± 3.7 | 74.20 ± 7.8 | 82.33 ± 0.9 |
| GNN-IRST (Ours) | 90 | 80 | 72 | 74.76 ± 0.7 | 73.59 ± 2.0 | 77.43 ± 3.5 | 70.72 ± 6.4 | 82.95 ± 0.8 |

4 Experimental setup

We performed several experiments to validate the efficiency of our proposed methodology. In the following, we describe the experimental setup and present the numerical and visual results.

4.1 Experiments’ Parameters

Our experiments included the exploration of various parameters, each contributing uniquely to the refinement of our methodology. The metrics used for the method’s evaluation were Precision, Recall, F1-score, Accuracy, and AUROC. We utilized the Adam [26] optimizer with a learning rate of $1 \times 10^{-3}$ and a weight decay of $5 \times 10^{-4}$. Numerical results are presented as the mean and standard deviation from 4 executions with different random seeds. The experiments ran for a maximum of 300 epochs, with 16,000 samples.
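As an illustration of how the threshold-based metrics and the per-seed aggregation described above could be computed, the following sketch derives Accuracy, Precision, Recall, and F1-score from binary predictions and summarizes them across runs. AUROC is omitted because it requires ranking the predicted probabilities. The use of the sample standard deviation and all names are assumptions; the source does not state which estimator was used.

```python
from statistics import mean, stdev


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = signal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def summarize(runs):
    """Mean and sample standard deviation of each metric across seeds."""
    return {name: (mean(r[name] for r in runs), stdev(r[name] for r in runs))
            for name in runs[0]}


# Toy usage: two hypothetical runs with different seeds.
run_a = classification_metrics([1, 1, 0, 0], [1, 0, 0, 1])
run_b = classification_metrics([1, 1, 0, 0], [1, 1, 0, 1])
summary = summarize([run_a, run_b])
```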

We evaluated the performance using different subset percentages of the training set: 95%, 90%, 80%, 50%, 20%, 10%, and 5%. Additionally, we examined influence-based subsets with percentages of 100%, 80%, 50%, and 20%. This comprehensive analysis allowed us to assess how training set composition and influence-based selection impact the GNN model’s performance, providing insights into the efficiency and effectiveness of our method. We tested three different setups:

  • GNN-FT: GNN model trained on the full train set.

  • GNN-RST: GNN model trained on a randomly selected subset of the original train set.

  • GNN-IRST (Ours): GNN model trained on the influence-based random subset of the train set.

Testing and metrics comparisons are conducted relative to the original test set size, serving as the benchmark for evaluating performance and assessing the effectiveness of various methods.
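The share of the original training set that GNN-IRST finally uses appears to be the product of the two percentages above: the random subset is drawn first, and the influence-based selection is then applied to it. The tables are consistent with this reading (for example, 80% of an 80% subset gives 64%). A minimal sketch, with a hypothetical helper name:

```python
def effective_train_percentage(random_subset_pct, influence_keep_pct):
    """Percentage of the original training set left after both selections."""
    return random_subset_pct * influence_keep_pct / 100.0


# Random-subset percentages paired with an 80% influence-based selection.
effective = {pct: effective_train_percentage(pct, 80) for pct in (95, 90, 80, 50)}
```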

Figure 6: F1-score profile varying percentages of the initial randomly selected dataset, using 0.5 (a) and 0.8 (b) thresholds on influence values.
Table 2: Best performance metrics of GNN-baseline trained on different training set setups and subset percentage.

| Method | % Train Samples | % Influence Samples | % Original Train Set | Accuracy | F1-score | Precision | Recall | AUROC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GNN-FT | 100 | - | 100 | 74.32 ± 0.7 | 72.96 ± 1.6 | 77.27 ± 3.4 | 69.65 ± 5.8 | 82.92 ± 0.7 |
| GNN-RST | 95 | - | 95 | 73.40 ± 1.5 | 71.75 ± 1.7 | 76.93 ± 3.2 | 67.40 ± 3.9 | 81.98 ± 1.0 |
| GNN-IRST | 95 | 80 | 76 | 73.40 ± 1.1 | 73.40 ± 2.2 | 72.27 ± 2.8 | 76.72 ± 6.0 | 81.55 ± 1.6 |
| GNN-RST | 90 | - | 90 | 73.60 ± 0.9 | 72.50 ± 2.1 | 76.14 ± 3.7 | 69.89 ± 6.5 | 82.78 ± 1.0 |
| GNN-IRST | 90 | 80 | 72 | 74.76 ± 0.7 | 73.59 ± 2.0 | 77.43 ± 3.5 | 70.72 ± 6.4 | 82.95 ± 0.8 |
| GNN-RST | 80 | - | 80 | 73.57 ± 1.2 | 72.58 ± 2.2 | 76.00 ± 4.8 | 70.47 ± 8.4 | 82.67 ± 0.9 |
| GNN-IRST | 80 | 80 | 64 | 74.67 ± 1.0 | 74.18 ± 1.7 | 74.99 ± 3.7 | 74.20 ± 7.8 | 82.33 ± 0.9 |
| GNN-RST | 50 | - | 50 | 72.92 ± 1.1 | 71.25 ± 2.3 | 76.00 ± 2.7 | 67.37 ± 5.5 | 81.07 ± 1.6 |
| GNN-IRST | 50 | 80 | 40 | 73.06 ± 1.0 | 74.18 ± 1.6 | 71.28 ± 0.1 | 77.57 ± 4.0 | 81.25 ± 1.7 |

4.2 Experimental results

Primary numerical results are presented in Tab. 1: we compare GNN-FT and GNN-RST results with our two best-performing models. Across the selected metrics, our method achieves results that are competitive with or better than both baselines. By employing only 64% of the original training set, we achieve a higher F1-score and Recall than GNN-FT, while with 72% we also achieve better Accuracy, Precision, and AUROC. Moreover, this last setup outperforms the GNN-RST methodology on all metrics while using 8% less training data. In Tab. 2 we perform an in-depth comparison of GNN-IRST with GNN-FT and GNN-RST at different training subset percentages. For almost all metrics and setups, the influence-based selection mechanism achieves better results than the random one: this shows how a careful choice of training data can increase the overall performance of the method. Fig. 5 and Fig. 6 show the evolution of the AUROC and F1-score metrics, respectively, at different data percentages. In Fig. 5 we can observe a slight improvement of the AUROC metric when employing 20% fewer training samples than the random-based subset selection. In Fig. 6a and Fig. 6b we instead observe better performance for almost all subset percentage values; moreover, here we achieve better F1-score results than the full-size training set setup at many different subset percentage values.

5 Conclusions

Utilizing data attribution methods for data selection proves to be an efficient strategy for enhancing the performance of GNN models in scenarios characterized by large amounts of data. By selecting a smaller subset of the original dataset based on self-influence values, our method can achieve competitive or superior results compared to approaches that use the full training set or a randomly reduced version of it. Carefully selecting training data can improve accuracy and reduce the computational and operational costs of data-intensive problems, as shown in our experiments on simulated LHC collision data. This selection also aids in the effective downstream processing of vast amounts of data. Moving forward, our efforts will focus on developing a custom architecture able to leverage the full potential of influence information, and on exploring and implementing other data influence methods: these techniques not only expand the traditional learning framework but also enhance the interpretability of the inner workings of these networks. Interpretability is particularly in demand in intricate and critical scenarios like high-energy physics.

Acknowledgment

The work is partly funded by the European Union’s CHIST-ERA programme under grant agreement CHIST-ERA-19-XAI-009 (MUCCA). AV is supported in part by the Italian Ministry of University and Research (MUR), which funded his PhD as per the Ministerial Decree no. 1061/2021. SG is supported by PNRR MUR project PE0000013-FAIR. SS is partly funded by Sapienza grants RM1221816BD028D6 (DESMOS) and RG123188B3EF6A80 (CENTS).

References

  • [1] ATLAS Collaboration: The ATLAS Experiment at the CERN Large Hadron Collider. JINST 3, S08003 (2008)
  • [2] ATLAS Collaboration: Search for direct production of electroweakinos in final states with one lepton, jets and missing transverse momentum in pp collisions at $\sqrt{s}$ = 13 TeV with the ATLAS detector. JHEP 12, 167 (2023)
  • [3] ATLAS Collaboration: Software and computing for Run 3 of the ATLAS experiment at the LHC (Apr 2024)
  • [4] Bae, J., Ng, N., Lo, A., Ghassemi, M., Grosse, R.B.: If influence functions are the answer, then what is the question? Advances in Neural Information Processing Systems 35, 17953–17967 (2022)
  • [5] Basu, S., Pope, P., Feizi, S.: Influence functions in deep learning are fragile. arXiv preprint arXiv:2006.14651 (2020)
  • [6] Bird, I.: Computing for the Large Hadron Collider. Ann. Rev. Nucl. Part. Sci. 61, 99–118 (2011)
  • [7] Buss, T., Dillon, B.M., Finke, T., Krämer, M., Morandini, A., Mück, A., Oleksiyuk, I., Plehn, T.: What’s anomalous in LHC jets? SciPost Physics 15(4), 168 (2023)
  • [8] Cazenavette, G., Wang, T., Torralba, A., Efros, A.A., Zhu, J.Y.: Dataset distillation by matching training trajectories. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4750–4759 (2022)
  • [9] Chatterjee, S., Sánchez Cruz, S., Schöfbeck, R., Schwarz, D.: Rotation-equivariant graph neural network for learning hadronic SMEFT effects. Physical Review D 109(7), 076012 (2024)
  • [10] Crochet, P., Braun-Munzinger, P.: Investigation of background subtraction techniques for high mass dilepton physics. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 484(1-3), 564–572 (2002)
  • [11] Cui, J., Wang, R., Si, S., Hsieh, C.J.: Scaling up dataset distillation to imagenet-1k with constant memory. In: International Conference on Machine Learning. pp. 6565–6590. PMLR (2023)
  • [12] Dai, Z., Gifford, D.K.: Training data attribution for diffusion models (2023)
  • [13] Deng, J., Li, T.W., Zhang, S., Ma, J.: Efficient ensembles improve training data attribution (2024)
  • [14] DeZoort, G., Battaglia, P.W., Biscarat, C., Vlimant, J.R.: Graph neural networks at the large hadron collider. Nature Reviews Physics 5(5), 281–303 (2023)
  • [15] Dillon, B.M., Favaro, L., Plehn, T., Sorrenson, P., Krämer, M.: A normalized autoencoder for LHC triggers. SciPost Physics Core 6(4), 074 (2023)
  • [16] Duarte, J., Vlimant, J.R.: Graph Neural Networks for Particle Tracking and Reconstruction, chap. 12, pp. 387–436. https://www.worldscientific.com/doi/abs/10.1142/9789811234033_0012
  • [17] Elabd, A., et al.: Graph neural networks for charged particle tracking on FPGAs. Frontiers in Big Data 5 (Mar 2022), http://dx.doi.org/10.3389/fdata.2022.828666
  • [18] Feldman, V., Zhang, C.: What neural networks memorize and why: Discovering the long tail via influence estimation (2020)
  • [19] Georgiev, K., Vendrow, J., Salman, H., Park, S.M., Madry, A.: The journey, not the destination: How data guides diffusion models. arXiv preprint arXiv:2312.06205 (2023)
  • [20] Guest, D., Cranmer, K., Whiteson, D.: Deep learning and its application to LHC physics. Annual Review of Nuclear and Particle Science 68, 161–181 (2018)
  • [21] Guu, K., Webson, A., Pavlick, E., Dixon, L., Tenney, I., Bolukbasi, T.: Simfluence: Modeling the influence of individual training examples by simulating training runs. arXiv preprint arXiv:2303.08114 (2023)
  • [22] Hammoudeh, Z., Lowd, D.: Training data influence analysis and estimation: A survey. Machine Learning 113(5), 2351–2403 (2024)
  • [23] Ilyas, A., Park, S.M., Engstrom, L., Leclerc, G., Madry, A.: Datamodels: Predicting predictions from training data. arXiv preprint arXiv:2202.00622 (2022)
  • [24] Ju, X., Farrell, S., Calafiura, P., Murnane, D., Prabhat, Gray, L., Klijnsma, T., Pedro, K., Cerati, G., Kowalkowski, J., Perdue, G., Spentzouris, P., Tran, N., Vlimant, J.R., Zlokapa, A., Pata, J., Spiropulu, M., An, S., Aurisano, A., Hewes, V., Tsaris, A., Terao, K., Usher, T.: Graph neural networks for particle reconstruction in high energy physics detectors (2020)
  • [25] K, K., Søgaard, A.: Revisiting methods for finding influential examples (2021)
  • [26] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [27] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)
  • [28] Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions (2020)
  • [29] Kwon, Y., Wu, E., Wu, K., Zou, J.: Datainf: Efficiently estimating data influence in lora-tuned llms and diffusion models. arXiv preprint arXiv:2310.00902 (2023)
  • [30] Loo, N., Hasani, R., Amini, A., Rus, D.: Efficient dataset distillation using random feature approximation. Advances in Neural Information Processing Systems 35, 13877–13891 (2022)
  • [31] Martínez, J.A., Cerri, O., Spiropulu, M., Vlimant, J., Pierini, M.: Pileup mitigation at the large hadron collider with graph neural networks. The European Physical Journal Plus 134(7), 333 (2019)
  • [32] Nohyun, K., Choi, H., Chung, H.W.: Data valuation without training of a model. In: The Eleventh International Conference on Learning Representations (2022)
  • [33] Park, S.M., Georgiev, K., Ilyas, A., Leclerc, G., Madry, A.: Trak: Attributing model behavior at scale. arXiv preprint arXiv:2303.14186 (2023)
  • [34] Pruthi, G., Liu, F., Kale, S., Sundararajan, M.: Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems 33, 19920–19930 (2020)
  • [35] Sachdeva, N., McAuley, J.: Data distillation: A survey (2023)
  • [36] Staszewski, R., Chwastowski, J.: Transport simulation and diffractive event reconstruction at the LHC. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 609(2-3), 136–141 (2009)
  • [37] Thais, S., Calafiura, P., Chachamis, G., DeZoort, G., Duarte, J., Ganguly, S., Kagan, M., Murnane, D., Neubauer, M.S., Terao, K.: Graph neural networks in particle physics: Implementations, innovations, and challenges (2022)
  • [38] Thimonier, H., Popineau, F., Rimmel, A., Doan, B.L., Daniel, F.: Tracinad: Measuring influence for anomaly detection. In: 2022 International Joint Conference on Neural Networks (IJCNN). pp. 1–6. IEEE (2022)
  • [39] Wang, K., Zhao, B., Peng, X., Zhu, Z., Yang, S., Wang, S., Huang, G., Bilen, H., Wang, X., You, Y.: Cafe: Learning to condense dataset by aligning features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12196–12205 (2022)
  • [40] Wang, S.Y., Efros, A.A., Zhu, J.Y., Zhang, R.: Evaluating data attribution for text-to-image models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7192–7203 (2023)
  • [41] Wang, T., Zhu, J.Y., Torralba, A., Efros, A.A.: Dataset distillation. arXiv preprint arXiv:1811.10959 (2018)
  • [42] Weisberg, S., Cook, R.D.: Residuals and influence in regression (1982)
  • [43] Woźniak, K.A., Belis, V., Puljak, E., Barkoutsos, P., Dissertori, G., Grossi, M., Pierini, M., Reiter, F., Tavernelli, I., Vallecorsa, S.: Quantum anomaly detection in the latent space of proton collision events at the LHC. arXiv preprint arXiv:2301.10780 (2023)
  • [44] Xie, T., Li, H., Bai, A., Hsieh, C.J.: Data attribution for diffusion models: Timestep-induced bias in influence estimation. arXiv preprint arXiv:2401.09031 (2024)
  • [45] Ye, J., Yu, R., Liu, S., Wang, X.: Distilled datamodel with reverse gradient matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11954–11963 (2024)
  • [46] Ypsilantis, T., Séguinot, J.: Particle identification for LHC-B: a dedicated collider B experiment at the LHC. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 368(1), 229–233 (1995)
  • [47] Zhao, B., Bilen, H.: Dataset condensation with differentiable siamese augmentation. In: International Conference on Machine Learning. pp. 12674–12685. PMLR (2021)
  • [48] Zhao, B., Bilen, H.: Dataset condensation with distribution matching. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6514–6523 (2023)
  • [49] Zhao, B., Mopuri, K.R., Bilen, H.: Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929 (2020)
  • [50] Zheng, X., Pang, T., Du, C., Jiang, J., Lin, M.: Intriguing properties of data attribution on diffusion models. arXiv preprint arXiv:2311.00500 (2023)