2 Department of Physics, “Sapienza” University of Rome, Piazzale A. Moro 5, 00185, Rome, Italy.
3 INFN Sezione di Roma, Piazzale Aldo Moro, 5, Rome, 00185, Italy, UE.
4 Department of Physics, University of Liverpool, Oxford Street Liverpool, L69 7ZE, United Kingdom.
Email: {alessio.verdone; alessio.devoto; stefano.giagu; simone.scardapane; massimo.panella}@uniroma1.it; {cristiano.sebastiani; joseph.carmignani; Monica.D’Onofrio}@cern.ch
Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques
Abstract
The experiments at the Large Hadron Collider at CERN generate vast amounts of complex data from high-energy particle collisions. This data presents significant challenges due to its volume and complex reconstruction, necessitating the use of advanced analysis techniques. Recent advancements in deep learning, particularly Graph Neural Networks, have shown promising results in addressing these challenges but remain computationally expensive. The study presented in this paper uses a simulated particle collision dataset to integrate influence analysis into the graph classification pipeline, aiming to improve the accuracy and efficiency of collision event prediction tasks. After an initial training stage with a Graph Neural Network, we apply a gradient-based data influence method to identify influential training samples and then refine the dataset by removing non-contributory elements: the model trained on this reduced dataset achieves good performance at a reduced computational cost. The methodology is agnostic to the specific influence method: different influence modalities can be easily integrated. Moreover, by analyzing the discarded elements we can provide further insights into the event classification task. The novelty of integrating data attribution techniques with Graph Neural Networks in high-energy physics tasks can offer a robust solution for managing large-scale data problems, capturing critical patterns, and maximizing accuracy across several data-intensive domains.
Keywords:
Graph Neural Networks, High-energy physics, Data Attribution method
1 Introduction
The Large Hadron Collider (LHC) at CERN provides high-energy particle beams for experiments like ATLAS [1], which generate vast amounts of data from collisions. These collisions produce a vast array of particles that are detected by sophisticated experimental apparatus. The data collected are of extreme importance for understanding the fundamental nature of matter and the universe. However, the sheer scale and complexity of the data pose significant challenges for efficient and accurate analysis [6]. For example, the output from ATLAS event reconstruction can generate a data stream of more than 3.5 terabytes per second [3]: this enormous amount of data is then processed and analyzed by teams of scientists and researchers who use a variety of techniques and algorithms to extract meaningful information from the data. The complexity of the data is further exacerbated by the presence of missing values, outliers, and noisy data points, which can lead to inaccurate and biased results [15, 7, 43]. The analysis of the data requires a deep understanding of the underlying physics and the ability to identify patterns and relationships that may not be immediately apparent, hence necessitating a high degree of expertise and specialized knowledge. In recent years, machine learning and deep learning techniques have shown very promising results in addressing the challenges posed by the LHC data [20]. These methods have been successfully applied to a range of tasks, including particle identification [46], event reconstruction [36], and background subtraction [10]. However, the complexity and scale of the data require the development of more sophisticated and scalable methods that can effectively handle the data volume generated by the LHC experiments. 
By representing the data as a graph, where particles and their interactions are denoted by nodes and edges respectively, Graph Neural Networks (GNNs) can learn high-level representations of the data that capture the complex relationships between particles and their correlations [37, 24]. This enables more accurate and efficient analyses, as well as the ability to identify patterns and relationships that may not be apparent through traditional methods. Although GNNs, like any deep learning model, are well suited to the analysis of large datasets, their computational cost becomes high and possibly prohibitive when the dataset is excessively large. Data attribution methods have emerged as crucial tools in machine learning and data analysis scenarios [12, 22, 32, 25] to address these challenges by offering insights into the inner workings of complex models and shedding light on the factors that drive the predictions at the sample level. These methods provide a means to understand the importance (or influence) of data points in driving the output of a model, thereby enhancing interpretability and trustworthiness. Data attribution methods trace a model's behavior back to its training dataset, offering an effective approach to better understanding “black-box” neural networks. Several methods for detecting the influence of training data have been proposed in recent years, such as TRAK [33], Simfluence [21], and Datamodels [23]. One of the most important methods is TracIn [34], which uses loss gradients to generate influence scores between training and test samples. It can also be used to generate influence scores between the elements of the training set, which is a useful feature for discovering anomalies inside the training dataset. Recently, several works have successfully applied influence analysis to large-scale generative AI tasks, such as diffusion models and large language models (LLMs) [44, 29, 40, 50, 19].
The considered scenarios are all characterized by a huge amount of data. For example, in image classification tasks influence methods can identify the most confounding images in the training set (e.g., multiple classes of objects inside a single image) or reveal wrong labels inside the dataset. Removing harmful or redundant elements enhances overall performance, improving classification metrics and reducing computational costs.
To the best of our knowledge, existing literature lacks methods that effectively combine training data attribution techniques with graph data and graph neural networks. Furthermore, the domain of high-energy physics provides a robust testing ground for evaluating our approach to tackling complex real-world problems. Carefully selecting elements for the training set not only facilitates the management of large-scale data, which is a common challenge in physics and numerous other fields but also enables the capture of crucial patterns and relationships within the data. This approach maximizes the predictive accuracy and generalization capability across diverse domains and applications.
In this study, we integrated efficient influence analysis into the graph classification process to enhance the accuracy and efficiency of predictions regarding the collision events of high-energy particles. The pipeline can be summarized in three steps. Initially, a GNN model is used in the first training stage to classify collision event types. Then, using training checkpoints, TracIn identifies the influence relationships between samples, allowing us to discover which training samples contribute positively or negatively to the collision classification task. Finally, we remove or replace the training dataset elements that do not improve the task and train a GNN on the selected, reduced dataset for the same task. Our experiments were conducted using a large dataset of simulated particle collisions. This approach also enables us to perform an explainability analysis of the problem. By understanding the characteristics of the discarded elements, as well as those identified as significant for the problem, we gain insight into both the prediction model and the subsequent downstream task. Our contributions can be summarized as follows:
- Integration of Influence Analysis in Classification: We incorporated TracIn, a data attribution method, into the classification process of particle collision events using GNNs. This allowed us to identify and refine the training dataset by removing non-contributory samples, enhancing the efficiency and accuracy of the classification task.
- Improved Performance and Reduced Computational Costs: By refining the dataset and focusing on significant training samples, we improved overall classification performance and reduced computational costs. This approach led to better utilization of resources and more accurate predictions.
- Enhanced Explainability and Insights: Our method provides a detailed explainability analysis, offering insights into the characteristics of influential data elements. This not only helps in understanding the prediction model better but also aids in managing large-scale data by capturing critical patterns and relationships within the data.
2 Related works
2.1 Graph Neural Networks
GNNs are a type of neural network designed to work with graph-structured data, able to learn representations that capture complex relationships between nodes. They can be easily adapted to different graph tasks, such as node or graph classification and regression, edge prediction, graph generation, or node clustering. They have found several applications in real scenarios, such as community detection, social network analysis, molecular property prediction, and knowledge graph generation; moreover, they have found wide application in particle physics [37] and high-energy physics (HEP) (e.g., particle tracking and reconstruction [16, 24]). The physics tasks of the LHC present many potential applications where graph neural networks have been successfully applied [14]. Elabd et al. [17] employed a GNN to determine charged particle trajectories in collisions. Martínez et al. [31] tackled the pileup mitigation problem, i.e., the presence of parasitic low-transverse-momentum collisions, by employing a three-layer Gated Graph Neural Network with residual connections. More recently, Chatterjee et al. [9] proposed a GNN that is equivariant with respect to rotations around the jet axis to extract novel phenomena in the standard model effective field theory (SMEFT) context from LHC collision data.
2.2 Data attribution
Training Data Attribution (TDA) methods aim to quantify the influence or importance of individual training data points on the predictions made by a machine learning model. Influence estimation approaches can be divided into two main classes: retraining-based and gradient-based [22]. Retraining-based methods assess the influence of training data by repeatedly retraining the model using different subsets of the training set, while gradient-based influence estimators determine influence by analyzing the alignment of training and test instance gradients, either throughout the training process or at its conclusion. Retraining-based methods include leave-one-out (LOO) [42], the simplest but most computationally expensive approach, and downsampling [18]. Gradient-based methods are of greater interest here: they typically provide closed-form TDA scores by employing gradients in an efficient and scalable way. Koh and Liang [28] presented one of the first works in this field, approximating the true influence of a training point by means of the gradients of the loss function. TracIn [34] traces loss changes on test points during the training process, while TRAK [33] uses the neural tangent kernel with random projection to assess influence. These gradient-based methods have significantly lower computational costs than retraining-based methods. However, they typically rely on a first-order approximation of the loss, which can lead to performance degradation on neural networks [5, 4] and make them more sensitive to the randomness associated with model weight initialization and training mechanisms [25]. Recent work in TDA has demonstrated that ensembling improves the scores of gradient-based methods and mitigates these issues [13, 12].
Ensembling usually involves applying the TDA method to many independently trained models (e.g., averaging the final TDA scores or aggregating some intermediate terms for score calculation). Despite their effectiveness, these ensembling methods require a substantial number of models to perform well, which incurs a significant computational cost.
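To make the two ideas above concrete, the following minimal sketch computes a first-order, gradient-based influence score between one training and one test sample, and then averages it over several independently trained models. It is only an illustration under stated assumptions: the model is a plain logistic regression (not a GNN), and the helper names `logistic_grad`, `gradient_influence`, and `ensembled_influence` are hypothetical.

```python
import math


def logistic_grad(w, x, y):
    """Gradient of the binary cross-entropy loss w.r.t. the weights w for one sample (x, y)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]


def gradient_influence(w, train, test):
    """First-order influence: dot product of the training and test loss gradients."""
    g_train = logistic_grad(w, *train)
    g_test = logistic_grad(w, *test)
    return sum(a * b for a, b in zip(g_train, g_test))


def ensembled_influence(models, train, test):
    """Average the score over independently trained models (simple score averaging)."""
    return sum(gradient_influence(w, train, test) for w in models) / len(models)


# Toy usage with made-up weights standing in for two independently trained models.
models = [[0.0, 0.0], [0.1, -0.2]]
train_sample = ([1.0, 2.0], 1)
test_sample = ([1.0, 2.0], 1)
score = ensembled_influence(models, train_sample, test_sample)
```

A positive score means the two gradients are aligned, so a gradient step on the training sample would also reduce the loss on the test sample; a negative score indicates the opposite.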
2.3 Data distillation
Data distillation refers to the process of carefully choosing which data points to include in the training set for a deep learning model, as the quality and distribution of the training data can significantly impact the model’s performance and the computational resources needed [35]. It involves summarizing or compressing a large dataset into a smaller, more manageable subset while retaining the most essential information needed for training models. This process aims to maintain the performance of models trained on the distilled data, ensuring that they perform similarly to models trained on the full dataset. Data distillation methods can be categorized into four main types. Meta-model matching [41, 30] optimizes the transferability of models trained on distilled data to the original dataset. Gradient matching [47, 49] aligns the gradients of training and distilled datasets to ensure similar model performance. Trajectory matching [11, 8] aims to match the training trajectories of models on distilled and full datasets. Distribution matching [48, 39] directly aligns the statistical distributions of the distilled and original datasets. These methods create high-fidelity, compressed datasets that retain essential information for effective machine learning model training and inference. Influence functions have not yet been used for direct dataset distillation, but the two have been combined in some related tasks. Ye et al. [45] used a distilled dataset with a reverse gradient matching technique to approximate the computation of influence values on a smaller dataset, achieving promising results. Pruthi et al. [34] show the effectiveness of TracIn by identifying mislabeled data and filtering them out of the dataset.
Particle | F1 | F2 | F3 | F4 | F5 | F6 |
---|---|---|---|---|---|---|
jet1 | - | - | ||||
jet2 | - | - | ||||
jet3 | - | - | ||||
b1 | - | |||||
b2 | - | |||||
lepton | - | - | - | |||
energy | - | - | - | - |
3 Methodology
Our method integrates TracIn, a widely used data attribution method, with GNNs to enhance event classification tasks in ATLAS analyses, improving performance and interpretability. We developed a three-step method: first, we train a GNN model to classify collision event types; then, we use TracIn to compute influence scores for the training samples and remove the training elements that do not positively contribute to the classification task; finally, we re-train the model on the selected subset. This improves classification metrics and reduces computational costs.
3.1 Problem formulation
The ‘SUSY dataset’ [2] contains Monte Carlo simulated collision events recorded with the ATLAS experiment, representing signals over a large background with observable kinematic features. Two types of events are considered:
- Signal: SUSY Dark Matter Monte Carlo candidate events.
- Background: SM backgrounds from single top and top-antitop processes.
An example of a signal event is presented in Fig. 1. The main task involves recognizing rare signals over large backgrounds from Standard Model processes. To recognize them, we rely on kinematic features that offer discriminating power. A particle collision event can be represented as a graph $G = (V, E)$: the GNN takes $G$ as input and outputs its probabilities over the classes 0 (background) and 1 (signal). The collision events are represented as fully connected graphs with 6 or 7 nodes, $|V| \in \{6, 7\}$, and a maximum of 6 features per node, i.e., $X \in \mathbb{R}^{|V| \times 6}$. The particle features [2] are: the transverse momentum $p_T$, the angular variables $\eta$ and $\phi$, the missing transverse momentum $E_T^{\mathrm{miss}}$, the mass $m$, and the jet flavor probability. Since the graph is fully connected, we can define its adjacency matrix $A \in \{0, 1\}^{|V| \times |V|}$. Then, $G$ can be alternatively expressed as $G = (X, A)$. A graph representation and a table of the features employed for each node are presented in Fig. 4 and Tab. 4.
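The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `build_event_graph` is hypothetical, the feature values are made up, zero-padding of unavailable features is an assumption, and so is the absence of self-loops in the adjacency matrix.

```python
import numpy as np

NUM_FEATURES = 6  # maximum number of features per node, as stated in the text


def build_event_graph(node_features):
    """Build the feature matrix X and adjacency matrix A for one collision event.

    node_features: list of per-particle feature lists, each of length <= 6.
    Features that are not available for a particle are zero-padded (assumption).
    """
    n = len(node_features)
    X = np.zeros((n, NUM_FEATURES), dtype=float)
    for i, feats in enumerate(node_features):
        X[i, : len(feats)] = feats
    # Fully connected graph; leaving out self-loops is an assumption of this sketch.
    A = np.ones((n, n), dtype=float) - np.eye(n)
    return X, A


# Hypothetical event with 6 particles and made-up feature values.
event = [
    [50.0, 0.1, 1.2, 8.0],
    [40.0, -0.5, 2.0, 6.0],
    [30.0, 1.1, -1.0, 5.0],
    [60.0, 0.3, 0.4, 9.0, 0.9],
    [45.0, -1.2, 2.8, 7.0, 0.8],
    [25.0, 0.7, -2.1],
]
X, A = build_event_graph(event)
```

With 6 nodes, the resulting adjacency matrix has 30 non-zero entries, one for each ordered pair of distinct particles.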
3.2 Framework’s workflow
Preliminary training.
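The GNN baseline for our experiments consists of a sequence of two Graph Convolutional operators ($\mathrm{GCN}$) [27], a global mean pooling operator ($\mathrm{Pool}$), and a final linear layer ($\mathrm{Lin}$); we used ReLU ($\sigma$) as the non-linear activation. Denoting by $X$ the node feature matrix and by $A$ the adjacency matrix of the input graph, the model can be expressed as:
$$\hat{y} = \mathrm{Lin}\big(\mathrm{Pool}\big(\mathrm{GCN}(\sigma(\mathrm{GCN}(X, A)), A)\big)\big) \qquad (1)$$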
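As an illustration of the baseline described above (two graph convolutions, ReLU, global mean pooling, and a final linear layer), the following NumPy sketch implements a forward pass for a single graph. It is a sketch under assumptions rather than the authors' implementation: the hidden size of 16, the placement of ReLU only between the two convolutions, the final softmax, and the names `gcn_layer` and `forward` are all hypothetical choices; the convolution follows the normalized propagation rule of Kipf and Welling [27].

```python
import numpy as np


def gcn_layer(X, A, W):
    """Graph convolution: D^{-1/2} (A + I) D^{-1/2} X W."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def forward(X, A, params):
    """Return class probabilities (background, signal) for one event graph."""
    H = relu(gcn_layer(X, A, params["W1"]))  # first convolution + ReLU
    H = gcn_layer(H, A, params["W2"])        # second convolution
    h = H.mean(axis=0)                       # global mean pooling over nodes
    logits = h @ params["W_out"] + params["b_out"]
    return softmax(logits)


# Toy usage with random weights; hidden size 16 is an arbitrary choice.
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(6, 16)),
    "W2": rng.normal(size=(16, 16)),
    "W_out": rng.normal(size=(16, 2)),
    "b_out": np.zeros(2),
}
X_demo = rng.normal(size=(7, 6))
A_demo = np.ones((7, 7)) - np.eye(7)
probs = forward(X_demo, A_demo, params)
```

In practice a graph learning library would be used for batching and training; the point here is only to show how the operators compose.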
Once the model is defined, the first step of our workflow consists of training it on the original full training set or on a randomly selected subset of it: we call these approaches GNN-FT and GNN-RST, respectively. This step is essential since it allows us to collect the training checkpoints that are later used by TracIn to generate influence scores. Moreover, the experimental results obtained from both GNN-FT and GNN-RST serve as baselines for comparison with our method.
Influence-based training.
We employ the TracIn [34] method to estimate training data influence scores. It assigns an influence score to each training sample to quantify its impact on the model. It generates influence scores via a scalable and efficient implementation: it replaces the exact computation of the influence values with a first-order gradient approximation to reduce the computational cost, and it uses checkpoints to efficiently replay the training process. Finally, we compute the loss gradients with respect to the final layer only, i.e., the last linear layer. All these characteristics make TracIn a suitable candidate for data-intensive scenarios. For each training sample, we compute its influence on itself: these values are called self-influence (SI) scores. Denoting the loss function by $\ell$, the $k$ available checkpoints by $w_{t_1}, \dots, w_{t_k}$, the learning rate at checkpoint $i$ by $\eta_i$, the trainable weights of the $l$-th layer by $w^{(l)}$, and a training sample by $z$, the self-influence score can be computed as follows:
$$\mathrm{SI}(z) = \sum_{i=1}^{k} \eta_i \, \nabla_{w^{(l)}} \ell(w_{t_i}, z) \cdot \nabla_{w^{(l)}} \ell(w_{t_i}, z) \qquad (2)$$
It traces how a training point influences its own prediction: high self-influence scores correspond to the most divergent samples, such as potential outliers, mislabeled data, or, more generally, samples with contrasting behavior. Self-influence scores have previously been used to find mislabeled or confounding images and for unsupervised anomaly detection [34, 38]. The main idea of our method is that, by removing samples that are harmful, superfluous, or counterproductive from the point of view of the model and the task, we can increase both model accuracy and computational efficiency. Once we have a self-influence score for each training sample, we filter out those with the highest values, and the remaining samples constitute the final training set. In this way, training elements that do not positively contribute to the classification task are removed, and the final dataset helps to increase classification metrics and reduce computational costs. In the final step, we train the GNN baseline on the influence-based reduced training set: we call this approach GNN-IRST.
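The self-influence filtering step can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: it restricts gradients to a last linear layer trained with softmax cross-entropy, for which the squared gradient norm has the closed form $\|p - y\|^2 (\|h\|^2 + 1)$ for a pooled embedding $h$, predicted probabilities $p$, and one-hot label $y$. The checkpoint dictionary layout and the function names `self_influence` and `keep_low_self_influence` are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def self_influence(checkpoints, labels, num_classes=2):
    """TracIn-style self-influence restricted to the last linear layer.

    checkpoints: list of dicts with keys
        "H"  : (N, d) pooled graph embeddings at that checkpoint,
        "W"  : (d, C) last-layer weights, "b": (C,) last-layer bias,
        "lr" : learning rate in use at that checkpoint.
    labels: (N,) integer class labels.
    """
    Y = np.eye(num_classes)[labels]
    scores = np.zeros(len(labels))
    for ckpt in checkpoints:
        P = softmax(ckpt["H"] @ ckpt["W"] + ckpt["b"])
        err = P - Y  # gradient of the cross-entropy loss w.r.t. the logits
        # ||grad_W||^2 + ||grad_b||^2 = ||err||^2 * (||h||^2 + 1)
        grad_sq = (err ** 2).sum(axis=1) * ((ckpt["H"] ** 2).sum(axis=1) + 1.0)
        scores += ckpt["lr"] * grad_sq
    return scores


def keep_low_self_influence(scores, keep_fraction):
    """Return the sorted indices of the keep_fraction samples with the lowest scores."""
    n_keep = int(round(keep_fraction * len(scores)))
    return np.sort(np.argsort(scores)[:n_keep])
```

For example, calling `keep_low_self_influence(scores, 0.8)` discards the 20% of samples with the highest self-influence, which matches the 80% influence-based selection used in the tables.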
Method | % Train Samples | % Influence Samples | % Total Train Samples | Accuracy | F1-score | Precision | Recall | AUROC |
---|---|---|---|---|---|---|---|---|
GNN-FT | 100 | – | 100 | 74.32 ± 0.7 | 72.96 ± 1.6 | 77.27 ± 3.4 | 69.65 ± 5.8 | 82.92 ± 0.7 |
GNN-RST | 80 | – | 80 | 73.57 ± 1.2 | 72.58 ± 2.2 | 75.96 ± 4.8 | 70.47 ± 8.4 | 82.67 ± 0.9 |
GNN-IRST (Ours) | 80 | 80 | 64 | 74.67 ± 1.0 | 74.18 ± 1.7 | 74.99 ± 3.7 | 74.20 ± 7.8 | 82.33 ± 0.9 |
GNN-IRST (Ours) | 90 | 80 | 72 | 74.76 ± 0.7 | 73.59 ± 2.0 | 77.43 ± 3.5 | 70.72 ± 6.4 | 82.95 ± 0.8 |
4 Experimental setup
We performed several experiments to validate the efficiency of our proposed methodology. In the following, we describe the experimental setup and present the numerical and visual results.
4.1 Experiments’ Parameters
Our experiments included the exploration of various parameters, each contributing uniquely to the refinement of our methodology. The metrics used for the method’s evaluation were Precision, Recall, F1-score, Accuracy, and AUROC. We utilized the Adam [26] optimizer with a learning rate of and a weight decay of . Numerical results are presented as the mean and standard deviation from 4 executions with different random seeds. The experiments ran for a maximum of 300 epochs, with 16,000 samples.
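For reference, the evaluation metrics listed above and the mean ± standard deviation reporting can be sketched in plain Python. This is an illustrative sketch, not the evaluation code used in the experiments: the function names are hypothetical, class 1 (signal) is assumed to be the positive class, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score, with class 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(runs):
    """Mean and sample standard deviation of each metric across runs with different seeds."""
    return {k: (mean(r[k] for r in runs), stdev(r[k] for r in runs)) for k in runs[0]}
```

The pairwise AUROC computation is quadratic in the number of samples and is meant only to make the definition explicit; a rank-based implementation would be preferable at scale.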
We evaluated the performance using different subset percentages of the training set: 95%, 90%, 80%, 50%, 20%, 10%, and 5%. Additionally, we examined influence-based subsets with percentages of 100%, 80%, 50%, and 20%. This comprehensive analysis allowed us to assess how training set composition and influence-based selection impact the GNN model’s performance, providing insights into the efficiency and effectiveness of our method. We tested three different setups:
- GNN-FT: GNN model trained on the full training set.
- GNN-RST: GNN model trained on a randomly selected subset of the original training set.
- GNN-IRST (Ours): GNN model trained on the influence-based selection of a random subset of the training set.
Testing and metrics comparisons are conducted relative to the original test set size, serving as the benchmark for evaluating performance and assessing the effectiveness of various methods.
Method | % Train Samples | % Influence Samples | % Original Train Set | Accuracy | F1-score | Precision | Recall | AUROC |
---|---|---|---|---|---|---|---|---|
GNN-FT | 100 | – | 100 | 74.32 ± 0.7 | 72.96 ± 1.6 | 77.27 ± 3.4 | 69.65 ± 5.8 | 82.92 ± 0.7 |
GNN-RST | 95 | – | 95 | 73.40 ± 1.5 | 71.75 ± 1.7 | 76.93 ± 3.2 | 67.40 ± 3.9 | 81.98 ± 1.0 |
GNN-IRST | 95 | 80 | 76 | 73.40 ± 1.1 | 73.40 ± 2.2 | 72.27 ± 2.8 | 76.72 ± 6.0 | 81.55 ± 1.6 |
GNN-RST | 90 | – | 90 | 73.60 ± 0.9 | 72.50 ± 2.1 | 76.14 ± 3.7 | 69.89 ± 6.5 | 82.78 ± 1.0 |
GNN-IRST | 90 | 80 | 72 | 74.76 ± 0.7 | 73.59 ± 2.0 | 77.43 ± 3.5 | 70.72 ± 6.4 | 82.95 ± 0.8 |
GNN-RST | 80 | – | 80 | 73.57 ± 1.2 | 72.58 ± 2.2 | 76.00 ± 4.8 | 70.47 ± 8.4 | 82.67 ± 0.9 |
GNN-IRST | 80 | 80 | 64 | 74.67 ± 1.0 | 74.18 ± 1.7 | 74.99 ± 3.7 | 74.20 ± 7.8 | 82.33 ± 0.9 |
GNN-RST | 50 | – | 50 | 72.92 ± 1.1 | 71.25 ± 2.3 | 76.00 ± 2.7 | 67.37 ± 5.5 | 81.07 ± 1.6 |
GNN-IRST | 50 | 80 | 40 | 73.06 ± 1.0 | 74.18 ± 1.6 | 71.28 ± 0.1 | 77.57 ± 4.0 | 81.25 ± 1.7 |
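The percentage columns of the results tables are related by a simple product: the share of the original training set used by GNN-IRST is the random-subset percentage multiplied by the influence-based percentage. A short check, with a hypothetical helper name, is:

```python
def effective_train_percentage(train_pct, influence_pct):
    """Percentage of the original training set left after both selection steps."""
    return train_pct * influence_pct / 100.0


# Random-subset percentages from the table, each combined with an 80% influence-based selection.
for train_pct in (95, 90, 80, 50):
    print(train_pct, effective_train_percentage(train_pct, 80))
```

This gives 76, 72, 64, and 40 percent of the original training set, respectively.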
4.2 Experimental results
The primary numerical results are presented in Tab. 1, where we compare GNN-FT and GNN-RST with our two best-performing models. Across the selected metrics, our method consistently achieves competitive results. Using only 64% of the original training set, we achieve a higher F1-score and Recall than GNN-FT, while with 72% we also achieve better Accuracy, Precision, and AUROC. Moreover, this latter setup outperforms GNN-RST on all metrics while using 8 percentage points less training data. In Tab. 2 we provide a detailed comparison of GNN-IRST with GNN-FT and GNN-RST at different training subset percentages. For almost all metrics and setups, the influence-based selection mechanism achieves better results than the random one: this shows how a careful choice of training data can increase the overall performance of the method. The evolution of the AUROC and F1-score metrics at different data percentages is shown in Fig. 5 and Fig. 6, respectively. In Fig. 5 we observe a slight improvement in AUROC while employing 20% fewer training samples than the random subset selection. In Fig. 6a and Fig. 6b we instead observe better performance for almost all subset percentage values; moreover, here we achieve better F1-score results than the full training set setup at many different subset percentage values.
5 Conclusions
Utilizing data attribution methods for data selection proves to be an efficient strategy for enhancing the performance of GNN models in scenarios characterized by large amounts of data. By selecting a smaller subset of the original dataset based on self-influence values, our method can achieve competitive or superior results compared to approaches that use the full training set or a randomly reduced version of it. Carefully selecting training data can improve accuracy while reducing the computational and operational costs of data-intensive problems, as shown here on simulated LHC data. This selection also aids in the effective downstream processing of vast amounts of data. Moving forward, our efforts will focus on developing a custom architecture able to manage and leverage the full potential of influence information, and on exploring and implementing other data influence methods: these techniques not only extend the traditional learning framework but also enhance the interpretability of the inner workings of these networks. This property is particularly valuable in intricate and critical scenarios like high-energy physics.
Acknowledgment
The work is partly funded by the European Union’s CHIST-ERA programme under grant agreement CHIST-ERA-19-XAI-009 (MUCCA). AV is supported in part by the Italian Ministry of University and Research (MUR), which funded his PhD as per the Ministerial Decree no. 1061/2021. SG is supported by PNRR MUR project PE0000013-FAIR. SS is partly funded by Sapienza grants RM1221816BD028D6 (DESMOS) and RG123188B3EF6A80 (CENTS).
References
- [1] ATLAS Collaboration: The ATLAS Experiment at the CERN Large Hadron Collider. JINST 3, S08003 (2008)
- [2] ATLAS Collaboration: Search for direct production of electroweakinos in final states with one lepton, jets and missing transverse momentum in pp collisions at $\sqrt{s}$ = 13 TeV with the ATLAS detector. JHEP 12, 167 (2023)
- [3] ATLAS Collaboration: Software and computing for Run 3 of the ATLAS experiment at the LHC (2024)
- [4] Bae, J., Ng, N., Lo, A., Ghassemi, M., Grosse, R.B.: If influence functions are the answer, then what is the question? Advances in Neural Information Processing Systems 35, 17953–17967 (2022)
- [5] Basu, S., Pope, P., Feizi, S.: Influence functions in deep learning are fragile. arXiv preprint arXiv:2006.14651 (2020)
- [6] Bird, I.: Computing for the Large Hadron Collider. Ann. Rev. Nucl. Part. Sci. 61, 99–118 (2011)
- [7] Buss, T., Dillon, B.M., Finke, T., Krämer, M., Morandini, A., Mück, A., Oleksiyuk, I., Plehn, T.: What’s anomalous in lhc jets? SciPost Physics 15(4), 168 (2023)
- [8] Cazenavette, G., Wang, T., Torralba, A., Efros, A.A., Zhu, J.Y.: Dataset distillation by matching training trajectories. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4750–4759 (2022)
- [9] Chatterjee, S., Sánchez Cruz, S., Schöfbeck, R., Schwarz, D.: Rotation-equivariant graph neural network for learning hadronic smeft effects. Physical Review D 109(7), 076012 (2024)
- [10] Crochet, P., Braun-Munzinger, P.: Investigation of background subtraction techniques for high mass dilepton physics. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 484(1-3), 564–572 (2002)
- [11] Cui, J., Wang, R., Si, S., Hsieh, C.J.: Scaling up dataset distillation to imagenet-1k with constant memory. In: International Conference on Machine Learning. pp. 6565–6590. PMLR (2023)
- [12] Dai, Z., Gifford, D.K.: Training data attribution for diffusion models (2023)
- [13] Deng, J., Li, T.W., Zhang, S., Ma, J.: Efficient ensembles improve training data attribution (2024)
- [14] DeZoort, G., Battaglia, P.W., Biscarat, C., Vlimant, J.R.: Graph neural networks at the large hadron collider. Nature Reviews Physics 5(5), 281–303 (2023)
- [15] Dillon, B.M., Favaro, L., Plehn, T., Sorrenson, P., Krämer, M.: A normalized autoencoder for lhc triggers. SciPost Physics Core 6(4), 074 (2023)
- [16] Duarte, J., Vlimant, J.R.: Graph Neural Networks for Particle Tracking and Reconstruction, chap. 12, pp. 387–436. https://www.worldscientific.com/doi/abs/10.1142/9789811234033_0012
- [17] Elabd, A., et al.: Graph neural networks for charged particle tracking on FPGAs. Frontiers in Big Data 5 (Mar 2022), http://dx.doi.org/10.3389/fdata.2022.828666
- [18] Feldman, V., Zhang, C.: What neural networks memorize and why: Discovering the long tail via influence estimation (2020)
- [19] Georgiev, K., Vendrow, J., Salman, H., Park, S.M., Madry, A.: The journey, not the destination: How data guides diffusion models. arXiv preprint arXiv:2312.06205 (2023)
- [20] Guest, D., Cranmer, K., Whiteson, D.: Deep learning and its application to lhc physics. Annual Review of Nuclear and Particle Science 68, 161–181 (2018)
- [21] Guu, K., Webson, A., Pavlick, E., Dixon, L., Tenney, I., Bolukbasi, T.: Simfluence: Modeling the influence of individual training examples by simulating training runs. arXiv preprint arXiv:2303.08114 (2023)
- [22] Hammoudeh, Z., Lowd, D.: Training data influence analysis and estimation: A survey. Machine Learning 113(5), 2351–2403 (2024)
- [23] Ilyas, A., Park, S.M., Engstrom, L., Leclerc, G., Madry, A.: Datamodels: Predicting predictions from training data. arXiv preprint arXiv:2202.00622 (2022)
- [24] Ju, X., Farrell, S., Calafiura, P., Murnane, D., Prabhat, Gray, L., Klijnsma, T., Pedro, K., Cerati, G., Kowalkowski, J., Perdue, G., Spentzouris, P., Tran, N., Vlimant, J.R., Zlokapa, A., Pata, J., Spiropulu, M., An, S., Aurisano, A., Hewes, V., Tsaris, A., Terao, K., Usher, T.: Graph neural networks for particle reconstruction in high energy physics detectors (2020)
- [25] K, K., Søgaard, A.: Revisiting methods for finding influential examples (2021)
- [26] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [27] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)
- [28] Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions (2020)
- [29] Kwon, Y., Wu, E., Wu, K., Zou, J.: Datainf: Efficiently estimating data influence in lora-tuned llms and diffusion models. arXiv preprint arXiv:2310.00902 (2023)
- [30] Loo, N., Hasani, R., Amini, A., Rus, D.: Efficient dataset distillation using random feature approximation. Advances in Neural Information Processing Systems 35, 13877–13891 (2022)
- [31] Martínez, J.A., Cerri, O., Spiropulu, M., Vlimant, J., Pierini, M.: Pileup mitigation at the large hadron collider with graph neural networks. The European Physical Journal Plus 134(7), 333 (2019)
- [32] Nohyun, K., Choi, H., Chung, H.W.: Data valuation without training of a model. In: The Eleventh International Conference on Learning Representations (2022)
- [33] Park, S.M., Georgiev, K., Ilyas, A., Leclerc, G., Madry, A.: Trak: Attributing model behavior at scale. arXiv preprint arXiv:2303.14186 (2023)
- [34] Pruthi, G., Liu, F., Kale, S., Sundararajan, M.: Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems 33, 19920–19930 (2020)
- [35] Sachdeva, N., McAuley, J.: Data distillation: A survey (2023)
- [36] Staszewski, R., Chwastowski, J.: Transport simulation and diffractive event reconstruction at the lhc. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 609(2-3), 136–141 (2009)
- [37] Thais, S., Calafiura, P., Chachamis, G., DeZoort, G., Duarte, J., Ganguly, S., Kagan, M., Murnane, D., Neubauer, M.S., Terao, K.: Graph neural networks in particle physics: Implementations, innovations, and challenges (2022)
- [38] Thimonier, H., Popineau, F., Rimmel, A., Doan, B.L., Daniel, F.: Tracinad: Measuring influence for anomaly detection. In: 2022 International Joint Conference on Neural Networks (IJCNN). pp. 1–6. IEEE (2022)
- [39] Wang, K., Zhao, B., Peng, X., Zhu, Z., Yang, S., Wang, S., Huang, G., Bilen, H., Wang, X., You, Y.: Cafe: Learning to condense dataset by aligning features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12196–12205 (2022)
- [40] Wang, S.Y., Efros, A.A., Zhu, J.Y., Zhang, R.: Evaluating data attribution for text-to-image models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7192–7203 (2023)
- [41] Wang, T., Zhu, J.Y., Torralba, A., Efros, A.A.: Dataset distillation. arXiv preprint arXiv:1811.10959 (2018)
- [42] Weisberg, S., Cook, R.D.: Residuals and influence in regression (1982)
- [43] Woźniak, K.A., Belis, V., Puljak, E., Barkoutsos, P., Dissertori, G., Grossi, M., Pierini, M., Reiter, F., Tavernelli, I., Vallecorsa, S.: Quantum anomaly detection in the latent space of proton collision events at the lhc. arXiv preprint arXiv:2301.10780 (2023)
- [44] Xie, T., Li, H., Bai, A., Hsieh, C.J.: Data attribution for diffusion models: Timestep-induced bias in influence estimation. arXiv preprint arXiv:2401.09031 (2024)
- [45] Ye, J., Yu, R., Liu, S., Wang, X.: Distilled datamodel with reverse gradient matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11954–11963 (2024)
- [46] Ypsilantis, T., Séguinot, J.: Particle identification for lhc-b: a dedicated collider b experiment at the lhc. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 368(1), 229–233 (1995)
- [47] Zhao, B., Bilen, H.: Dataset condensation with differentiable siamese augmentation. In: International Conference on Machine Learning. pp. 12674–12685. PMLR (2021)
- [48] Zhao, B., Bilen, H.: Dataset condensation with distribution matching. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6514–6523 (2023)
- [49] Zhao, B., Mopuri, K.R., Bilen, H.: Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929 (2020)
- [50] Zheng, X., Pang, T., Du, C., Jiang, J., Lin, M.: Intriguing properties of data attribution on diffusion models. arXiv preprint arXiv:2311.00500 (2023)