Strategies for Machine Learning Applied to Noisy HEP Datasets: Modular Solid State Detectors from SuperCDMS
Abstract
Background reduction in the SuperCDMS dark matter experiment depends on removing surface events within individual detectors by identifying the location of each incident particle interaction. Position reconstruction is achieved by combining pulse shape information over multiple phonon channels, a task well-suited to machine learning techniques. Data from an Am-241 scan of a SuperCDMS SNOLAB detector was used to study a selection of statistical approaches, including linear regression, artificial neural networks, and symbolic regression. Our results showed that simpler linear regression models were better able than artificial neural networks to generalize on such a noisy and minimal data set, but there are indications that certain architectures and training configurations can counter overfitting tendencies. This study will be repeated on a more complete SuperCDMS data set (in progress) to explore the interplay between data quality and the application of neural networks.
1 Introduction
The SuperCDMS experiment [1] is a direct dark matter search performed with modular
cryogenic solid-state detectors. Each detector is an ultrapure disk of germanium or silicon, roughly the size of a hockey puck. The detectors are stacked into towers of 6 each, and the towers are operated at cryogenic temperatures within a shielded cryostat in an underground lab. Interactions with incident particles release ionization charges and athermal phonons, which can be collected on the top and bottom faces of each detector by thousands of sensors organized into multiple channels. Signal partition among the channels and the individual pulse shapes themselves provide information on the location of the
interaction.
The internal physics of the phonon and charge transport within the crystal is poorly
understood, making it difficult to model the resulting pulse shapes and shared channel
behavior from first principles. Therefore, it is useful to explore machine learning (ML) techniques, which can search broadly over a complex parameter space to reveal correlations and identify the most salient features. We report on the first study using ML to create position reconstruction algorithms in a prototype SuperCDMS detector illuminated by an in-situ movable radioactive source.
This project is part of the FAIR4HEP initiative¹, which uses high-energy physics as a science driver for the development of community-wide FAIR (Findable, Accessible, Interoperable, and Reusable) frameworks [2] to advance our understanding of AI and provide new insights to apply AI techniques. Thus, while this paper presents our own exploration of the efficacy of AI as applied to a small and inherently noisy data set, the data is accessible [3], the results are preserved in a reproducible fashion in line with recent adaptation of FAIR principles for AI applications [4, 5], and researchers are encouraged to expand on our study². A second data set with improved area coverage and higher statistics is in preparation and, in comparison with the current data set, will provide important insights into the limitations and strategies required when ML techniques are applied to increasingly complex data sets.
Figure 1: The SuperCDMS detectors come in two sensor arrangements: Interleaved Z-
sensitive Ionization & Phonon (iZIP) and High Voltage (HV). The arrangement of the 12
channels formed by grouping together individual transition edge sensors (TES) is shown
above photographs of the detectors.
drives the channel design, with two outer rings for improved fiducialization. However, a
simple veto is not possible since, as explained in the following section, the formation and
readout of phonon pulses takes hundreds of microseconds and intersects all the channels.
Figure 2: The aluminum collection fins which cover the two faces of the SuperCDMS HV
detector can be seen on the left as bright ovals. On the right panel is a schematic of the
quasiparticle trapping process. The quasiparticles are orange dots and the Cooper pairs
are shown as double dots. Each oval set of collection fins surrounds a tungsten TES (in
red) and its Al/W overlap regions (in purple).
where C, D, E respectively represent the yellow, red, and orange channels in the HV detector module in Figure 1, and X and Y are the projective coordinate mappings based on pulse amplitudes or start times.
[Figure 3 image: two panels; left panel x-axis "Center Ring X Partition", right panel x-axis "Center Ring X Delay [µs]".]
Figure 3: Position mapping using the three inner ring channels on one side of an HV
detector.
Figure 3 plots the resulting positions for a dataset of around 30,000 interactions with
deposited energies between 13 and 19 keV. About half of the interactions were produced
by a collimated gamma radiation source aimed at the center of the detector. The spot at
the center of both plots is the reconstructed position of these events. Background events
from environmental gammas and betas uniformly populate the detector surface.
If these mappings were representative of true position reconstructions, the figures
would show a uniform density. Instead, we see a pattern of non-uniformity in these distributions. The non-uniformities are different in the two plots, suggesting that partition and delay (amplitude and timing) might provide complementary information in a true position reconstruction problem. These mappings can give some idea of (x, y) positions, but the relationship to true position is difficult to determine. While both figures show distinctive symmetrical properties emerging from the detector geometry, these relationships are also not single-valued, leading to a pattern of degeneracies. A more sophisticated position reconstruction is desirable, one that potentially combines amplitude and detailed timing information from all channels.
[Figure 4 right-panel image: pulse amplitude (arbitrary units) vs. time [ms] for Pulses A, B, C, D, and F.]
Figure 4: Left: The locations of the Am-241 source during data collection are shown
as blue dots superposed on the SuperCDMS HV detector channels (dimensions shown in
mm). Right: phonon pulse shapes recorded by individual channels on the top surface of a
SuperCDMS SNOLAB HV prototype at 300 V bias, when the Am-241 source is above the
circled location. Channel E was not operational for this test.
diameter hole 6 mm above the detector surface. As the source was scanned across the
detector, the resulting waveforms were recorded with a 1.25 MHz digitizer, 4096 samples
per waveform.
The movable radioactive source was used to produce interactions at thirteen different
locations on the detector along a radial path from the central axis to close to the detector’s
outer edge, as shown in Figure 4. Waveforms from a total of 7151 particle interactions
were recorded over these 13 impact locations. The exact locations of the source and the
number of interactions are given in Table 1.
Figure 4 shows the waveforms (labeled by their channel) resulting from an interaction
with the source at the circled location in the figure. As the phonons move outward from the
interaction point, the closest channels (in this case F and D) are intersected first, forming
the sharp risetime and high amplitudes seen in waveforms F and D. All channels collect
phonons as the vibration spreads outward, reflecting multiple times on all surfaces, thus
forming the long tails.
Table 1: The different impact locations and the corresponding number of experiments
performed at that location. The entries marked as bold are used as the held-out set, as
explained in Section 3.3, to test the ability of the ML models to generalize for previously
unseen data.
Further cleaning of the data was performed in order to remove background, isolate the source information, and reject bad pulse shapes, as follows (an illustrative code sketch applying such cuts is given after the list):
1. A mismatch between the algorithms used to quantify pulse amplitude: if the integral amplitude does not agree with the pulse-shape template fit, the interaction is rejected. This effectively removes pile-up, in which two interactions occur so close in time that multiple pulses appear in the same waveform.
2. Timing outliers, caused by failure of the pulse timing algorithms on distorted or noisy waveforms, are rejected.
3. If the fraction of energy in the channel nearest to the radioactive source location is
low, it is likely caused by a background interaction in a different channel. These
events are rejected.
4. The Am-241 source produces characteristic x-ray energies which appear as peaks in
plots of distributions of the total energy. We select only interactions in a region of
interest around the 14.0 and 17.7 keV peaks.
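For illustration, a minimal Python sketch of how such selection cuts might be applied is shown below. The file name, column names, and thresholds are hypothetical and would need to match the released data format; this is a sketch, not the analysis code used for the paper.

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names; thresholds are illustrative only.
events = pd.read_csv("am241_scan_events.csv")

# 1. Amplitude-mismatch cut: the integral-based amplitude must agree with the
#    template-fit amplitude (removes pile-up).
amp_ok = np.abs(events["amp_integral"] / events["amp_template"] - 1.0) < 0.1

# 2. Timing-outlier cut: reject events whose timing fits failed (flagged as NaN).
timing_ok = events.filter(regex="r20$").notna().all(axis=1)

# 3. Energy-fraction cut: most of the energy must be in the channel nearest
#    the source position.
frac_ok = events["nearest_channel_fraction"] > 0.3

# 4. Region of interest around the 14.0 and 17.7 keV Am-241 lines.
e = events["energy_keV"]
roi_ok = ((e > 13.0) & (e < 15.0)) | ((e > 16.7) & (e < 18.7))

clean = events[amp_ok & timing_ok & frac_ok & roi_ok]
```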
The following pulse-shape parameters were defined for each channel (a sketch of how such threshold-crossing times could be computed is given after the list):
• start: The time at which the pulse rises to 20% of its peak with respect to Channel A
• rise: The time it takes for a pulse to rise from 20% to 50% of its peak
• width: The width (in seconds) of the pulse at 80% of the pulse height
• fall: The time it takes for a pulse to fall from 40% to 20% of its peak
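A minimal sketch of how these threshold-crossing parameters could be computed from a digitized waveform is given below. It works at the sample level without interpolation; the actual SuperCDMS processing is based on fitted pulse templates, so this is illustrative only.

```python
import numpy as np

def crossing_time(t, pulse, frac, edge="rise"):
    """Time at which the pulse crosses `frac` of its maximum:
    first sample above the level on the rising edge,
    last sample above the level on the falling edge."""
    level = frac * pulse.max()
    above = np.nonzero(pulse >= level)[0]
    idx = above[0] if edge == "rise" else above[-1]
    return t[idx]

def timing_features(t, pulse, t_ref_a20):
    """start / rise / width / fall parameters for one channel's waveform,
    with `start` measured relative to the channel A 20% rise time."""
    r20 = crossing_time(t, pulse, 0.20, "rise")
    r50 = crossing_time(t, pulse, 0.50, "rise")
    f40 = crossing_time(t, pulse, 0.40, "fall")
    f20 = crossing_time(t, pulse, 0.20, "fall")
    width80 = (crossing_time(t, pulse, 0.80, "fall")
               - crossing_time(t, pulse, 0.80, "rise"))
    return {"start": r20 - t_ref_a20,   # 20% rise time w.r.t. channel A
            "rise":  r50 - r20,         # 20% -> 50% of peak
            "width": width80,           # width at 80% of pulse height
            "fall":  f20 - f40}         # 40% -> 20% of peak

# Example time axis: 1.25 MHz digitizer, 4096 samples per waveform, as in this data set.
t = np.arange(4096) / 1.25e6
```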
The full dataset provides 85 input features for each interaction. These include one amplitude parameter and sixteen timing/shape parameters for the waveforms for each of the 5 channels (A, B, C, D, F, since E was inoperable during the run). The amplitude parameter
is a measure of the size of each waveform based on comparison to a normalized waveform
template and given in arbitrary units. It is labeled by channel, e.g. Bamp. The timing
parameters represent the time during the rise and fall of the waveform at which the pulse
reaches a given percentage of its maximum height. The parameters are given names such
as Cr40 (the 40%-point for channel C as the waveform is rising) and Ff80 (the 80%-point
for channel F as the waveform is falling). These times are measured with respect to Ar20,
the timestamp of the pulse in channel A reaching 20% of its maximum height. Thus, Ar20
is always zero, reducing the number of non-trivial features to 84. In both datasets, the
timing information is given in units of micro-seconds (µs) and the impact location is given
in units of millimeters (mm).
To train the ML models and also test their generalization to samples from positions never seen during model learning, we first divide the dataset into two separate subsets (a code sketch of this splitting follows the list):
• Model-learning subset MLS contains 5636 data samples whose positions are in the set
{0.0, -3.969, -9.988, -17.992, -19.700, -21.034, -24.077, -36.116, -39.400, -41.010}.
The subset is further divided into training and validation sets in a 4:1 ratio. The
splitting is done randomly during each instance of model training. Hence, each
model is trained multiple times to test its sensitivity to this random split. The training
set is used to train the ML models and the validation set is used to validate the
performance of the learned model.
• Held-out subset HOS contains 1515 data samples whose positions are in the set {-12.502, -29.500, -41.900}. This is treated as our test set, on which we evaluate the performance of the learned ML model in generalizing to previously unobserved locations.
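A sketch of this split, assuming the features and impact locations are held in NumPy arrays X and y (hypothetical variable names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Impact locations (mm) reserved for the held-out set (HOS).
hos_positions = [-12.502, -29.500, -41.900]

hos_mask = np.isin(y, hos_positions)          # True for held-out interactions
X_hos, y_hos = X[hos_mask], y[hos_mask]
X_mls, y_mls = X[~hos_mask], y[~hos_mask]

# 4:1 training/validation split of the model-learning subset (MLS),
# redrawn randomly for every model training.
X_train, X_val, y_train, y_val = train_test_split(X_mls, y_mls, test_size=0.2)
```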
To facilitate the learning process, it is essential to standardize the features. In this work,
a standard scaling method is used, where each feature is scaled to have a mean of zero
and a standard deviation of one. Fig. 5 showcases the feature distributions for the simple
dataset before (Fig. 5(a)) and after (Fig. 5(b)) standardization. Note that the numerical
range widely varies across features, indicating that it is crucial to standardize the data in
order to mitigate the domination of learning by a small subset of features.
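For example, with scikit-learn the scaling could be done as below; fitting the scaler on the training set only and reusing its statistics for the validation and held-out sets is a common choice (the paper does not specify this detail), and the array names follow the earlier sketch.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                  # zero mean, unit standard deviation
X_train_s = scaler.fit_transform(X_train)  # fit on the training data only
X_val_s = scaler.transform(X_val)          # reuse the training-set statistics
X_hos_s = scaler.transform(X_hos)
```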
[Figure 5 image: per-channel feature distributions (Bstart through Fwidth); left panel "Unnormalized Features" with values in seconds (scale 1e-4), right panel "Normalized Features - Standard Scaler".]
Figure 5: The numerical range of each feature before normalization (left) and after nor-
malization (right).
source position with the parameter and fitting to a function. Such an approach can be
applied to this dataset as well, which can provide an analytical benchmark with which to
compare ML results. An example of such a timing parameter is the time when the pulse
amplitude in channel C reaches 30% of its peak amplitude compared to the same point in
relative amplitude of the channel A pulse. The data can be fit to a hyperbolic sine function
in order to predict the location of an interaction at any unknown point along this line
(Figure 6).
Examining Figure 4 reveals why the start time of Pulse C is a good choice for a single
parameter to characterize position in this dataset: the source was moved along a path
taking it farther and farther from Channel C. After fitting to the model-learning subset (see Section 3.3), we obtain a root mean squared error (RMSE) of 2.125 mm on
the held-out subset points. This example curve fit provides an analytical benchmark with
which to compare machine learning solutions.
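A sketch of such a fit with SciPy is shown below, assuming arrays cr30 (the channel C timing parameter for each model-learning interaction, in µs) and y_true (the known source positions, in mm), plus cr30_hos and y_hos for the held-out locations; all variable names are hypothetical, and the initial guesses are taken near the values quoted in Figure 6.

```python
import numpy as np
from scipy.optimize import curve_fit

def sinh_model(x, a, b, c, d):
    # y = (1/c) * sinh((x - a) / b) + d, the functional form shown in Figure 6
    return np.sinh((x - a) / b) / c + d

popt, pcov = curve_fit(sinh_model, cr30, y_true, p0=[1.6, 49.0, -0.05, -9.2])

# Generalization check: RMSE on the held-out positions
rmse = np.sqrt(np.mean((sinh_model(cr30_hos, *popt) - y_hos) ** 2))
```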
While we see some success from fitting a single variable in channel C, there is
clearly much more information in the waveforms of all the different channels. Finding the
optimal use of all the information is a task better suited to machine learning. The problem
can be formulated as a standard regression task where the relationship between the impact
location (y) and the collection of observed pulse information (x) can be expressed as
$$y = f(x; \theta) \qquad (5)$$
[Figure 6 image: Interaction Location y [mm] vs. the channel C timing parameter; the data are fit to (1/c) sinh((x − a)/b) + d with a = 1.6199, b = 49.2475, c = −0.0452, d = −9.1611.]
The simplest choice for the mapping is a linear model,
$$\hat{y} = \theta_0 + \sum_{i=1}^{k} \theta_i x_i \qquad (6)$$
where k is the number of input features. The parameters of the model (the bias term θ0 and the θi coefficients) are determined by minimizing the mean squared error (MSE), often accompanied by a regularizing term:
$$\ell = \frac{1}{N}\sum_{j=1}^{N}\left(y_j - \hat{y}_j\right)^2 + \alpha R_p\!\left(\vec{\theta}\right) \qquad (7)$$
Ridge and Lasso regularizations set p = 2 and p = 1 respectively, with α > 0. $\mathrm{RMSE} = \sqrt{\tfrac{1}{N}\sum_{j=1}^{N}(y_j - \hat{y}_j)^2}$ is considered as the performance metric.
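A sketch of the corresponding scan over α with scikit-learn is shown below, reusing the standardized feature arrays from the earlier sketches (hypothetical variable names); the α grid and solver settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

def rmse(model, X, y):
    return np.sqrt(np.mean((model.predict(X) - y) ** 2))

results = {}
results["OLS", None] = LinearRegression().fit(X_train_s, y_train)
for alpha in np.logspace(-5, 5, 11):
    results["Ridge", alpha] = Ridge(alpha=alpha).fit(X_train_s, y_train)
    results["Lasso", alpha] = Lasso(alpha=alpha, max_iter=50000).fit(X_train_s, y_train)

for (name, alpha), model in results.items():
    print(name, alpha,
          rmse(model, X_train_s, y_train),   # training RMSE
          rmse(model, X_val_s, y_val),       # validation RMSE
          rmse(model, X_hos_s, y_hos))       # held-out (test) RMSE
```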
[Figure 7 image: panel (a) Predicted Value [mm] vs. True Value [mm] for the train, validation, and test sets; panel (b) RMSE [mm] vs. the regularization parameter α (from 10^-5 to 10^5) for the train, validation, and test sets of the OLS, Ridge, and Lasso models.]
Figure 7: Comparison of the predicted and true interaction positions from the OLS model
(left) and comparison of the MSE losses from the training, validation, and test data from
OLS regression with those from the Ridge and Lasso regression models for different values
of the regularization parameter α (right).
Figure 7(a) shows the comparison of true and predicted values of the interaction location for the OLS regression on the reduced dataset. In Figure 7(b), the RMSE values for the training, validation, and test data are compared with those from Ridge and Lasso regression models for different choices of α for the same dataset. One immediate observation is
the model’s poor performance in generalizing for the test data: the RMSE from OLS for
the test data is much larger than those for the training and validation subsets. Also, the
values of α that have the least RMSE values for the test data in Ridge and Lasso regressions
don’t have the smallest RMSE values for training and validation data. This suggests that
the model does not generalize well for unobserved impact locations. The model’s inability
to generalize well can be attributed to the differences in the feature correlations between
the training and test data. As shown in Fig. 8, the correlation structure is notably different
between training and test data, hinting at the possibility that feature correlation itself may
be a function of the impact location.
Since the simple dataset shows a visibly distinct correlation structure among its features, preventing the model from generalizing, it is instructive to investigate whether this problem can be overcome using the extended dataset. We found that the same training and test subsets showed a more consistent correlation structure for the timing information in the extended dataset. Therefore, the performance gap between the training and test set might become smaller by training a regression model on the extended dataset.
[Figure 8 image: two correlation matrices over the simple-dataset features (Bstart through Fwidth), with the colour scale spanning −1.00 to 1.00.]
Figure 8: Correlation among input features of the simple dataset for (a) the training dataset and (b) the test dataset.
In order to identify the combination of dataset and model that gives the best performance and generalization, the behavior of the OLS, Ridge, and Lasso regression models was examined for (a) the reduced dataset, (b) the extended dataset without the amplitude information, and (c) the extended dataset with the amplitude information. Additionally, owing to the large correlations among different feature pairs in these datasets, principal component analysis (PCA) was applied to reduce the number of features and remove multicollinearity from
the input. A subset of the principal components was chosen, which accounted for 99.9%
of the observed variance in the training data. These reduced features were used in place
of the original features, x, in the linear regression model in Eqn. 6.
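A sketch of the PCA step with scikit-learn follows, reusing the standardized arrays from the earlier sketches; passing a float to n_components keeps the smallest number of components explaining that fraction of the training-set variance.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

pca = PCA(n_components=0.999, svd_solver="full")  # keep 99.9% of the variance
Z_train = pca.fit_transform(X_train_s)            # fit on training data only
Z_val, Z_hos = pca.transform(X_val_s), pca.transform(X_hos_s)

pca_model = LinearRegression().fit(Z_train, y_train)
```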
Figure 9 shows the distribution of the RMSE values for different choices of dataset and
associated models. Besides providing a comprehensive overview of model performance
for different datasets and regression models, it also allows us to make some important
observations. First, including the amplitude information significantly impairs the model’s
ability to generalize to the test data. The known issues with calibration and amplitude
measurement as described in the previous section may be responsible for this behavior.
Second, the best performance is obtained from the extended dataset without the amplitude information. Not only are the validation and test RMSE much lower than those obtained from the reduced dataset, but the test-data RMSE values are also comparable to the validation-data RMSE, which indicates the model's ability to generalize well. Third, the spread of RMSE values for this dataset is also relatively small, indicating its insensitivity to the actual training-validation split. And finally, the performance of the OLS, Ridge, and Lasso models is comparable for both full and PCA-transformed feature sets, but the
PCA-transformed features always show worse performance than the full dataset.
[Figure 9 image: distributions of RMSE [mm] (roughly 1.00 to 3.00) for validation and test data, for the LR, Ridge, and Lasso models applied to the S-Full, S-PCA, X-nA-Full, X-nA-PCA, X-A-Full, and X-A-PCA data-model combinations.]
Figure 9: The distribution of RMSE values for validation and test data for different choices
of dataset and fit models. The different labels along the X axis refer to the different data-
model combinations. S, X-nA, and X-A refer to the reduced, extended without amplitude,
and extended with amplitude datasets respectively. Full and PCA refer to the models where
the full feature-set and the PCA-reduced features are used respectively. The distributions
of the RMSE losses were obtained for 50 different choices of random splitting for training
and validation dataset.
[Figure 10 image: coefficient values (roughly −0.2 to 0.4) from the LR, Lasso, and PCA-transformed regression models for the bias term and each timing feature (Ar10 through Ff20), grouped by channel.]
Figure 10: The coefficients associated with different features of the extended dataset with-
out pulse amplitude from OLS, Lasso, and PCA-transformed regression models. Features
enclosed within two successive vertical red lines relate to the same detector channel. The
Lasso regression was performed with α = 10−4 .
While the performance metrics suggest that the OLS regression on the extended dataset without amplitude information performs as well as the regularized variants, the actual estimate of the coefficients can be unstable since there are a large number of highly correlated feature-pairs in the dataset (Figure 10). The Lasso regularization helps reduce the number of effective features in the model since it allows setting the coefficients of unimportant features (e.g. features that don't offer independent information) to zero. Examining the coefficients of the Lasso regression with α = 10−4, we found that between 33 and 37 of the 79 coefficients were set to zero for different training/validation splits. The number and the set of coefficients actually set to zero varied between iterations. This is understandable since the model can obtain the same information from different linear combinations of highly correlated features. Some of the features with coefficients of larger magnitude from the Lasso regression received weaker coefficient estimates from the PCA-transformed regression, which accounts for the degradation of performance in PCA-transformed models.
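The sparsity of a fitted Lasso model can be inspected directly, as in the sketch below (reusing the standardized training arrays from the earlier sketches); the exact count depends on the training/validation split and solver settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1e-4, max_iter=100000).fit(X_train_s, y_train)
n_zero = int(np.sum(lasso.coef_ == 0.0))      # coefficients driven exactly to zero
print(f"{n_zero} of {lasso.coef_.size} coefficients set to zero")
```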
To identify a subset of important features in a robust fashion, we applied a second
method for feature selection using a variance inflation factor (VIF). VIF is a feature-wise
metric that determines how well a feature can be expressed as a linear combination of other features, defined as
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \qquad (8)$$
where $R_j^2$ is the goodness-of-fit measure (i.e., the $R^2$ value) for the j-th feature. Large
values of VIF indicate that a feature is highly correlated with other features in the dataset.
To obtain a reduced set of features for the extended dataset without amplitude, we sequentially removed features from the dataset which showed VIF > 1000. This sequential feature pruning allowed simplification of the dataset while keeping a subset of relatively independent features. The OLS model was then fit using the reduced feature-set. A t-test was performed to determine whether the coefficient estimates were significant at the 95% confidence level. To account for the variability due to the training/validation split, this procedure was
performed $N_E = 50$ times with different random splittings of the training/validation data, thus providing an estimate of the importance of the features using the following relation:
$$\mathrm{Importance}[j] = \frac{1}{N_E}\sum_{i=1}^{N_E} |\theta_j|_i \cdot [\mathrm{VIF}_j < 1000]_i \cdot [p_j < 0.05]_i \qquad (9)$$
where the index i represents the different iterations of the experiment and $p_j$ is the p-value associated with the t-test on the j-th feature. It should be noted that the importance assigned to a feature using Eqn. 9 depends on the actual ordering of the features in the
dataset. While the process determines one subset of features that can describe the data
well, a different subset can be chosen based on a different ordering of the input features.
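A sketch of one possible implementation of the VIF pruning and significance test using statsmodels is shown below, reusing the standardized arrays from the earlier sketches. The paper's procedure removes features sequentially following their ordering in the dataset, whereas this illustrative version drops the largest-VIF feature at each step; the threshold and significance level follow the text.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X, threshold=1000.0):
    """Drop features until every remaining VIF is below `threshold`.
    Returns the indices of the surviving columns."""
    keep = list(range(X.shape[1]))
    while True:
        vifs = [variance_inflation_factor(X[:, keep], i) for i in range(len(keep))]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            return keep
        del keep[worst]

keep = prune_by_vif(X_train_s)
ols = sm.OLS(y_train, sm.add_constant(X_train_s[:, keep])).fit()
significant = ols.pvalues[1:] < 0.05            # t-test at 95% confidence
contribution = np.abs(ols.params[1:]) * significant

# Averaging `contribution` over N_E = 50 random training/validation splits
# (with zero for pruned or insignificant features) yields the importance of Eq. (9).
```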
[Figure 11 image: horizontal bar chart of Importance (0.00 to 0.10) for the top 20 features: Ff40, Cr30, Cr20, Cr40, Af40, Cr10, Ff20, Af80, Cr80, Df40, Cr70, Bf80, Cf40, Af90, Ar50, Br60, Af95, Cf20, Br10, Cr90.]
Figure 11: The top 20 features and the corresponding importance values ( Eq. (9)) for the
extended dataset without amplitude information.
Figure 11 shows the top 20 features and the importance values assigned to them. Note
that the reduced model’s performance gets worse for VIF thresholds less than 1000. Also,
the number of features that were assigned zero importance by Eqn. 9 was 35, similar to
the number of zero coefficients found from the Lasso regression model. Comparing Fig. 10
with Fig. 11 reveals that the most important features following Eq. (9) also have relatively
larger coefficients from the Lasso regression. The median validation and test RMSE for this reduced model were 1.24 mm and 1.32 mm respectively, close to the numbers obtained from the OLS and Lasso regression models.
enable neural networks to identify more complex relationships between input and output data that may be inaccessible through linear regression. Using neural networks, the process of learning the mapping function f from Eq. (5) is transformed into solving the optimization problem
$$\min_{\theta} \ \ell(y, f(x; \theta)), \qquad (10)$$
where x represents the extracted features from the observed signals, y is the true position/location (ground truth), ℓ is the loss function, and θ represents the network weights. For our implementations, we used root mean squared error (RMSE) as our loss function, defined as the square
root of the MSE from Eq. (7).
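The paper does not state the framework or hyperparameters used; the following PyTorch sketch shows the kind of fully connected, dropout-regularized network (here the "DNN-2" configuration) and RMSE objective described above. The layer width, dropout rate, learning rate, and epoch count are chosen arbitrarily for illustration, and the tensors are built from the standardized arrays of the earlier sketches.

```python
import torch
import torch.nn as nn

class DNN(nn.Module):
    """Fully connected regression network with `n_hidden` hidden layers and dropout."""
    def __init__(self, n_features, n_hidden=2, width=64, p_drop=0.2):
        super().__init__()
        layers, d = [], n_features
        for _ in range(n_hidden):
            layers += [nn.Linear(d, width), nn.ReLU(), nn.Dropout(p_drop)]
            d = width
        layers.append(nn.Linear(d, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = DNN(n_features=X_train_s.shape[1], n_hidden=2)   # "DNN-2"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X_t = torch.tensor(X_train_s, dtype=torch.float32)
y_t = torch.tensor(y_train, dtype=torch.float32)

for epoch in range(200):
    optimizer.zero_grad()
    loss = torch.sqrt(nn.functional.mse_loss(model(X_t), y_t))  # RMSE loss
    loss.backward()
    optimizer.step()
```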
Figure 12: The performance (RMSE) curves for the training and validation sets.
The performance of our neural network models on the three datasets: (1) simplified,
(2) extended without amplitude, and (3) extended with amplitude is summarized in Fig. 13.
Similar to what was observed in the case of the linear regression models, DNN models trained on the extended dataset without amplitude information give the best performance in terms of their ability to close the performance gap between the training and test sets. Because
of the calibration issues with the recorded pulse amplitude information in the full data set
as noted in Section 3.1, inclusion of this information significantly worsens the models’
generalizability.
Contrary to expectations, naively increasing the model complexity does not improve
the performance. Instead, not only do we see an increase in the RMSE values for larger network architectures, but their ability to generalize to the held-out test set (HOS) also worsens progressively with the number of hidden layers. This is also verified in the distributions of predictions from the DNN-2, DNN-5, and DNN-10 models trained on the simple dataset for the three impact locations in the HOS, as shown in Fig. 14. Only the predictions made by the iteration of each DNN which produced the smallest RMSE for the test data are shown. It can
be seen that DNN-2 yields the best predictions across all test positions, consistent with the
results in Fig. 13.
[Figure 13 image: horizontal distributions of RMSE [mm] (roughly 1 to 8) for validation and test data for the DNN-2, DNN-5, and DNN-10 models trained on the S-DO, X-nA-DO, and X-A-DO data sets.]
Figure 13: The distribution of loss values for validation and test data for different choices
of data set and DNN models using dropout regularization. The different labels along the X
axis refer to the different data-model combinations. S, X-nA, and X-A refer to the reduced,
extended without amplitude, and extended with amplitude data sets respectively. DNN-2,
DNN-5, and DNN-10 refer to DNN models with 2, 5, and 10 hidden layers respectively.
The distributions of the RMSE losses were obtained for 50 different choices of random
initialization of network weights.
[Figure 14 image: distributions of predicted positions (number of entries) for the three held-out impact locations, with the ground truth indicated in each panel.]