Abstract
Implicit neural networks and the related deep equilibrium models are investigated. To train these networks, the gradient of the corresponding loss function should be computed. Bypassing the implicit function theorem, we develop an explicit representation of this quantity, which leads to an easily accessible computational algorithm. The theoretical findings are also supported by numerical simulations.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
The conventional neural networks have a feedforward structure: several layers are stacked after each other and their output can be computed explicitly. To generalize this structure, the so called implicit neural networks were introduced and analyzed in [1,2,3,4,5]. Also, a related approach called the deep equilibrium models was developed in the works [6,7,8]. Shortly, this model can be described as a feedforward deep neural network with identic layers. Practically, by increasing the number of layers, the existence and the computation of an equilibrium state is investigated. More precisely, Bai et al. [6] formulated an L-layer feedforward network as
, where \(z^{[k]}\) are the hidden states in the k-th layer and the input x was utilized in each transition. In case of certain stability conditions, the limit \(L\rightarrow \infty\) corresponds to a fixed-point iteration and hence leads to an equilibrium solution z of the equation
However, the main task is to minimize a given loss function \({\mathcal {E}}(z;t)\) among the solutions of Eq. (1) by optimizing the parameters \(\theta\) of the transformation \(f_\theta\). Here, for a gradient-based optimization technique, the gradient \(\frac{\partial {\mathcal {E}}}{\partial \theta }\) of the loss function has to be calculated. The corresponding theory is based on the implicit function theorem, see [6, 9], and [1]. An efficient numerical approximation of z in Eq. (1) is also rather complex, for further details see [7, 10, 11].
Our contributions in this paper, being theoretical and empirical, are as follows. We introduce implicit neural networks similarly to deep equilibrium models, and propose a novel theory bypassing the traditional reliance on the implicit function theorem. This advancement leads to an easily accessible algorithm for computing the gradient. Our theoretical results are also confirmed with numerical simulations providing empirical evidence for our theoretical contributions.
2 Preliminaries
2.1 Construction of the network
In general, a neural network is represented by a directed graph [12], the computational graph of the network. A network is called feedforward or acyclic if the corresponding graph is acyclic. Similarly, cyclic directed graphs correspond to the so-called implicit neural networks.
We use K for the total number of vertices, which are also called the neurons. Let \(a_j\) denote the activation value of the j-th neuron with \(f_j: {\mathbb {R}} \rightarrow {\mathbb {R}}\) being the corresponding activation function. Common examples are, e.g., \(f_{j}(z)=\tanh (z)\) or \(f_{j}(z)=\max \{0,z\}\). Let the lift operator \({\hat{\cdot }}: {\mathbb {R}}^L \rightarrow {\mathbb {R}}^K\) be defined by the formula \({\hat{x}} = \left( x_1, \dots , x_L,0\dots ,0 \right) ^T\). That is, we assume that the input data is copied to the first L neurons of the network for simplicity.
For an input vector \(x\in {\mathbb {R}}^L\), the network is evaluated as follows. If a neuron with index j receives a stimulus of magnitude \(a_l\) from its ancestor of index l along the edge with the weight \(w_{j,l}\) and a constant stimulus \(b_{j}\) (also called the bias) applied to it, then the cumulated input \(z_j\) of this neuron is
Accordingly, the activation value of the neuron with index j is
Assume for simplicity that the indexing of the neurons is such that the last N neurons are the output ones.
2.2 Problem statement
Indeed, the formulas in (2), (3) define the following system of K equations:
Here, summarized, \(z(x), a(x), b \in {\mathbb {R}}^K,\) \(W\in {\mathbb {R}}^{K \times K}\) and \(x \in {\mathbb {R}}^L\). Also, the functions \(f_j: {\mathbb {R}}\rightarrow {\mathbb {R}}\) are given for \(j = 1, \dots , K\), which define \(f(z) = \left( f_1 (z_1), \dots , f_K (z_K) \right) ^T\in {\mathbb {R}}^K\) with \(z = \left( z_1, \dots , z_K \right) ^T\). Sometimes, we simplify the notation and omit the x-dependence of the terms in (4).
We also have M pairs of training samples (x, y) with \(x\in {\mathbb {R}}^L\) the input and \(y\in {\mathbb {R}}^N\) the target vectors assuming that \(1 \le L,N \le K\). To compute with vectors of size K, we introduce the operator \({\tilde{\cdot }}: {\mathbb {R}}^N \rightarrow {\mathbb {R}}^K\) defined by the formula \({\tilde{y}} = \left( 0,\dots ,0,y_1,\dots ,y_N \right) ^T\).
At the m-th pair of samples, i.e., at the input \(x^{(m)}\), the error is defined as
where \({\tilde{y}}_{j}^{(m)}\) denotes the corresponding component of the m-th training sample \(y^{(m)}=(y^{(m)}_1,\dots ,y^{(m)}_N) \in {\mathbb {R}}^N\) of the target vector to be compared with the value of \(a_j(x^{(m)})\).
The average error \({\mathcal {E}}\) over all pairs of training samples is given by
which will be minimized with respect W and b.
Solving Eq. (4) by a fixed-point iteration yields the vector \(z=\left( z_1,\dots ,z_K \right) ^T\) of neuron input values and the vector \(a=\left( a_1,\dots ,a_K \right) ^T\) of activation values. A single step of this has the form
An important observation is that the iteration in (7) delivers a feedforward neural network of infinite number of layers with K neurons in each layer, so we get a deep equilibrium model. In this framework, the fixed point iteration corresponds to the layer-wise computation with the original input. The weights of the edges passing between each two adjacent layers are given by the matrix \(W \in {\mathbb {R}}^{K\times K}\) and the bias vector \(b \in {\mathbb {R}}^K\). A small network is shown in Fig. 1 while Fig. 2 illustrates the unrolling of this network, focusing on the third neuron.
The pseudocode for computing the network is given in Algorithm 1.
2.3 Further notations
Summarized, we use the following notations in the infinitely deep network:
-
The initial vector of the iteration is \(z^{(1)} = {\hat{x}} \in {\mathbb {R}}^K\). Let \(z_i^{(l)}(x)\) denote the input value of the i-th neuron in the l-th layer and use \(z_{i} (x) = \lim _{{l \to \infty }} z_{i}^{{(l)}} (x)\) provided that this exists and is finite. In vector form, we have
$$\begin{aligned} {z^{(l)}(x)=\left( z_1^{(l)}(x),\dots ,z_K^{(l)}(x) \right) ^T\quad \text {and}\quad z (x)=\left( z_1 (x),\dots ,z_K (x) \right) ^T.} \end{aligned}$$Sometimes, for simplicity, we omit the arguments x.
-
Let \(a_i^{(l)}(x)\) denote the activation value of the i-th neuron in the l-th layer with the input vector x. We use the notation \(a_{i} (x) = \lim _{{l \to \infty }} a_{i}^{{(l)}} (x),\) provided that this exists and is finite. Accordingly, we use
$$\begin{aligned} f(z^{(l)})(x) = a^{(l)}(x)=\left( a_1^{(l)}(x),\dots ,a_K^{(l)}(x) \right) ^T \end{aligned}$$and \(a (x)=\left( a_1 (x),\dots ,a_K (x) \right) ^T.\)
-
Parallel with the formula (5), we also introduce \(d^{(\infty )} \in {\mathbb {R}}^K\) as
$$\begin{aligned} d^{(\infty )}_j = {\left\{ \begin{array}{ll} \left( a_j - {\tilde{y}}_j^{(m)} \right) {f_j}' (z_j), \quad K-N< j \le K\\ 0, \quad 1\le j \le K-N. \end{array}\right. } \end{aligned}$$Here, the activation function \(f_j: {\mathbb {R}} \rightarrow {\mathbb {R}}\), which is applied to the j-th neuron.
-
We use the notation
$$\begin{aligned} D_{i}^{(l)} = f'_{i} (z_{i}^{(l)}) \end{aligned}$$for the utility value of the i-th neuron in the l-th layer and \(D_{i} = \lim _{{l \to \infty }} D_{i}^{{(l)}}\) provided that it exists and is finite. We also define the diagonal matrix \(D \in {\mathbb {R}}^{K \times K}\) such that \(D = \textrm{diag} \left( D_{1}, \dots , D_{K} \right)\) and similarly, the diagonal matrices \(D^{(l)} \in {\mathbb {R}}^{K \times K}\) on the l-th layer.
3 Theoretical results
As discussed previously, we transform the original implicit network into an infinitely deep feedforward one. We apply the gradient backpropagation method in this network. To minimize, the error function (6) using some gradient-based method, we need to determine the partial derivatives \(\frac{\partial {{\mathcal {E}}^{(m)}}}{\partial w_{j,i}}\) and \(\frac{\partial {{\mathcal {E}}^{(m)}}}{\partial b_{j}}\) after calculating the equilibrium in iteration (7). In the following statement, we express these in concrete terms. We make use the gradient backpropagation method [13] by applying it first to a finite network, and then performing a limit transition with respect to the number of the layers. For our main result, we use the following assumptions.
Assumptions:
-
(i)
Equation (4) has a unique solution and the iteration in (7) is convergent such that we also have
$$\begin{aligned} \frac{\partial a_k}{\partial b_j} = \lim _{R\rightarrow \infty } \frac{\partial a_k^{(R),R}}{\partial b_j} \end{aligned}$$for all indices \(k = K-N+1,\dots , K\) and \(j= 1,\dots , K\).
-
(ii)
\(f_i \in C^1 ({\mathbb {R}})\), \(\forall i = 1, \dots , K\) and their derivatives are bounded.
-
(iii)
The linear mapping \(D W^T \in {\mathbb {R}}^{K \times K}\) is a contraction in some induced norm, i.e. \(\Vert D W^T \Vert <1\).
Theorem 1
Using assumptions in (i)–(iii), the system of equations
has a unique solution. Here, \(I \in {\mathbb {R}}^{K \times K}\) is the identity matrix. Furthermore, the partial derivatives of the error function \({\mathcal {E}}^{(m)}\) can be given as
Proof
Consider the finite network that consists of the first \(R\ge 2\) layers from the previously constructed infinite forward-connected network. Let \(z \in {\mathbb {R}}^K\) given as the initialization of the fixed point iteration in (7). With these, we have
or component wise, \(\displaystyle {z_j^{(l),R}=\sum _{k} w_{j,k} a_k^{(l-1),R}+b_j + {\hat{x}}^{(m)}_j}.\) Here, the letter R in the superscripts refers to the actual truncated finite network including R layers, where the gradient backpropagation is performed.
Using \((x^{(m)},y^{(m)})\) as an input–output pair, the error on the R-th layer of this truncated network is given by
where, we denote the z-dependence of the error. The partial derivative \(d_i^{(l),R}=\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial z_i^{(l),R}}\) is defined for the output neurons and will be extended to the non-output ones. In the R-th layer, according to the classical algorithm for gradient backpropagation, we have
for the output (\(K-N< j \le K\)) and nonoutput (\(1\le j \le K-N\)) neurons, respectively.
For \(1\le l<R\), correspondig to the gradient backpropagation algorithm [13], we have
For calculating \(\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j}\), we have to sum up the above vectors \(d^{(l),R}\). This principle is similar to the one in backpropagation through time [14]. Thus, we get the following identity:
According to the identity in (11), we have
which can be inserted into (12) to get
Observe that this should be true also for the fixed point \(z\in {\mathbb {R}}^K\) of the iteration in (7). Since in this case, the diagonal matrices \(D^{(l)}\) \(1 \le l \le R\) coincide, denoting their common value with D, Eq. (14) is simplified to
Taking the limit \(R\rightarrow \infty\) in Eq. (15) and using assumptions (i) and (ii), we get the equation
, where the existence of the inverse follows from the assumption (iii).
We turn now to the statement for \(\frac{{\mathcal {E}}(m,z)}{\partial w_{j,i}}\). Similarly to (12), we have the following equality for arbitrary \(z \in {\mathbb {R}}^K\):
We can apply formula (13) again in (17) to get
We assume again, that \(z\in {\mathbb {R}}^K\) is the limit in (7). Therefore, \(D^{(l)}\equiv D\) holds \(\forall \le l \in {\mathbb {N}}\). With these, we can rewrite (18) as
Performing here limit transition with respect to the number of the layers again, we finally get
which completes the proof of the theorem.\(\square\)
The pseudocode of the gradient calculation can be found in Algorithm 2. Note, that here the product \(\left( I - D W^T \right) ^{-1} d^{(\infty )}\), is approximated using Neumann series.
3.1 Computational complexity
If \(\textrm{T}\) represents the maximum number of iterations as in Algorithm 2, and \(K\) denotes the number of neurons, it is easy to see that the computational complexity of both the network evaluation and gradient calculation is \(O(\textrm{T} \cdot K)\). In the case of a feedforward network, this is considerably more favorable since a calculation is performed exactly once on every neuron for both forward propagation and backpropagation, making the computational complexity for the feedforward case simply \(O(K)\). This implies that the training duration for an implicit network could substantially exceed that of a feedforward network.
4 Numerical experiments
We will classify pulsars in the HTRU2 dataset [15] and network intrusions in the NSL-KDD dataset [16]. The methodology involves using autoencoders to compress and reconstruct the multidimensional data, facilitating the identification of normal and anomalous signals. Anomalies are distinguished from typical noise by leveraging the reconstruction error as a metric. This approach demonstrates the effectiveness of autoencoders in detecting patterns within complex datasets and their utility in astronomical data analysis and in the detection of network intrusions.
The structure of this section is as follows. Firstly, the datasets under investigation are described, including the data preprocessing. Then, the applied evaluation metrics and then the training approach based on [17] will be described. Finally, the results of the numerical simulations are shown and discussed.
4.1 The investigated datasets
4.1.1 The HTRU2 dataset
The HTRU2 dataset [15] is a collection of pulsar candidates collected during the high time resolution universe survey. It consists of 17,898 total samples, with 1639 real pulsar examples. These samples are described by eight continuous variables and a class label that distinguishes between the pulsar and nonpulsar candidates. Pulsars are a rare type of neutron stars that emit radiation that can be observed on Earth, making them of significant interest in astrophysics.
We partition the dataset into learning and testing sets. The learning set contains 90% of the original data. Then, the SMOTE algorithm [18] is applied to the learning set. That is, the learning set consists of 29,282 elements in a balanced ratio. This resampled set is further divided into training and validation sets for cross-validation using the above ratio.
Then, a simple standardization is used to normalize the data based only on non-pulsar samples. The exact sizes of the sets and the element numbers of the classes are shown in Table 1.
4.1.2 The NSL-KDD dataset
The NSL-KDD dataset [16] is widely used to detect network intrusions. This dataset was created to address certain deficiencies of the KDD Cup 99 dataset [19], such as redundant records in the training and testing sets, which could distort the evaluation of machine learning models. It includes data from normal traffic and various types of attacks.
In the preprocessing phase, numerical representations are derived from non-numerical data to construct inputs suitable for the autoencoder. The features Protocol, Service, and Flag are identified as categorical. The Protocol feature, which specifies a network protocol, can assume the values tcp, udp, or icmp. Through the application of one-hot encoding, this feature is transformed into a 3-dimensional vector. The Service feature, encompassing 70 distinct categories, is represented by a vector with 70 entries. Similarly, the Flag feature, with coded with 11 entries. Consequently, the input data are formulated as vectors of 117 dimensions by combining 33 numerical features with 84 one-hot-encoded categorical features.
Subsequently, the dataset is partitioned into sets for learning and testing. The learning set, comprising 84.83% of the total data, is further partitioned into training and validation subsets to facilitate cross-validation. This arrangement adopts a splitting ratio of 63.62/21.21/15.17% for the training, validation, and test sets. Then, a simple standardization is used to normalize all the features by calculating the two required scalars, the mean and the standard deviation of the training set only for benign samples. Table 2 shows the precise distribution of class elements.
4.2 Evaluation metrics
The predictions made can vary for different values of the threshold c to belonging to the positive class. With this, it can be determined whether a prediction qualifies as true positive (\(\textrm{TP}_c\)), false positive (\(\textrm{FP}_c\)), true negative (\(\textrm{TN}_c\)) or false negative (\(\textrm{FN}_c\)). The following evaluation metrics are used based on these values.
-
Accuracy (\(A_c\)) It is the ratio of all correctly identified instances, positive and negative classes, to the total number of instances in the dataset with a given classification threshold c. This quantity is given by the formula
$$\begin{aligned} A_c = \frac{\textrm{TP}_c+\textrm{TN}_c}{\textrm{TP}_c + \textrm{TN}_c+ \textrm{FP}_c+ \textrm{FN}_c}. \end{aligned}$$ -
Precision (\(P_c\)) It is defined as the ratio of correctly identified positive instances and the total number of instances classified as positive. This metric evaluates the accuracy of positive predictions with a given classification threshold c as follows:
$$\begin{aligned} P_c = \frac{\textrm{TP}_c}{\textrm{TP}_c + \textrm{FP}_c}. \end{aligned}$$ -
Recall (\(R_c\)) or True Positive Rate (\(\textrm{TPR}_c\)) It quantifies the percentage of positive cases correctly identified with a given c classification threshold. It can be calculated using the following equation:
$$\begin{aligned} R_c = \textrm{TPR}_c = \frac{\textrm{TP}_c}{\textrm{TP}_c + \textrm{FN}_c}. \end{aligned}$$ -
Matthew’s Correlation Coefficient (\(\textrm{MCC}_c\)) We use this as the primary metric to represent the best performance the model can achieve at a fixed threshold c. It is in the range between \(-1\) and 1, where one indicates a perfect prediction and \(-1\) means all predictions are false. In concrete terms, this score is given by
$$\begin{aligned} \textrm{MCC}_c = \frac{\textrm{TP}_c \cdot \textrm{TN}_c - \textrm{FP}_c \cdot \textrm{FN}_c}{\sqrt{\left( \textrm{TP}_c+\textrm{FP}_c \right) \cdot \left( \textrm{TP}_c+\textrm{FN}_c \right) \cdot \left( \textrm{TN}_c+\textrm{FP}_c \right) \cdot \left( \textrm{TN}_c+\textrm{FN}_c \right) }}. \end{aligned}$$ -
F1-score (\(\textrm{F1}_c\)) It is the harmonic mean of the precision and recall values, it can be calculated as follows.
$$\begin{aligned} \textrm{F1}_c = 2 \cdot \frac{P_c \cdot R_c}{P_c+R_c} \end{aligned}$$
4.3 Training approach
Here, the steps of the suggested training approach are described, slightly modified from those in article [17].
-
1.
Preprocessing Initially, samples of the positive class are excluded from the training dataset. Feature values are scaled via standardization. Specific preprocessing techniques applied to the HTRU2 and NSL-KDD datasets are shown in detail in Sect. 4.1.
-
2.
Training The autoencoder is designed to reconstruct inputs so that they closely resemble normal patterns, utilizing the \(L_2\)-norm for calculating reconstruction loss, as introduced in the formula (6). Inputs corresponding to anomalies, when processed, are expected to deviate significantly from their original form, facilitating classification based on the disparity between the input and its output. A threshold differentiates anomalies from normal instances. During training, the Stochastic Gradient optimizer and Cosine learning rate scheduler [20] are utilized, incorporating Nesterov momentum and a weight decay of 0.0001. The initial learning rate is set at 0.01, while the final learning rate is adjusted to 0.001. For the HTRU2 dataset, training is conducted over 20 epochs with a minibatch size of 16, whereas for the NSL-KDD dataset, training is conducted for 5 epochs with a minibatch size of 32.
-
3.
Model Selection The \(\textrm{F1}\)-score is calculated at each epoch multiple times at a particular frequency. The highest \(\textrm{F1}\)-score’s model weights are utilized upon training completion. It is calculated through the selection of the threshold, as shown in the following step.
-
4.
Threshold Selection Selecting the threshold for reconstruction distance between input and output significantly impacts performance. The optimal threshold \({\hat{c}}\) is determined by evaluating the model on the validation set by maximizing the \(\textrm{F1}_c\)-score in c. Also, this maximal \(\textrm{F1}_c\) value is the \(\textrm{F1}\)-score mentioned in the previous step. This process involves standardizing reconstruction distances using mean and variance from negative samples in the validation set to determine the most effective threshold for class separation. This means the Z-score of the validation loss is calculated. The optimal threshold is sought within the \([-\,4,4]\) interval with a division of 0.001. According to the properties of the standard normal distribution, \(99.994\%\) of the scaled errors falls within the \([-\,4,4]\) interval. The Z-score that most effectively separates the anomalies from normal samples in the validation data is the threshold \({\hat{c}}\).
-
5.
Evaluation Performance is assessed using test data, classifying samples based on their Z-score relative to the threshold. Samples identified as anomalies exceed the threshold. Predicted labels are compared to actual labels to compute the model’s metrics.
4.4 The proposed implicit autoencoder models
Autoencoder networks for anomaly detection are operating by learning to replicate normal data input. They consist of three main components:
-
Encoder It compresses the input into a latent-space representation. It learns the most important features of the data, effectively reducing its dimensionality.
-
Decoder It reconstructs the input data from the latent space representation, aiming to compute output as close to the original input as possible.
-
Latent Layer This is the core of the autoencoder: a compressed knowledge representation of the input data between the encoder and decoder.
The proposed autoencoder models are shown in Fig. 3. In case of v1 models, every neuron in the latent layer is interconnected in a directed manner, with no edges leading back to itself. The activation function chosen for the encoder, decoder, and latent layers is the \(\arctan\) function. For v2 models, the latent layer consists of two blocks of neurons. The pointwise product of the two blocks constitutes the layer’s output, which is directed towards the decoder. The input to the first block of the latent layer is sourced from the last encoder layer. The activation function for the first block and the encoders and decoders is the \(\arctan\) function. In contrast, the activation function for the second block of the latent layer is the sigmoid function. Also, in v2 models, no edges lead from a neuron back to itself within the latent layer. With these choices, we examine whether adding extra weights to feedforward networks and, thus, getting implicit networks brings advantages by fixing the number of neurons. Here we made only the smallest, i.e., the latent layer implicit to keep computational costs at a relatively low level.
For a fair comparison, we consider six different feedforward autoencoder configurations, each with 8 and 16 neurons in the latent layer. We refer to these networks as the v0 model family. All the configurations are shown in Table 3. In the feedforward networks, the ReLu activation function is applied in the encoder, decoder, and latent layers.
Also, for the purpose of ensuring fair comparisons, two classic models, the Random Forest and XGBoost models, are selected.
4.5 Experiment environment
With our implementation, all models were executed on a single NVIDIA Geforce GTX 1080 8GB GPU and a Xeon X5670 CPU, utilizing Python 3.8 and the CUDA C programming language with CUDA version 11.4. We set all possible random seeds during our numerical experiments for reproducibility purposes. To account for the variability of random processes, we repeat each experiment ten times and report the best and average scores in Sect. 5.
5 Numerical results
5.1 The HTRU2 dataset
We summarize the results in Table 4. The two largest values for each model family are displayed in bold in each column. Also, the two largest values in each column are underlined. The implicit model, identified as \((5;32;16)-v1\), dominates four out of five test metrics. From the perspective of the dataset under consideration, the most crucial test metric is recall, for which the \((5;64;16)-v2\) model performs the best, achieving the highest score of 0.9244. Recall is crucial for the HTRU2 dataset, because minimizing false negative predictions is essential for pulsar detection. Nevertheless, the XGBoost and Random Forest models exhibit marginally higher performance in the other metrics. The confusion matrices created on the test set by the implicit \((5;64;8)-v1\) and \((5;64;16)-v2\) models can be seen in Fig. 4.
We have also studied the stability of the methods by computing the average test metrics over ten simulations. This can confirm the computations’ reliability in Table 4. The results are shown in Table 5. The results are observed to be very close to one another. The implicit v1 family dominates the CTR metric within the autoencoders, while the v2 family excels in the other test metrics. Overall, however, the XGBoost and Random Forest models yield the most consistent values.
5.2 The NSL-KDD dataset
The results are shown in Table 6. The two largest values for each model family are printed in bold in each column. Also, the two largest values in each column are underlined. The \((5;64;16)-v1\) model is observed to dominate in four out of five test metrics, including the CTR metric, which is also this dataset’s most crucial test metric. The performance of the best v2 model, namely the \((5;32;8)-v2\), is observed to fall behind only a little from this. The XGBoost and Random Forest models are observed to perform substantially weaker than the autoencoders here. The confusion matrices created on the test set by the implicit \((5;64;8)-v1\) and \((5;64;16)-v2\) models can be shown in Fig. 5.
In Table 7, the stability of the models is investigated. Here, the \((5;64;16)-v1\) and \((5;32;8)-v2\) implicit autoencoder are observed to dominate, too.
Unfortunately, a significant cost is associated with the implicit models. This can be shown in Table 8. The computational time is about four times as much as their feedforward variants.
The comparison with the XGBoost and Random Forest models in terms of computational time, this is only partially fair, as these were executed without the use of GPUs.
6 Future work
Expanding and deepening our numerical simulations is a priority in the future work, especially in scenarios where current frameworks like PyTorch [21] or TensorFlow [22] encounter limitations. Furthermore, we aim to understand better the impact of activation functions and the sparsity of computational graphs representing network architectures on the speed and efficiency of convergence. Additionally, accelerating computations is a crucial issue, for which we would like to develop further the algorithm proposed in the article [6].
7 Conclusion
In this work, we have shown an illustrative approach to constructing deep equilibrium models, also called implicit neural networks, highlighting that these networks are given by such a computational graph that may even include a directed cycle. We have proved a theorem for calculating the gradient in such a network, enabling the computation of gradients without resorting to the implicit function theorem, but by directly calculating them in an infinitely deep feedforward network associated with the computational graph. Furthermore, numerical experiments confirmed our findings, providing empirical evidence to support the theoretical results. This work lays a foundation for further exploration into the capabilities and applications of Implicit Neural Networks, marrying theoretical insights with practical validation.
Data availability
The implemented codes and the dataset used in this article are available on the GitHub repository https://github.com/szbela87/imp_autoencoder. On this page, one can also find the documentation of the developed CUDA C code.
References
El Ghaoui L, Gu F, Travacca B, Askari A, Tsai A (2021) Implicit deep learning. SIAM J Math Data Sci 3(3):930–958. https://doi.org/10.1137/20M1358517
Bianchini M, Gori M, Sarti L, Scarselli F (2006) Recursive neural networks and graphs: dealing with cycles. In: Apolloni B, Marinaro M, Nicosia G, Tagliaferri R (eds) Neural nets. Springer, Berlin, Heidelberg, pp 38–43
Bianchini M, Gori M, Scarselli F (2002) Recursive processing of cyclic graphs. In: Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290), vol 1, pp 154–1591. https://doi.org/10.1109/IJCNN.2002.1005461
Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model. IEEE Trans Neural Netw 20(1):61–80. https://doi.org/10.1109/TNN.2008.2005605
Kawaguchi K (2021) On the theory of implicit deep learning: global convergence with implicit layers. In: International Conference on Learning Representations. https://openreview.net/forum?id=p-NZIuwqhI4
Bai S, Kolter JZ, Koltun V (2019) Deep equilibrium models. In: Wallach H, Larochelle H, Beygelzimer A, d’ alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems, vol 32. Curran Associates, Inc., Vancouver, Canada
Bai S, Koltun V, Kolter JZ (2020) Multiscale deep equilibrium models, vol 33. Vancouver, Canada, pp 5239–5250
Gu F, Chang H, Zhu W, Sojoudi S, El Ghaoui L (2020) Implicit graph neural networks. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H (eds) Advances in neural information processing systems, vol 33. Curran Associates Inc., Vancouver, Canada, pp 11984–11995
Steven G, Krantz HRP (2013) The implicit function theorem, 1st edn. Modern Birkhäuser classics. Birkhäuser, Boston. https://doi.org/10.1007/978-1-4614-5981-1
Winston E, Kolter JZ (2020) Monotone operator equilibrium networks. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H (eds) Advances in neural information processing systems, vol 33. Curran Associates Inc., Vancouver, Canada, pp 10718–10728
Geng Z, Zhang X-Y, Bai S, Wang Y, Lin Z (2021) On training implicit models. In: Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Vaughan JW (eds) Advances in neural information processing systems, vol 34. Curran Associates Inc., Vancouver, Canada, pp 24247–24260
Bondy JA, Murty USR (2008) Graph theory, 1st edn. Springer, New York
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536. https://doi.org/10.1038/323533a0
Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560. https://doi.org/10.1109/5.58337
Lyon R (2017) HTRU2. UCI machine learning repository. https://doi.org/10.24432/C5DK6R
Tavallaee M, Bagheri E, Lu W, Ghorbani AA (2009) A detailed analysis of the KDD CUP 99 data set. In: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pp 1–6. https://doi.org/10.1109/CISDA.2009.5356528
Song Y, Hyun S, Cheong Y-G (2021) Analysis of autoencoders for network intrusion detection. Sensors 21(13):4294. https://doi.org/10.3390/s21134294
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Int Res 16(1):321–357
Stolfo S, Fan W, Lee W, Prodromidis A, Chan P (1999) KDD Cup 1999 data. UCI machine learning repository. https://doi.org/10.24432/C51C7N
Loshchilov, I, Hutter F (2017) SGDR: stochastic gradient descent with warm restarts
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: an imperative style, high-performance deep learning library. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Curran Associates Inc, Red Hook, NY, USA, pp 8026–8037
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. https://www.tensorflow.org/
Funding
Open access funding provided by Eötvös Loránd University. Béla J. Szekeres was supported by the Project No. 2019-1.3.1-KK-2019-00011 financed by the National Research, Development and Innovation Fund of Hungary under the Establishment of Competence Centers, Development of Research Infrastructure Programme funding scheme. Ferenc Izsák was supported by the National Research, Development and Innovation Office within the framework of the Thematic Excellence Program 2021 - National Research Sub programme: “Artificial intelligence, large networks, data security: mathematical foundation and applications”
Author information
Authors and Affiliations
Contributions
The authors contributed equally to this work.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no Conflict of interest. We certify that the submission is original work and is not under review at any other publication.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Szekeres, B.J., Izsák, F. On the computation of the gradient in implicit neural networks. J Supercomput 80, 17247–17268 (2024). https://doi.org/10.1007/s11227-024-06117-6
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-024-06117-6