Shao-Qun Zhang a,b,1, Zong-Yi Chen b, Yong-Ming Tian b, Xun Lu b
1 Shao-Qun Zhang is the corresponding author. Email: [email protected]. Other authors made equal contributions.
a National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
b School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
(May 2, 2024)
Abstract
Past decades have witnessed great interest in the distinction and connection between neural network learning and kernel learning. Recent advancements have made theoretical progress in connecting infinitely wide neural networks and Gaussian processes. Two predominant approaches have emerged: the Neural Network Gaussian Process (NNGP) and the Neural Tangent Kernel (NTK). The former, rooted in Bayesian inference, represents a zero-order kernel, while the latter, grounded in the tangent space of gradient descent, is a first-order kernel. In this paper, we present the Unified Neural Kernel (UNK), which characterizes the learning dynamics of neural networks with gradient descent and parameter initialization. The proposed UNK kernel maintains the limiting properties of both NNGP and NTK, exhibiting behaviors akin to NTK with a finite learning step and converging to NNGP as the learning step approaches infinity. We also theoretically characterize the uniform tightness and learning convergence of the UNK kernel, providing comprehensive insights into this unified kernel. Experimental results underscore the effectiveness of our proposed method.
1 Introduction
While neural network learning is successful in a number of applications, it is not yet well understood theoretically [23]. Recently, there has been an increasing amount of literature exploring the correspondence between infinitely wide neural networks and Gaussian processes [17]. Researchers have identified equivalence between the two in various architectures [8, 19, 27]. This equivalence facilitates precise approximations of the behavior of infinitely wide Bayesian neural networks without resorting to variational inference. Correspondingly, it also allows for the characterization of the distribution of randomly initialized neural networks optimized by gradient descent, eliminating the need to actually run an optimizer for such analyses.
The standard investigation in this field encompasses the Neural Network Gaussian Process (NNGP) [12], which establishes that a neural network converges statistically to a Gaussian process as its width approaches infinity. The NNGP kernel inherently induces a posterior distribution that aligns with the feed-forward inference of infinitely wide Bayesian neural networks employing an i.i.d. Gaussian prior. Another typical work is the Neural Tangent Kernel (NTK) [11], where the function of a neural network trained through gradient descent converges to the kernel gradient of the functional cost as the width of the neural network tends to infinity. The NTK kernel captures the learning dynamics wherein learned parameters are closely tied to their initialization, resembling an i.i.d. Gaussian prior. These two kernels, derived from neural networks, exhibit distinct characteristics based on different initializations and regularization. A notable contrast is that the NNGP, rooted in Bayesian inference, represents a zero-order kernel that is more suitable for describing the overall characteristics of neural network learning. In contrast, the NTK, rooted in the tangent space of gradient descent, is a first-order kernel that is adept at capturing local characteristics of neural network learning. Empirical evidence provided by Lee et al. [13] demonstrates the divergent generalization performances of these two kernels across various datasets.
In this paper, we undertake an endeavor to unify both the NNGP and NTK kernels and present the Unified Neural Kernel (UNK) as a cohesive framework for neural network learning. By leveraging the learning dynamics associated with gradient descents and parameter initialization, we delve into theoretical characterizations, including but not limited to the existence, limiting properties, uniform tightness, and learning convergence of the proposed UNK kernel. Our theoretical investigations reveal that the UNK kernel exhibits behaviors reminiscent of the NTK kernel with a finite learning step and converges to the NNGP kernel as the learning step approaches infinity. This contribution not only significantly expands the scope of the existing elegant theory connecting kernel learning and neural network learning, but also represents a substantial step toward unraveling the true intricacies of deep learning.
Our main contributions can be summarized as follows:
• We propose the UNK kernel, built upon the learning dynamics associated with gradient descents and parameter initialization, which unifies the limiting properties of both the NTK and NNGP kernels.
• We theoretically investigate the asymptotic behaviors of the proposed UNK kernel, in which the UNK kernel is uniformly tight on the space of continuous functions and maintains a tight bound for the smallest eigenvalue.
• We conduct experiments on benchmark datasets using various configurations. The numerical results further underscore the effectiveness of our proposed method.
The rest of this paper is organized as follows. Section 2 introduces useful notations, terminologies, and related studies. Section 3 presents the UNK kernel with in-depth discussions and proof sketches. Section 4 shows the uniform tightness and convergence of the UNK kernel. Section 5 conducts numerical experiments. Section 6 concludes our work.
2 Preliminary
This section will introduce useful notations, terminologies, and related studies.
2.1 Notations
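Let $[N] = \{1, 2, \dots, N\}$ be an integer set for $N \in \mathbb{N}^+$, and $|\cdot|_{\#}$ denotes the number of elements in a collection, e.g., $|[N]|_{\#} = N$. Given two functions $g, h: \mathbb{N}^+ \to \mathbb{R}$, we denote by $h = \Theta(g)$ if there exist positive constants $c_1$, $c_2$, and $n_0$ such that $c_1 g(n) \le h(n) \le c_2 g(n)$ for every $n \ge n_0$; $h = \mathcal{O}(g)$ if there exist positive constants $c$ and $n_0$ such that $h(n) \le c\, g(n)$ for every $n \ge n_0$; $h = \Omega(g)$ if there exist positive constants $c$ and $n_0$ such that $h(n) \ge c\, g(n)$ for every $n \ge n_0$. We define the globe $\mathcal{B}(r) = \{ x \mid \|x\|_2 \le r \}$ for any $r \in \mathbb{R}^+$. Let $I_n$ be the $n \times n$-dimensional identity matrix. Let $\|\cdot\|_p$ be the norm of a vector or matrix, in which we employ $p = 2$ as the default. Given $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, we also define the sup-related measure as $\|x - y\|_{\alpha}^{\sup} = \sup_{i \in [n]} |x_i - y_i|^{\alpha}$ for $\alpha > 0$.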
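Let $\mathcal{C}(\mathbb{R}^{d}; \mathbb{R})$ be the space of continuous functions where $d \in \mathbb{N}^+$. Provided a linear and bounded functional $\mathcal{F}$ and a function $f$ which satisfies $f_t \xrightarrow{d} f$, then we have $\mathcal{F}(f_t) \xrightarrow{d} \mathcal{F}(f)$ and $\mathbb{E}[\mathcal{F}(f_t)] \to \mathbb{E}[\mathcal{F}(f)]$ according to the General Transformation Theorem [26, Theorem 2.3] and Uniform Integrability [4], respectively.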
Throughout this paper, we use the specific symbol $K$ to denote the concerned kernel for neural network learning. The superscript $(l)$ and stamp $t$ are used for recording the indexes of hidden layers and training epochs, respectively. We denote the Gaussian distribution by $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ indicate the mean and variance, respectively. In general, we employ $\mathbb{E}(\cdot)$ and $\mathrm{Var}(\cdot)$ to denote the expectation and variance, respectively.
2.2 NNGP and NTK
We start this work with an -hidden-layer fully-connected neural networks, where and indicate the number of neurons in the -th hidden layer for and input, respectively, as follows
(1)
in which and indicate the variables of inputs respectively, and denote the pre-synaptic and post-synaptic variables of the -th hidden layer respectively, and are the parameter variables of connection weights and bias respectively, and is an element-wise activation function. For convenience, we here note the parameter variables at the -th epoch as , and denotes the initialized parameters, of which the value obeys the Gaussian distribution .
Neural Network Gaussian Process (NNGP). For any , there is a claim that the conditional variable obeys the Gaussian distribution. In detail, one has , where and denote the dot product and this equality holds according to , , and the mutual independence of elements and . It is reasonable to conjecture that according to the principle of mathematical induction and , where . Hence, one has
Moreover, the NNGP kernel is defined by
with
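To make the layer-wise NNGP recursion concrete, the following minimal Python sketch evaluates the kernel of a fully-connected ReLU network in closed form, using the well-known arc-cosine expectation for a pair of jointly Gaussian pre-activations. The names `relu_expectation` and `nngp_relu`, the variance parameters `sigma_w2` and `sigma_b2`, and the normalization of the input inner product by the input dimension are illustrative assumptions that follow a common convention in the NNGP literature; they are not necessarily the exact parameterization used in this paper.

```python
import numpy as np


def relu_expectation(kxx, kxy, kyy):
    """E[relu(u) * relu(v)] for (u, v) ~ N(0, [[kxx, kxy], [kxy, kyy]])."""
    norm = np.sqrt(kxx * kyy)
    cos_theta = np.clip(kxy / norm, -1.0, 1.0)  # guard against rounding errors
    theta = np.arccos(cos_theta)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)


def nngp_relu(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel matrix for the output of a ReLU network with `depth` hidden layers.

    X holds one input per row. sigma_w2 and sigma_b2 are the (assumed) weight
    and bias variances, shared by all layers.
    """
    n, d = X.shape
    # Kernel of the first-layer pre-activations.
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d
    for _ in range(depth):
        diag = np.diag(K)
        # Propagate the kernel through one ReLU layer.
        K = sigma_b2 + sigma_w2 * relu_expectation(diag[:, None], K, diag[None, :])
    return K


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
K = nngp_relu(X, depth=3)
```

With `sigma_w2 = 2` and `sigma_b2 = 0`, the diagonal of the kernel is preserved from layer to layer, which gives a quick sanity check of the recursion.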
Neural Tangent Kernel (NTK). The training of the concerned ANNs consists in optimizing in the function space, supervised by a functional loss , such as the square or cross-entropy functions, where we employ to denote the variable of any parameter
For any , there is a claim that the gradient variable vector obeys the Gaussian distribution. Taking as an example, one has for , where adopts the dot operation. Hence, one has
where . Moreover, the NTK kernel is defined by
with
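For comparison, the NTK of the same kind of network can be sketched with the standard recursion from the NTK literature, in which the tangent kernel at each layer equals the NNGP kernel of that layer plus the previous tangent kernel multiplied element-wise by the expected product of activation derivatives. The Python sketch below is a minimal illustration for ReLU; the function name `ntk_relu` and the variance parameters `sigma_w2` and `sigma_b2` are assumptions made for this example rather than the notation of this paper.

```python
import numpy as np


def ntk_relu(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Return (NNGP kernel, NTK) of a ReLU network with `depth` hidden layers.

    X holds one input per row. sigma_w2 and sigma_b2 are the (assumed) weight
    and bias variances, shared by all layers.
    """
    n, d = X.shape
    sigma = sigma_b2 + sigma_w2 * (X @ X.T) / d  # first-layer NNGP kernel
    theta = sigma.copy()                         # first-layer tangent kernel
    for _ in range(depth):
        std = np.sqrt(np.diag(sigma))
        norm = np.outer(std, std)
        cos_a = np.clip(sigma / norm, -1.0, 1.0)
        angle = np.arccos(cos_a)
        # E[relu(u) relu(v)] and E[relu'(u) relu'(v)] for the Gaussian pair (u, v).
        e_act = norm * (np.sin(angle) + (np.pi - angle) * cos_a) / (2.0 * np.pi)
        e_der = (np.pi - angle) / (2.0 * np.pi)
        sigma = sigma_b2 + sigma_w2 * e_act
        theta = sigma + theta * (sigma_w2 * e_der)
    return sigma, theta


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
depth = 3
sigma, theta = ntk_relu(X, depth)
```

Under `sigma_w2 = 2` and `sigma_b2 = 0`, the diagonal of the tangent kernel grows linearly with depth, namely `(depth + 1)` times the diagonal of the NNGP kernel, whereas the NNGP diagonal stays constant.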
2.3 Related Studies
Past decades have witnessed a growing interest in the correspondence between neural network learning and Gaussian processes. Neal [17] presented the seminal work by showing that a one-hidden-layer network of infinite width turns into a Gaussian process. Cho and Saul [6] linked multi-layer networks using rectified polynomial activation with compositional Gaussian kernels. Lee et al. [12] showed that infinitely wide fully connected neural networks with commonly used activation functions can converge to Gaussian processes. Recently, the NNGP has been scaled to many types of networks, including Bayesian networks [19], deep networks with convolution [8], and recurrent networks [27].
NNGPs can provide a quantitative characterization of how likely certain outcomes are if some aspects of the system are not exactly known. In the experiments of [12], an explicit estimate in the form of variance prediction is given to each test sample. Besides, Pang et al. [20] showed that the NNGP is good at handling data with noise and is superior to discretizing differential operators in solving some linear or nonlinear partial differential equations. Park et al. [21] employed the NNGP kernel in the performance measurement of network architectures for the purpose of speeding up the neural architecture search. Pleiss and Cunningham [22] leveraged the effects of width on the capacity of neural networks by decoupling the generalization and width of the corresponding NNGP. Despite great progress, numerous studies about NNGP still rely on increasing width to induce the Gaussian processes. Recently, Zhang et al. [28] proposed a depth paradigm that achieves an NNGP by increasing depth, providing complementary support for the existing theory of NNGP.
The NTK kernel, first proposed by Jacot et al. [11], relates a neural network trained by randomly initialized gradient descent with a Gaussian distribution. It has been proved that many types of networks, including graph neural networks on bioinformatics datasets [7] and convolutional neural networks [1] on medium-scale datasets like the UCI database, can derive a corresponding kernel function. Some researchers applied NTK to various fields, such as federated learning [10], mean-field analysis [14], and natural language processing [15]. Recently, Hron et al. [9] extended the NNGP and NTK to multi-head attention architectures as the number of heads tends to infinity. Avidan et al. [3] provided a unified theoretical framework that connects NTK and NNGP using the Markov proximal learning model.
3 The Unified Kernel
This work considers a general form of supervised learning
(2)
where is a regularizer and is the corresponding multiplier. Based on gradient descent, Eq. (2) generally leads to a dynamical system with respect to parameter
(3)
where we omit the learning rate for simplicity. From Eq. (3), the value of can be regarded as a balance between the gradient and regularizer. In the next subsections, we will employ the initialized and epoch-related parameter to implement , where both regularization implementations induce the UNK kernel. Furthermore, Subsection 5.2 provides in-depth discussions about the effect of on the performance of the UNK kernel.
3.1 Initialization Parameter
In this work, we first consider leveraging the effects of initialized parameters (for example, one just employs the square regularizer in Eq. (3)), and thus Eq. (3) becomes
(4)
where is the initialized parameter and takes a tradeoff between parameter gradient and initialization.
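The trade-off between the loss gradient and the initialization can be illustrated on a toy problem. The Python sketch below assumes a square regularizer centered at the initialized parameters, so that each discrete update combines the loss gradient with a term that pulls the parameters back toward their initial values, weighted by a multiplier `lam`. This particular regularizer, the least-squares loss, the function name `train_with_init_regularizer`, and all hyper-parameter values are assumptions made for illustration only; the sketch is one plausible discretization of such dynamics, not the exact update analyzed in this paper.

```python
import numpy as np


def train_with_init_regularizer(A, y, theta0, lam, lr=0.01, steps=500):
    """Gradient descent on mean squared error plus (lam / 2) * ||theta - theta0||^2."""
    theta = theta0.copy()
    n = A.shape[0]
    for _ in range(steps):
        grad_loss = A.T @ (A @ theta - y) / n  # gradient of the data-fitting loss
        grad_reg = lam * (theta - theta0)      # pull toward the initialization
        theta = theta - lr * (grad_loss + grad_reg)
    return theta


rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
y = A @ rng.normal(size=5)
theta0 = rng.normal(size=5)

plain = train_with_init_regularizer(A, y, theta0, lam=0.0)      # plain gradient descent
anchored = train_with_init_regularizer(A, y, theta0, lam=10.0)  # strongly tied to theta0

dist_plain = np.linalg.norm(plain - theta0)
dist_anchored = np.linalg.norm(anchored - theta0)
```

With `lam = 0` the procedure reduces to plain gradient descent, while a larger multiplier keeps the trained parameters closer to their initialization, so `dist_anchored` is smaller than `dist_plain`.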
Now, we present our main conclusion as follows.
Theorem 1
For a network of depth with a Lipschitz activation and in the limit of the layer width , Eq. (4) induces a kernel with the following form, for and ,
(5)
where is the correlation coefficients of variables along training epoch , and , and denote the variance and correlation coefficients of variables along training epoch 0 and , respectively. Furthermore, has the following properties of limiting kernels
(i)
For the case of or , the unified kernel is degenerated as the NTK kernel. Formally, for , the followings hold
(ii)
For the case of and , the unified kernel equals to the NNGP kernel, i.e., the following holds for as
Theorem 1 presents the existence and explicit formulation of the unified kernel that corresponds to Eq. (4) for neural network learning. For the case of or , the proposed kernel can be degenerated as the NTK kernel, where the parameter updating obeys the Gaussian distribution. Relatively, for the case of and , the proposed kernel can approximate the NNGP kernel well, which implies that a neural network model trained by Eq. (4) can reach an equilibrium state in a long-time regime. The proof sketch is listed in Subsection 3.3, and the full proof can be accessed in Appendix.
Similar to the NNGP and NTK kernels, the unified kernel is also of a recursive form, that is,
(6)
3.2 Epoch-related Parameter
From Eq. (6), it is observed that the unified kernel of the -th hidden layer at epoch can be computed recursively from a combination of the unified kernel of the -th hidden layer at epoch and the NNGP kernel of the -th hidden layer at epoch . Inspired by this recognition, we extend the fundamental formula in Eq. (4) as
(7)
given . Obviously, Eq. (7) has a general updating formulation, taking Eq. (4) as a special case of . However, Eq. (7) leads to a more general updating paradigm. For example, may indicate a collection of pre-given parameters from pre-training or meta-learning, so that Eq. (7) becomes an optimization computation for fine-tuning. Further, the derived kernel may support the theoretical analysis of the fine-tuning learning after pre-training. The effectiveness of Eq. (7) will be demonstrated in Section 5.
We directly provide the theoretical framework of unified kernels relative to the parameter updating in Eq. (7).
Theorem 2
For a network of depth with a Lipschitz activation and in the limit of the layer width , Eq. (7) induces a kernel with the following form, for and ,
(8)
where denotes the correlation coefficient of variables along training epochs and , and and are the corresponding variances. Furthermore, the unified kernel has the following properties
(i)
For the case of or , the unified kernel degenerates as the NTK kernel, that is, for
(ii)
For the case of and , the unified kernel equals to the NNGP kernel, i.e., the following holds for as ,
Theorem 2, a general extension of Theorem 1, presents a unified kernel for neural network learning with Eq. (7). For the case of or , the proposed kernel can be degenerated as the NTK kernel, where the parameter updating obeys the Gaussian distribution. Relatively, for the case of and , the proposed kernel can approximate the NNGP kernel well, which implies that a neural network model trained by Eq. (7) can reach an equilibrium state in a long time regime. We provide a proof sketch in Subsection 3.3; the full proof can be accessed in Appendix.
It is observed that the unified kernel led by Eq. (7) can be re-written in a recursive form
(9)
3.3 Proof Sketch
It is obvious that Eq. (4) is a special case of Eq. (7) when one forces . We start this proof with unfolding Eq. (7) in the following discrete form
where and represent the epoch stamps in which denotes the epoch infinitesimal. According to the mathematical induction, we can employ drawn from the Gaussian distribution . By direct computations, we have
where . Notice that is almost independent to as . It is observed that converges as and . Thus, the variable sequence is bounded. Here, we define that and . Let denote the probability density function of . Thus, we have
(10)
with
where and indicates the Dirac-delta function. According to the independence, one has
(11)
where . Thus, we can claim that obeys the Gaussian distribution with zero mean, which completes the mathematical induction.
All statistics of post-synaptic variables can be calculated via the moment generating function . Here, we focus on the second moment of for and , that is,
(12)
By substituting Eq. (10) into Eq. (12), we can obtain the concerned kernel
It is observed that Eq. (5) equals the NTK kernel in the case of and . Similarly, it is easily proved that
where . The above formula reveals that a smaller absolute value of may lead to a larger convergence rate. Thus, we have
as . The detailed proof can be accessed in the Appendix.
4 Uniform Tightness and Convergence
Here, we provide two theorems to further show the theoretical properties of the proposed UNK kernel.
4.1 Uniform Tightness of the UNK Kernel
Now, we present the following theorem.
Theorem 3
For any , the unified kernel , described in Theorem 2, is uniformly tight in .
Theorem 3 delineates the asymptotic behavior of as for , revealing an intrinsic characteristic of uniform tightness. Based on Theorem 3, one can obtain the properties of functional limit and continuity of , in analogy to those of bracale2020:asymptotic .
Theorem 3 builds upon three useful lemmas from [28].
Lemma 4.4
Let denote a sequence of random variables in . This stochastic process is uniformly tight in , if the following two hold:
(1) is a uniformly tight point of () in ;
(2) for any , and , there exist , such that
Lemma 4.4 shows core guidance for proving Theorem 3.
Lemma 4.5
Based on the notations of Lemma 4.4, is a uniformly tight point of () in .
The convergence in distribution from Lemma 4.5 paves the way for the convergence of expectations.
Lemma 4.6
Based on the notations of Lemma 4.4, for any and , there exist , such that .
The proofs of lemmas above can be accessed from Appendix D. Notice that the above lemmas take the stochastic process of hidden neuron vectors with increasing epochs regardless of the layer index, i.e., the above lemmas hold for . For the case of two stamps and where , the concerned stochastic process becomes , and thus the above conclusions also hold. Therefore, Theorem 3 can be completely proved by invoking Lemmas 4.5 and 4.6 into Lemma 4.4.
4.2 Tight Bound for the Smallest Eigenvalue
In this subsection, we investigate the learning convergence of the UNK kernel. The key idea is to bound the smallest eigenvalue of the UNK kernel, since the learning convergence is related to the positive definiteness of the limiting neural kernels. Here, we consider neural networks equipped with the ReLU activation and then draw the following conclusion.
Theorem 4.7
Let be i.i.d. sampled from , which satisfies that , , , and . For an integer , with probability , we have
for , where denotes the smallest eigenvalue and
Theorem 4.7 provides a tight bound for the smallest eigenvalue of the UNK kernel, which is closely related to the training convergence of neural networks. This nontrivial estimate mirrors the characteristics of this kernel and is usually used as a key assumption for optimization and generalization. The proof of Theorem 4.7 rests on the following inequalities about the smallest eigenvalue of real-valued symmetric square matrices. Given two symmetric matrices, it is observed that
(13)
From Eq. (9), we can unfold as a sum of covariance of the sequence of random variables . Thus, we can bound by via a chain of feedforward compositions in Eq. (1). For conciseness, we put the proof of Theorem 4.7 into Appendix E.
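Two classical inequalities of this type are Weyl's inequality for sums of symmetric matrices and the Schur-product bound for element-wise products of positive semi-definite matrices. The Python sketch below checks both numerically on random positive semi-definite matrices. These two inequalities are offered as representative examples of smallest-eigenvalue bounds used in this line of analysis; whether they coincide exactly with Eq. (13) is an assumption, and the helper names `lambda_min` and `random_psd` are introduced only for this example.

```python
import numpy as np


def lambda_min(M):
    """Smallest eigenvalue of a real symmetric matrix."""
    return np.linalg.eigvalsh(M)[0]  # eigvalsh returns eigenvalues in ascending order


def random_psd(n, rng):
    """Random positive semi-definite matrix of size n x n."""
    G = rng.normal(size=(n, n))
    return G @ G.T


rng = np.random.default_rng(0)
P = random_psd(6, rng)
Q = random_psd(6, rng)

# Weyl: lambda_min(P + Q) >= lambda_min(P) + lambda_min(Q).
weyl_gap = lambda_min(P + Q) - (lambda_min(P) + lambda_min(Q))

# Schur product: lambda_min(P * Q) >= lambda_min(P) * min_i Q[i, i],
# where P * Q is the element-wise (Hadamard) product of PSD matrices.
schur_gap = lambda_min(P * Q) - lambda_min(P) * np.min(np.diag(Q))
```

Both gaps are non-negative up to rounding error. Inequalities of this kind allow the smallest eigenvalue of a kernel built from sums and element-wise products of simpler kernels to be bounded from below term by term.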
5 Experiments
In this section, we conduct several experiments to evaluate the effectiveness of the proposed UNK kernel.
5.1 Datasets and Configurations
Following the experimental configurations of Lee et al. [12], we conduct the empirical evaluations on a two-hidden-layer MLP trained with various multipliers. The dataset is the MNIST handwritten digit data, which comprises a training set of 60,000 examples and a testing set of 10,000 examples in 10 classes, where each example is centered in a $28 \times 28$ image.
For the classification tasks, the class labels are encoded in a regression format, where the correct label is marked as 0.9 and each incorrect one is marked as 0.1 [28]. Here, we employ 5000 hidden neurons and the softmax activation function. Similar to [2], all weights are initialized with a Gaussian distribution of mean 0 and a layer-dependent variance. We also set the batch size and the learning rate to 64 and 0.001, respectively. All experiments were conducted on an Intel Core-i7-6500U.
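The label encoding described above can be sketched in a few lines of Python. The function name `encode_labels` and its argument names are hypothetical and chosen only for this illustration.

```python
import numpy as np


def encode_labels(labels, num_classes=10, on_value=0.9, off_value=0.1):
    """Encode integer class labels as regression targets (0.9 correct, 0.1 otherwise)."""
    labels = np.asarray(labels)
    targets = np.full((labels.shape[0], num_classes), off_value)
    targets[np.arange(labels.shape[0]), labels] = on_value
    return targets


targets = encode_labels([3, 0, 9])
```

Each row of `targets` contains a single entry equal to 0.9 at the position of the correct class, and the predicted class can still be recovered with an arg-max over each row.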
5.2 Experiments for Effects of Various Multipliers
The experiments aim to leverage the effects of various on the performance of the UNK kernel. According to the recursive formulation of , it is evident that balances the gradient and regularizer. From the perspective of theoretical effects, the absolute value of indicates not only the limiting convergence rate of but also the optimal solution of Eq. (2). Provided , we can compute the optimal solution at current epoch stamp as follows
(14)
where . This optimization problem can be solved by some mature algorithms, such as Bayesian optimization or grid search. Here, we conjecture that is an effective indicator for identifying the optimal trajectory of the UNK kernel.
Here, we set the investigated values of the multiplier to and employ three types of studied models as follows
where the optimization problem in Eq. (14) is solved by grid search with granularities of 0.001 and 0.01, denoted as Grid 0.001 and Grid 0.01, respectively.
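The per-epoch grid search over the multiplier can be sketched as follows. The Python example below uses a toy least-squares problem in place of the MLP and assumes that, at each epoch, every candidate multiplier on the grid is tried for one update step and the candidate with the lowest training loss is kept. The greedy selection rule, the 11-point grid starting at zero (suggested by the ranges of values reported in Table 1), the square regularizer toward the initialization, and all function names are assumptions made for illustration.

```python
import numpy as np


def loss(A, y, theta):
    """Half mean squared error of a linear model."""
    return 0.5 * np.mean((A @ theta - y) ** 2)


def step(A, y, theta, theta0, lam, lr):
    """One gradient step on the loss plus (lam / 2) * ||theta - theta0||^2."""
    grad = A.T @ (A @ theta - y) / A.shape[0]
    return theta - lr * (grad + lam * (theta - theta0))


def grid_search_training(A, y, theta0, granularity, num_points=11, lr=0.05, epochs=20):
    """Pick the multiplier greedily at each epoch from a uniform grid."""
    grid = granularity * np.arange(num_points)  # e.g. 0, 0.01, ..., 0.1
    theta = theta0.copy()
    trajectory = []  # (chosen multiplier, training loss) per epoch
    for _ in range(epochs):
        candidates = [step(A, y, theta, theta0, lam, lr) for lam in grid]
        losses = [loss(A, y, c) for c in candidates]
        best = int(np.argmin(losses))
        theta = candidates[best]
        trajectory.append((float(grid[best]), float(losses[best])))
    return theta, trajectory


rng = np.random.default_rng(0)
A = rng.normal(size=(40, 4))
y = A @ rng.normal(size=4)
theta0 = rng.normal(size=4)
initial_loss = loss(A, y, theta0)
theta, trajectory = grid_search_training(A, y, theta0, granularity=0.01)
```

Because the grid contains zero, each epoch is at least as good as a plain gradient step from the current parameters, so the recorded training loss never increases in this toy setting.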
Figure 1 draws various multipliers and the corresponding accuracy curves. There are several observations that (1) the performance of the training algorithms led by Eq. (2) is comparable to those of typical gradient descent in various configurations, (2) and are too large to hamper the performance of the UNK kernel, and (3) Grid 0.01 provides a starting point for higher accuracy and achieves the fastest convergence speed and best accuracy. The above observations not only show the effectiveness of our proposed UNK kernel, but also coincide with our theoretical conclusions that the UNK kernel converges to the NNGP kernel as and a smaller value of may lead to a larger convergence rate.
In detail, Table 1 lists the optimal trajectory and the corresponding training accuracy of Grid 0.001 and Grid 0.01 over the epoch. It is observed that (1) the optimal trajectory of the UNK kernel and the path of typical gradient descent are not completely consistent, and (2) both Grid 0.001 and Grid 0.01 achieve faster convergence speed and better accuracy than those of the baseline methods. These results further demonstrate the effectiveness of our proposed UNK kernel.
| Epoch | Baseline ACC. | Grid 0.001 $\lambda$ | Grid 0.001 ACC. | Grid 0.01 $\lambda$ | Grid 0.01 ACC. |
|---|---|---|---|---|---|
| 1 | 0.1289 | 0.0100 | 0.9257 | 0.0800 | 0.9266 |
| 2 | 0.9256 | 0.0020 | 0.9506 | 0.0800 | 0.9521 |
| 3 | 0.9504 | 0.0040 | 0.9631 | 0.0900 | 0.9656 |
| 4 | 0.9629 | 0.0080 | 0.9708 | 0.0700 | 0.9737 |
| 5 | 0.9705 | 0.0070 | 0.9766 | 0.0900 | 0.9793 |
| 6 | 0.9763 | 0.0050 | 0.9802 | 0.1000 | 0.9839 |
| 7 | 0.9800 | 0.0060 | 0.9834 | 0.1000 | 0.9870 |
| 8 | 0.9831 | 0.0000 | 0.9858 | 0.0800 | 0.9899 |
| 9 | 0.9855 | 0.0080 | 0.9879 | 0.0500 | 0.9922 |
| 10 | 0.9875 | 0.0000 | 0.9898 | 0.0900 | 0.9939 |
| 11 | 0.9896 | 0.0000 | 0.9913 | 0.0600 | 0.9952 |
| 12 | 0.9910 | 0.0000 | 0.9923 | 0.0600 | 0.9963 |
| 13 | 0.9922 | 0.0040 | 0.9933 | 0.0700 | 0.9971 |
| 14 | 0.9931 | 0.0020 | 0.9943 | 0.0800 | 0.9977 |
| 15 | 0.9941 | 0.0020 | 0.9952 | 0.0500 | 0.9984 |
| 16 | 0.9949 | 0.0080 | 0.9959 | 0.0700 | 0.9987 |
| 17 | 0.9957 | 0.0060 | 0.9966 | 0.0900 | 0.9992 |
| 18 | 0.9963 | 0.0070 | 0.9972 | 0.0700 | 0.9995 |
| 19 | 0.9969 | 0.0070 | 0.9977 | 0.0000 | 0.9996 |
| 20 | 0.9974 | 0.0100 | 0.9981 | 0.0800 | 0.9998 |
| 21 | 0.9978 | 0.0070 | 0.9984 | 0.0100 | 0.9997 |
| 22 | 0.9982 | 0.0100 | 0.9986 | 0.0200 | 0.9999 |
| 23 | 0.9984 | 0.0050 | 0.9987 | 0.0000 | 0.9999 |
| 24 | 0.9986 | 0.0000 | 0.9989 | 0.0000 | 0.9999 |
| 25 | 0.9988 | 0.0050 | 0.9990 | 0.0000 | 0.9999 |
| 26 | 0.9989 | 0.0030 | 0.9992 | 0.0000 | 1.0000 |

Table 1: Illustration of the multiplier $\lambda$ and the corresponding training accuracy (ACC.) of Grid 0.001 and Grid 0.01 over the training epochs.
5.3 Experiments for the UNK kernel
This experiment investigates the representation ability of our proposed UNK kernel. The indicator is computed as
where indicates the -th instance, and denotes the UNK kernel trained by solving Eq. (14)
The value of manifests the correlation between outputs of the UNK kernels with initialized and optimized parameters. According to the theoretical results in Section 3, the UNK kernel is said to be valid if the kernel outputs brought by initialized and optimized parameters are markedly discriminative. In other words, a valid UNK is able to classify digits well in this experiment, and thus should equal , where the first 0.1 and 1 denote the accuracy of the UNK with initialized and optimized parameters, respectively. Ideally, the value of in this experiment should trend towards 0.1, that is, . If comes near one, the kernel cannot recognize the difference between the kernel output brought by initialized and optimized parameters, and thus the kernel is invalid.
Figure 2 displays the (training and testing) correlation histograms and the averages for our proposed UNK kernel with the grid search granularity of 0.001 and 0.01. It is observed that the average training correlation values of Grid 0.001 and Grid 0.01 are almost 0.13 as training accuracy goes to 100%, which implies that the trained UNK kernel is valid for classifying MNIST. This is a laudable result for the theory and development of neural kernel learning.
Notice that the average training correlation values for Grid 0.001 and Grid 0.01 are not precisely equal to 0.1, and the average testing correlation values for Grid 0.001 and Grid 0.01 are approximately 0.2 instead of the stated value of 0.1. These discrepancies could be attributed to several factors, including gaps between the softmax and labeled vectors and out-of-distribution errors. More detailed experimental results are listed in Appendix F.
6 Conclusions
In this paper, we proposed the UNK kernel, a unified framework for neural network learning that draws upon the learning dynamics associated with gradient descents and parameter initialization. Our investigation explores theoretical aspects, such as the existence, limiting properties, uniform tightness, and learning convergence of the proposed UNK kernel. Our main findings highlight that the UNK kernel exhibits behaviors akin to the NTK kernel with a finite learning step and converges to the NNGP kernel as the learning step approaches infinity. Experimental results further emphasize the effectiveness of our proposed method.
Impact Statements
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
(1)
S. Arora, S. S. Du, W. hu, Z. Li, R. R. Salakhutdinov, and R. Wang.
On exact computation with an infinitely wide neural net.
In Advances in Neural Information Processing Systems 32, pages
8141–8150, 2019.
(2)
S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang.
Fine-grained analysis of optimization and generalization for
overparameterized two-layer neural networks.
In Proceedings of the 36th International Conference on Machine
Learning, pages 322–332, 2019.
(3)
Y. Avidan, Q. Li, and H. Sompolinsky.
Connecting NTK and NNGP: A unified theoretical framework for
neural network learning dynamics in the kernel regime.
arXiv:2309.04522, 2023.
(4)
P. Billingsley.
Convergence of Probability Measures.
John Wiley & Sons, 2013.
(5)
D. Bracale, S. Favaro, S. Fortini, and S. Peluchetti.
Large-width functional asymptotics for deep gaussian neural networks.
In Proceedings of the 8th International Conference on Learning
Representations, 2020.
(6)
Y. Cho and L. Saul.
Kernel methods for deep learning.
In Advances in Neural Information Processing Systems 22, pages
342–350, 2009.
(7)
S. S. Du, K. Hou, R. R. Salakhutdinov, B. Poczos, R. Wang, and K. Xu.
Graph neural tangent kernel: Fusing graph neural networks with graph
kernels.
In Advances in Neural Information Processing Systems 32, pages
5723 – 5733, 2019.
(8)
A. Garriga-Alonso, C. Rasmussen, and L. Aitchison.
Deep convolutional networks as shallow gaussian processes.
In Proceedings of the 7th International Conference on Learning
Representations, 2019.
(9)
J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak.
Infinite attention: NNGP and NTK for deep attention networks.
In Proceedings of the 37th International Conference on Machine
Learning, pages 4376–4386, 2020.
(10)
B. Huang, X. Li, Z. Song, and X. Yang.
FL-NTK: A neural tangent kernel-based framework for federated
learning analysis.
In Proceedings of the 38th International Conference on Machine
Learning, pages 4423–4434, 2021.
(11)
A. Jacot, F. Gabriel, and C. Hongler.
Neural tangent kernel: Convergence and generalization in neural
networks.
In Advances in Neural Information Processing Systems 31, pages
8580 – 8589, 2018.
(12)
J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and
J. Sohl-Dickstein.
Deep neural networks as gaussian processes.
In Proceedings of the 6th International Conference on Learning
Representations, 2018.
(13)
J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and
J. Sohl-Dickstein.
Finite versus infinite neural networks: An empirical study.
In Advances in Neural Information Processing Systems 33, pages
15156–15172, 2020.
(14)
A. Mahankali, J. Z. Haochen, K. Dong, M. Glasgow, and T. Ma.
Beyond NTK with vanilla gradient descent: A mean-field analysis of
neural networks with polynomial width, samples, and time.
arXiv:2306.16361, 2023.
(15)
S. Malladi, A. Wettig, D. Yu, D. Chen, and S. Arora.
A kernel-based view of language model fine-tuning.
In Proceedings of the 40th International Conference on Machine
Learning, pages 23610–23641, 2023.
(16)
M. Mézard, G. Parisi, and M. A. Virasoro.
Spin glass theory and beyond: An Introduction to the Replica
Method and Its Applications.
World Scientific Publishing Company, 1987.
(17)
R. M. Neal.
Priors for infinite networks.
Bayesian Learning for Neural Networks, pages 29–53, 1996.
(18)
Q. Nguyen, M. Mondelli, and G. Montufar.
Tight bounds on the smallest eigenvalue of the neural tangent kernel
for deep relu networks.
In Proceedings of the 38th International Conference on Machine
Learning, pages 8119–8129, 2021.
(19)
R. Novak, L. Xiao, Y. Bahri, J. Lee, G. Yang, J. Hron, D. A. Abolafia,
J. Pennington, and J. Sohl-dickstein.
Bayesian deep convolutional networks with many channels are gaussian
processes.
In Proceedings of the 6th International Conference on Learning
Representations, 2018.
(20)
G. Pang, L. Yang, and G. E. Karniadakis.
Neural-net-induced gaussian process regression for function
approximation and PDE solution.
Journal of Computational Physics, 384:270–288, 2019.
(21)
D. S. Park, J. Lee, D. Peng, Y. Cao, and J. Sohl-Dickstein.
Towards NNGP-guided neural architecture search.
arXiv:2011.06006, 2020.
(22)
G. Pleiss and J. P. Cunningham.
The limitations of large width in neural networks: A deep gaussian
process perspective.
In Advances in Neural Information Processing Systems 34, pages
3349–3363, 2021.
(23)
T. Poggio, A. Banburski, and Q. Liao.
Theoretical issues in deep networks.
Proceedings of the National Academy of Sciences,
117(48):30039–30045, 2020.
(24)
Hector N Salas.
Gershgorin’s theorem for matrices of operators.
Linear Algebra and its Applications, 291(1-3):15–36, 1999.
(25)
D. Stroock and S. Varadhan.
Multidimensional Diffusion Processes.
Springer Science & Business Media, 1997.
(26)
A. W. Van der Vaart.
Asymptotic Statistics.
Cambridge University Press, 2000.
(27)
G. Yang.
Tensor programs I: Wide feedforward or recurrent neural networks of
any architecture are Gaussian processes.
In Advances in Neural Information Processing Systems 32, pages
9951–9960, 2019.
(28)
S.-Q. Zhang, F. Wang, and F.-L. Fan.
Neural network Gaussian processes by increasing depth.
IEEE Transactions on Neural Networks and Learning Systems,
2022.
(29)
S.-Q. Zhang and Z.-H. Zhou.
Arise: Aperiodic semi-parametric process for efficient markets
without periodogram and Gaussianity assumptions.
arXiv:2111.06222, 2021.
Appendix
This appendix provides the supplementary materials for our work “A Unified Kernel for Neural Network Learning”, constructed according to the corresponding sections therein. Before that, we first review the useful notations. Let be an integer set for , and denotes the number of elements in a collection, e.g., . Given two functions , we denote by if there exist positive constants , and such that for every ; if there exist positive constants and such that for every ; if there exist positive constants and such that for every . We define the globe for any . Let be the -dimensional identity matrix. Let be the norm of a vector or matrix, in which we employ as the default. Given and , we also define the sup-related measure as for .
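For reference, the three asymptotic relations described above match the standard $\Theta$, $\mathcal{O}$, and $\Omega$ notation. The block below is a sketch under that assumption; the symbols $h$, $g$, $c$, $c_1$, $c_2$, and $x_0$ are placeholders chosen here and may differ from those used in the manuscript.

```latex
\begin{align*}
h(x) = \Theta(g(x)) &\iff \exists\, c_1, c_2, x_0 > 0:\ c_1\, g(x) \le h(x) \le c_2\, g(x) \quad \forall x \ge x_0,\\
h(x) = \mathcal{O}(g(x)) &\iff \exists\, c, x_0 > 0:\ h(x) \le c\, g(x) \quad \forall x \ge x_0,\\
h(x) = \Omega(g(x)) &\iff \exists\, c, x_0 > 0:\ h(x) \ge c\, g(x) \quad \forall x \ge x_0.
\end{align*}
```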
Let be the space of continuous functions where . Given a linear and bounded functional and a function that satisfies , we have and according to the General Transformation Theorem [26, Theorem 2.3] and Uniform Integrability [4], respectively.
Throughout this paper, we use the specific symbol to denote the concerned kernel for neural network learning. The superscript and stamp are used for recording the indexes of hidden layers and training epochs, respectively. We denote the Gaussian distribution by , where and indicate the mean and variance, respectively. In general, we employ and to denote the expectation and variance, respectively.
Appendix A Theoretical Derivations of NNGP and NTK
A.1 NNGP and NTK
Here, we consider an -hidden-layer fully-connected neural network, where and indicate the number of neurons in the -th hidden layer for and the input layer, respectively, as follows
in which and indicate the variables of inputs, respectively, and denote the pre-synaptic and post-synaptic variables of the -th hidden layer, respectively, and are the parameter variables of connection weights and bias, respectively, and is an element-wise activation function. For convenience, we denote the parameter variables at the -th epoch as , and denotes the initialized parameters, whose elements obey the Gaussian distribution .
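The network described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `init_params` and `forward`, the `1/sqrt(fan_in)` weight scaling, the default variances, and the linear output layer are choices made here, not details confirmed by the manuscript.

```python
import numpy as np

def init_params(widths, sigma_w=1.0, sigma_b=0.1, rng=None):
    """Draw Gaussian weights and biases for a fully-connected network.

    widths = [n_0, n_1, ..., n_L, n_out]. The 1/sqrt(fan_in) weight scaling
    is an assumption made here so that pre-activations stay O(1) in width.
    """
    rng = np.random.default_rng() if rng is None else rng
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        params.append((W, b))
    return params

def forward(params, x, phi=np.tanh):
    """Return the output and the list of pre-synaptic variables per layer."""
    h = np.asarray(x, dtype=float)   # post-synaptic variable of the input layer
    pre_acts = []
    for i, (W, b) in enumerate(params):
        s = W @ h + b                # pre-synaptic variable of this layer
        pre_acts.append(s)
        # Assumption: the last layer is linear; hidden layers apply phi element-wise.
        h = s if i == len(params) - 1 else phi(s)
    return h, pre_acts
```

Calling `forward(init_params([3, 16, 16, 1]), [0.1, -0.2, 0.3])` returns a one-dimensional output together with the pre-synaptic vectors of the two hidden layers and the output layer.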
Neural Network Gaussian Process (NNGP). For any , there is a claim that the conditional variable obeys the Gaussian distribution. In detail, one has
where and denote the dot product, and the fourth equality holds according to , and the elements of and are mutually independent. Given , it is reasonable to assume that by the principle of mathematical induction, where
Hence, one has
Moreover, the NNGP kernel is defined by
with
In summary, we conclude the recursive form of the NNGP kernel as follows
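As an illustration of such a recursion, the sketch below evaluates the NNGP kernel for the special case of ReLU activations, where the Gaussian expectation has the well-known arc-cosine closed form. The function name `relu_nngp_kernel`, the input-layer convention, and the default variances (`sigma_w2 = 2`, `sigma_b2 = 0`) are assumptions made here and are not taken from the manuscript.

```python
import numpy as np

def relu_nngp_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel of a ReLU network with `depth` hidden layers on the rows of X.

    Assumes K^(0) = sigma_b2 + sigma_w2 * X X^T / d and uses the arc-cosine
    closed form for E[relu(u) relu(v)] with (u, v) ~ N(0, K^(l-1)).
    """
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        norm = np.outer(diag, diag)
        cos_t = np.clip(K / norm, -1.0, 1.0)   # clip guards against rounding
        theta = np.arccos(cos_t)
        # E[relu(u) relu(v)] = sqrt(K_xx K_x'x') (sin t + (pi - t) cos t) / (2 pi)
        expect = norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)
        K = sigma_b2 + sigma_w2 * expect
    return K
```

With `sigma_w2 = 2` and `sigma_b2 = 0`, the diagonal of the kernel is preserved from layer to layer, which gives a quick sanity check on an implementation.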
Neural Tangent Kernel (NTK). The training of the concerned ANNs consists in optimizing in the function space, supervised by a functional loss , such as the square or cross-entropy functions, where we employ to denote the variable of any parameter
The loss is monotonically decreasing as the training epoch since
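The monotonicity claim can be checked with the usual gradient-flow argument. The following is a sketch under the assumption that the parameters follow $\partial_t \theta_t = -\nabla_\theta \mathcal{L}(\theta_t)$, with the learning rate absorbed into time; the symbols $\theta_t$ and $\mathcal{L}$ are notation chosen here and may differ from the manuscript's.

```latex
\frac{\mathrm{d}\mathcal{L}(\theta_t)}{\mathrm{d}t}
= \left\langle \nabla_\theta \mathcal{L}(\theta_t),\, \frac{\mathrm{d}\theta_t}{\mathrm{d}t} \right\rangle
= -\left\| \nabla_\theta \mathcal{L}(\theta_t) \right\|_2^2 \;\le\; 0 .
```

Under this assumption the loss is non-increasing, and strictly decreasing wherever the gradient is non-zero.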
For any , there is a claim that the gradient variable vector obeys the Gaussian distribution. In detail, for , one has
All statistics of post-synaptic variables can be calculated via the moment generating function
Here, we focus on the second moment of for and , that is,
In the above equations, and denote the variables of hidden states and parameters, respectively. Let denote the probability density function of . According to the formulation of , we should compute the probability density function . For convenience, we abbreviate as throughout this proof.
According to the introduction in Section 3, Eq. (7) has a general updating formulation, taking Eq. (4) as a special case of . Hence, we here take a general formula as follows
where denotes the epoch infinitesimal. Here, we omit the learning rate for simplicity. Thus, we have
with
where , , and indicates the Dirac-delta function. Besides, one has
Notice that is almost independent of as . It is observed that converges as and . Thus, the variable sequence is bounded. Here, we define
Throughout this proof, we make the mild assumption of for simplicity; otherwise, we usually employ , instead of the above assumption, where denotes the correlation coefficient between variables of hidden states and .
Moreover, we have
where . Thus, we can conjecture that obeys the Gaussian distribution with zero mean. Suppose that and
Thus, we have
where corresponds to . The above equation can be extended to the vectorized formulation in detail, where provided and , one has
and
where indicates the corresponding variance matrix. Furthermore, provided two stamps and , we have
where
and
in which denotes the correlation coefficient between and . The estimation of the second moment has been written as a general formula, which can be solved by some mature statistical methods, such as the replica calculation [16].
By direct calculations, we can obtain the concerned kernel
or
for . Here, and are variables led by and , respectively. Similar to the NNGP and NTK kernels, the unified kernel is also of a recursive form as follows:
(15)
Next, we will analyze the limiting properties of .
•
In the case of , it is obvious that
and thus, Eq. (5) degenerates to the NTK kernel
We provide another proof that originates from Eq. (4) with in Appendix C.
where denotes the differential operation with respect to . Thus, for any , it is easy to prove that
Here, we consider that the correlation coefficient is negatively proportional to since the variable correlation becomes smaller as the stamp gap increases. Generally, we employ
Thus, we can obtain
in which we omit the constant multiplier.
Considering the mild assumption of , as mentioned above, we can further simplify these conclusions from
This completes the proof.
Appendix C For the case of
For the case of , we can update from
Here, we omit the learning rate for simplicity. For convenience, we abbreviate as . It is observed that
It is also observed that converges as and . Thus, the variable sequence is bounded. Here, we define
Let denote the probability density function of . Thus, we have
with
where , , and indicates the Dirac-delta function. Thus, it is feasible to conjecture that obeys the Gaussian distribution with zero mean. We define .
Thus, the second moment in becomes
where for and . Based on the above equations, we can obtain the concerned kernel
which coincides with the theory of NTK and our proposed unified kernel.
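For intuition about the NTK recovered here, the finite-width (empirical) NTK can be computed as the Gram matrix of parameter gradients. The sketch below does this for a hypothetical two-layer tanh network $f(x) = a^\top \tanh(Wx)/\sqrt{m}$; the architecture, the $1/\sqrt{m}$ scaling, and the function names are assumptions made for illustration and are not the setting analyzed in the manuscript.

```python
import numpy as np

def param_gradient(W, a, x):
    """Gradient of f(x) = a . tanh(W x) / sqrt(m) w.r.t. (W, a), flattened."""
    m = a.shape[0]
    h = np.tanh(W @ x)
    grad_a = h / np.sqrt(m)                                  # df/da
    grad_W = np.outer(a * (1.0 - h ** 2), x) / np.sqrt(m)    # df/dW via chain rule
    return np.concatenate([grad_W.ravel(), grad_a])

def empirical_ntk(W, a, X):
    """Gram matrix of parameter gradients over the rows of X."""
    J = np.stack([param_gradient(W, a, x) for x in X])  # (num_samples, num_params)
    return J @ J.T
```

By construction the resulting matrix is symmetric and positive semi-definite; as the width `m` grows, it is expected to concentrate around its infinite-width limit.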
Appendix D Uniform Tightness of
Lemma 4.4 can be straightforwardly derived from the Kolmogorov Continuity Theorem [25], provided the Polish space .
This proof follows mathematical induction. Before that, we show the following preliminary result. Let be one element of the augmented matrix at the -th layer, then we can formulate its characteristic function as
where denotes the imaginary unit with . Thus, the variance of hidden random variables at the -th layer becomes
(16)
Next, we provide two useful definitions from [28].
Definition D.8
A function is said to be well-posed, if is first-order differentiable, and its derivative is bounded by a certain constant . In particular, the commonly used activation functions like ReLU, tanh, and sigmoid are well-posed (see Table 2).
Table 2: Well-posedness of the commonly-used activation functions.
Activations
Well-Posedness
ReLU
sigmoid
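The bounded-derivative condition in Definition D.8 can be checked numerically. The sketch below estimates the supremum of the absolute derivative for ReLU, tanh, and sigmoid by finite differences; the helper name `sup_abs_derivative` and the grid settings are choices made here. For ReLU the derivative does not exist at the origin, so the estimate refers to the slope wherever it is defined.

```python
import numpy as np

def sup_abs_derivative(phi, lo=-20.0, hi=20.0, num=400001):
    """Estimate sup |phi'(x)| on [lo, hi] by central finite differences."""
    x = np.linspace(lo, hi, num)
    step = x[1] - x[0]
    slopes = (phi(x[2:]) - phi(x[:-2])) / (2.0 * step)
    return float(np.max(np.abs(slopes)))

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Estimated derivative bounds for the three activations named in Definition D.8.
bounds = {name: sup_abs_derivative(f)
          for name, f in [("ReLU", relu), ("tanh", np.tanh), ("sigmoid", sigmoid)]}
```

The estimates are close to 1 for ReLU and tanh and close to 1/4 for sigmoid, consistent with the known derivative bounds of these functions.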
Definition D.9
A matrix is said to be stable-pertinent for a well-posed activation function , in short , if the inequality holds.
Since the activation is a well-posed function and , we affirm that is Lipschitz continuous (with Lipschitz constant ). Now, we start the mathematical induction. When , for any , we have
The above induction holds for any positive even . Let , then this lemma is proved as desired.
Appendix E Tight Bound for Convergence
We begin this proof with the following lemmas.
Lemma E.10
Let be a Lipschitz continuous function with constant and denote the Gaussian distribution , then for , there exists , s.t.
(18)
Lemma E.10 shows that the Gaussian distribution corresponding to our samples satisfies the log-Sobolev inequality, i.e., Eq. (18), with some constants unrelated to dimension . This result also holds for the uniform distributions on the sphere or unit hypercube [18].
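A concentration statement of this type can be illustrated by Monte Carlo simulation. The sketch below uses the classical bound $2\exp(-\epsilon^2/(2L^2))$ for an $L$-Lipschitz function of a standard Gaussian vector, which is an assumed stand-in for Eq. (18); the exact constants in the lemma may differ. The function names and the choice of the Euclidean norm as the test function are made here for illustration.

```python
import numpy as np

def empirical_tail(f, dim, eps, num_samples=20000, rng=None):
    """Monte Carlo estimate of P(|f(x) - E f(x)| > eps) for x ~ N(0, I_dim)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(size=(num_samples, dim))
    values = f(x)
    # The sample mean stands in for the expectation E f(x).
    return float(np.mean(np.abs(values - values.mean()) > eps))

def gaussian_lipschitz_bound(eps, lipschitz=1.0):
    """Classical bound 2 exp(-eps^2 / (2 L^2)) for an L-Lipschitz function."""
    return 2.0 * np.exp(-eps ** 2 / (2.0 * lipschitz ** 2))

# The Euclidean norm is 1-Lipschitz, so it is a convenient test function.
euclidean_norm = lambda x: np.linalg.norm(x, axis=1)
```

Comparing `empirical_tail(euclidean_norm, dim, eps)` with `gaussian_lipschitz_bound(eps)` for several values of `dim` shows that the bound does not degrade as the dimension grows, which is the dimension-free property emphasized above.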
Lemma E.11
Suppose that are i.i.d. sampled from , then with probability , we have
for , where
From Definition 1 of the manuscript, we have
Since are i.i.d. sampled from , for , we have with probability at least . Provided , the single-sided inner product is Lipschitz continuous with the constant . As such, from Lemma E.10, for , we have
We start this proof with some notations. For convenience, we force , or equally, . We also abbreviate the covariance as throughout this proof.
Unfolding the kernel equation that omits the epoch stamp
(19)
we have
(20)
where
in which the subscript indicates the -th element of vector . From Theorem 1 of the manuscript, the sequence of random variables is weakly dependent with as . Thus, is an infinitesimal with respect to when .
Iterating Eq. (22) and then substituting it into Eq. (21), we have
(23)
From the Hermite expansion [29] of the ReLU function, we have
(24)
where indicates the expansion order. Thus, we have
(25)
where the superscript denotes the -th Khatri-Rao power of the matrix , the first inequality follows from Eq. (24), the second one follows from the Gershgorin Circle Theorem [24], and the third one follows from Lemma E.11. Therefore, we can obtain the lower bound of the smallest eigenvalue by plugging Eq. (25) into Eq. (23).
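The Hermite coefficients of ReLU that enter an expansion such as Eq. (24) can be evaluated numerically. The sketch below assumes the orthonormal probabilists' Hermite basis $h_r = \mathrm{He}_r/\sqrt{r!}$ and coefficients $\mu_r = \mathbb{E}_{z\sim\mathcal{N}(0,1)}[\mathrm{ReLU}(z)\,h_r(z)]$; this normalization and the function name `relu_hermite_coefficient` are assumptions made here and may differ from the convention in the manuscript.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def relu_hermite_coefficient(r, upper=12.0, num=240001):
    """r-th coefficient of ReLU in the orthonormal Hermite basis h_r = He_r / sqrt(r!).

    mu_r = E[relu(z) h_r(z)] for z ~ N(0, 1); the integrand vanishes for z < 0,
    so a trapezoidal rule on [0, upper] suffices.
    """
    z = np.linspace(0.0, upper, num)
    basis = hermeval(z, [0.0] * r + [1.0]) / math.sqrt(math.factorial(r))
    integrand = z * basis * np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    dz = z[1] - z[0]
    return float(dz * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))

# Coefficients up to order 8; their squares sum towards E[relu(z)^2] = 1/2.
coeffs = [relu_hermite_coefficient(r) for r in range(9)]
```

Under this normalization, the squared coefficients sum to $\mathbb{E}[\mathrm{ReLU}(z)^2] = 1/2$ as the order grows, so the partial sums indicate how much of the activation is captured at a given expansion order.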
On the other hand, it is observed from Lemma 4.4 that for ,
(26)
Thus, we have
where the second inequality follows from Eq. (20), the third one follows from Eq. (26), and the fourth one follows from Lemma E.11. This completes the proof.
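The Gershgorin step used in the lower-bound argument can be illustrated on a small symmetric matrix. The sketch below computes the bound $\lambda_{\min}(A) \ge \min_i \big(A_{ii} - \sum_{j \neq i} |A_{ij}|\big)$; the function name and the example matrix are hypothetical and serve only to show how a dominant diagonal with small off-diagonal entries keeps the smallest eigenvalue away from zero.

```python
import numpy as np

def gershgorin_lower_bound(A):
    """Lower bound on the smallest eigenvalue of a symmetric matrix A.

    Every eigenvalue lies in some disc centred at A_ii with radius
    sum_{j != i} |A_ij|, so lambda_min >= min_i (A_ii - radius_i).
    """
    A = np.asarray(A, dtype=float)
    radii = np.abs(A).sum(axis=1) - np.abs(np.diag(A))
    return float(np.min(np.diag(A) - radii))

# Hypothetical kernel-like matrix: large diagonal, small off-diagonal entries.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 5.0]])
bound = gershgorin_lower_bound(A)
```

The bound is informative only when the off-diagonal row sums are smaller than the diagonal entries, which is why the concentration of inner products from Lemma E.11 is needed in the proof.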
Appendix F Supplementary Experimental Results
This section provides the detailed experimental results.
Table 3 lists the optimal trajectory and the corresponding training and testing accuracy of Grid 0.001 and Grid 0.01 over the epochs. Figure 3 draws the training correlation histograms and the averages for our proposed UNK kernel with the grid search granularity of . Figure 4 draws the testing correlation histograms and the averages for our proposed UNK kernel with the grid search granularity of .
| Epoch | Baseline Testing ACC. | Baseline Training ACC. | Grid 0.001 Optimum | Grid 0.001 Testing ACC. | Grid 0.001 Training ACC. | Grid 0.01 Optimum | Grid 0.01 Testing ACC. | Grid 0.01 Training ACC. |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.1325 | 0.1289 | 0.0100 | 0.9287 | 0.9257 | 0.0800 | 0.9291 | 0.9266 |
| 2 | 0.9284 | 0.9256 | 0.0020 | 0.9515 | 0.9506 | 0.0800 | 0.9527 | 0.9521 |
| 3 | 0.9514 | 0.9504 | 0.0040 | 0.9607 | 0.9631 | 0.0900 | 0.9631 | 0.9656 |
| 4 | 0.9603 | 0.9629 | 0.0080 | 0.9665 | 0.9708 | 0.0700 | 0.9693 | 0.9737 |
| 5 | 0.9658 | 0.9705 | 0.0070 | 0.9709 | 0.9766 | 0.0900 | 0.9729 | 0.9793 |
| 6 | 0.9705 | 0.9763 | 0.0050 | 0.9738 | 0.9802 | 0.1000 | 0.9757 | 0.9839 |
| 7 | 0.9733 | 0.9800 | 0.0060 | 0.9756 | 0.9834 | 0.1000 | 0.9785 | 0.9870 |
| 8 | 0.9753 | 0.9831 | 0.0000 | 0.9772 | 0.9858 | 0.0800 | 0.9795 | 0.9899 |
| 9 | 0.9769 | 0.9855 | 0.0080 | 0.9789 | 0.9879 | 0.0500 | 0.9805 | 0.9922 |
| 10 | 0.9788 | 0.9875 | 0.0000 | 0.9798 | 0.9898 | 0.0900 | 0.9818 | 0.9939 |
| 11 | 0.9800 | 0.9896 | 0.0000 | 0.9809 | 0.9913 | 0.0600 | 0.9826 | 0.9952 |
| 12 | 0.9809 | 0.9910 | 0.0000 | 0.9814 | 0.9923 | 0.0600 | 0.9833 | 0.9963 |
| 13 | 0.9813 | 0.9922 | 0.0040 | 0.9814 | 0.9933 | 0.0700 | 0.9833 | 0.9971 |
| 14 | 0.9814 | 0.9931 | 0.0020 | 0.9815 | 0.9943 | 0.0800 | 0.9837 | 0.9977 |
| 15 | 0.9815 | 0.9941 | 0.0020 | 0.9815 | 0.9952 | 0.0500 | 0.9841 | 0.9984 |
| 16 | 0.9814 | 0.9949 | 0.0080 | 0.9819 | 0.9959 | 0.0700 | 0.9848 | 0.9987 |
| 17 | 0.9816 | 0.9957 | 0.0060 | 0.9824 | 0.9966 | 0.0900 | 0.9847 | 0.9992 |
| 18 | 0.9818 | 0.9963 | 0.0070 | 0.9827 | 0.9972 | 0.0700 | 0.9851 | 0.9995 |
| 19 | 0.9825 | 0.9969 | 0.0070 | 0.9830 | 0.9977 | 0.0000 | 0.9850 | 0.9996 |
| 20 | 0.9824 | 0.9974 | 0.0100 | 0.9833 | 0.9981 | 0.0800 | 0.9857 | 0.9998 |
| 21 | 0.9831 | 0.9978 | 0.0070 | 0.9834 | 0.9984 | 0.0100 | 0.9847 | 0.9997 |
| 22 | 0.9830 | 0.9982 | 0.0100 | 0.9838 | 0.9986 | 0.0200 | 0.9850 | 0.9999 |
| 23 | 0.9831 | 0.9984 | 0.0050 | 0.9835 | 0.9987 | 0.0000 | 0.9847 | 0.9999 |
| 24 | 0.9834 | 0.9986 | 0.0000 | 0.9836 | 0.9989 | 0.0000 | 0.9843 | 0.9999 |
| 25 | 0.9835 | 0.9988 | 0.0050 | 0.9830 | 0.9990 | 0.0000 | 0.9848 | 0.9999 |
| 26 | 0.9837 | 0.9989 | 0.0030 | 0.9838 | 0.9992 | 0.0000 | 0.9845 | 1.0000 |
| 27 | 0.9834 | 0.9990 | 0.0000 | 0.9834 | 0.9992 | 0.0000 | 0.9852 | 1.0000 |
| 28 | 0.9833 | 0.9991 | 0.0000 | 0.9839 | 0.9994 | 0.0000 | 0.9848 | 1.0000 |
| 29 | 0.9834 | 0.9993 | 0.0000 | 0.9834 | 0.9994 | 0.0000 | 0.9848 | 1.0000 |
| 30 | 0.9836 | 0.9993 | 0.0020 | 0.9838 | 0.9995 | 0.0000 | 0.9850 | 1.0000 |
Table 3: Optimal trajectory values and the corresponding testing and training accuracy (ACC.) of Grid 0.001 and Grid 0.01 over the epochs.