A Unified Kernel for Neural Network Learning

Shao-Qun Zhang^a,b,1, Zong-Yi Chen^b, Yong-Ming Tian^b, Xun Lu^b

^a National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
^b School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
^1 Shao-Qun Zhang is the corresponding author. Email: [email protected]. Other authors made equal contributions.

(May 2, 2024)

Abstract

Past decades have witnessed great interest in the distinction and connection between neural network learning and kernel learning. Recent advances have made theoretical progress in connecting infinitely wide neural networks and Gaussian processes. Two predominant approaches have emerged: the Neural Network Gaussian Process (NNGP) and the Neural Tangent Kernel (NTK). The former, rooted in Bayesian inference, represents a zero-order kernel, while the latter, grounded in the tangent space of gradient descent, is a first-order kernel. In this paper, we present the Unified Neural Kernel (UNK), which characterizes the learning dynamics of neural networks with gradient descent and parameter initialization. The proposed UNK kernel maintains the limiting properties of both NNGP and NTK, exhibiting behaviors akin to NTK with a finite learning step and converging to NNGP as the learning step approaches infinity. We also theoretically characterize the uniform tightness and learning convergence of the UNK kernel, providing comprehensive insights into this unified kernel. Experimental results underscore the effectiveness of our proposed method.

Keywords: Neural Network Learning; Unified Neural Kernel; Neural Network Gaussian Process; Neural Tangent Kernel; Gradient Descent; Uniform Tightness; Convergence; Optimal Trajectory

1 Introduction

While neural network learning is successful in a number of applications, it is not yet well understood theoretically (poggio2020theoretical). Recently, there has been an increasing amount of literature exploring the correspondence between infinitely wide neural networks and Gaussian processes (neal1996:GP). Researchers have identified equivalence between the two in various architectures (garriga2019:GP; novak2018:GP; yang2019:GP). This equivalence facilitates precise approximations of the behavior of infinitely wide Bayesian neural networks without resorting to variational inference. Correspondingly, it also allows for the characterization of the distribution of randomly initialized neural networks optimized by gradient descent, eliminating the need to actually run an optimizer for such analyses.

The standard investigation in this field encompasses the Neural Network Gaussian Process (NNGP) (lee2018:NNGP), which establishes that a neural network statistically converges to a Gaussian process as its width approaches infinity. The NNGP kernel inherently induces a posterior distribution that aligns with the feed-forward inference of infinitely wide Bayesian neural networks employing an i.i.d. Gaussian prior. Another typical work is the Neural Tangent Kernel (NTK) (jacot2018:NTK), where the function of a neural network trained through gradient descent converges to the kernel gradient of the functional cost as the width of the neural network tends to infinity. The NTK kernel captures the learning dynamics wherein learned parameters are closely tied to their initialization, resembling an i.i.d. Gaussian prior. These two kernels, derived from neural networks, exhibit distinct characteristics based on different initializations and regularization. A notable contrast is that the NNGP, rooted in Bayesian inference, represents a zero-order kernel that is more suitable for describing the overall characteristics of neural network learning. In contrast, the NTK, rooted in the tangent space of gradient descent, is a first-order kernel that is adept at capturing local characteristics of neural network learning. Empirical evidence provided by Lee et al. (lee2020finite) demonstrates the divergent generalization performances of these two kernels across various datasets.

In this paper, we unify the NNGP and NTK kernels and present the Unified Neural Kernel (UNK) as a cohesive framework for neural network learning. By leveraging the learning dynamics associated with gradient descent and parameter initialization, we provide theoretical characterizations, including but not limited to the existence, limiting properties, uniform tightness, and learning convergence of the proposed UNK kernel. Our theoretical investigations reveal that the UNK kernel exhibits behaviors reminiscent of the NTK kernel with a finite learning step and converges to the NNGP kernel as the learning step approaches infinity. This contribution not only expands the scope of the existing theory connecting kernel learning and neural network learning, but also represents a step toward unraveling the intricacies of deep learning.

Our main contributions can be summarized as follows:

• We propose the UNK kernel, built upon the learning dynamics associated with gradient descent and parameter initialization, which unifies the limiting properties of both the NTK and NNGP kernels.

• We theoretically investigate the asymptotic behaviors of the proposed UNK kernel, in which the UNK kernel is uniformly tight on the space of continuous functions and maintains a tight bound for the smallest eigenvalue.

• We conduct experiments on benchmark datasets using various configurations. The numerical results further underscore the effectiveness of our proposed method.

The rest of this paper is organized as follows. Section 2 introduces useful notations, terminologies, and related studies. Section 3 presents the UNK kernel with in-depth discussions and proof sketches. Section 4 shows the uniform tightness and convergence of the UNK kernel. Section 5 conducts numerical experiments. Section 6 concludes our work.

2 Preliminary

This section will introduce useful notations, terminologies, and related studies.

2.1 Notations

Let $[N]=\{1,2,\dots,N\}$ be an integer set for $N\in\mathbb{N}^{+}$, and let $|\cdot|_{\#}$ denote the number of elements in a collection, e.g., $|[N]|_{\#}=N$. Given two functions $g,h\colon\mathbb{N}^{+}\rightarrow\mathbb{R}$, we write $h=\mathbf{\Theta}(g)$ if there exist positive constants $c_{1},c_{2}$, and $n_{0}$ such that $c_{1}g(n)\leq h(n)\leq c_{2}g(n)$ for every $n\geq n_{0}$; $h=\mathcal{O}(g)$ if there exist positive constants $c$ and $n_{0}$ such that $h(n)\leq cg(n)$ for every $n\geq n_{0}$; and $h=\Omega(g)$ if there exist positive constants $c$ and $n_{0}$ such that $h(n)\geq cg(n)$ for every $n\geq n_{0}$. We define the ball $\mathcal{B}(r)=\{\bm{x}\mid\|\bm{x}\|_{2}\leq r\}$ for any $r\in\mathbb{R}^{+}$. Let $\mathbf{I}_{n}$ be the $n\times n$ identity matrix. Let $\|\cdot\|_{p}$ be the $p$-norm of a vector or matrix, with $p=2$ as the default. Given $\bm{x}=(x_{1},\dots,x_{n})$ and $\bm{y}=(y_{1},\dots,y_{n})$, we also define the sup-related measure as $\|\bm{x}-\bm{y}\|_{\alpha}^{\textrm{sup}}=\sup_{i\in[n]}\big|x_{i}-y_{i}\big|^{\alpha}$ for $\alpha>0$.

Let $\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})$ be the space of continuous functions where $n_{0},n\in\mathbb{N}$. Given a linear and bounded functional $\mathcal{F}:\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})\to\mathbb{R}$ and a function $f\in\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})$ that satisfies $f(\bm{x})\overset{\mathrm{d}}{\to}f^{*}$, we have $\mathcal{F}(f(\bm{x}))\overset{\mathrm{d}}{\to}\mathcal{F}(f^{*})$ and $\mathbb{E}\left[\mathcal{F}(f(\bm{x}))\right]\to\mathbb{E}\left[\mathcal{F}(f^{*})\right]$ according to the General Transformation Theorem (van2000asymptotic, Theorem 2.3) and Uniform Integrability (billingsley2013convergence), respectively.

Throughout this paper, we use the specific symbol $K$ to denote the concerned kernel for neural network learning. The superscript $(l)$ and stamp $t$ are used for recording the indexes of hidden layers and training epochs, respectively. We denote the Gaussian distribution by $\mathcal{N}(\mu_{x},\sigma_{x}^{2})$ , where $\mu_{x}$ and $\sigma_{x}^{2}$ indicate the mean and variance, respectively. In general, we employ $\mathbb{E}(\cdot)$ and $\mathrm{Var}(\cdot)$ to denote the expectation and variance, respectively.

2.2 NNGP and NTK

We start this work with an $L$-hidden-layer fully-connected neural network, where $n_{l}$ and $n_{0}$ indicate the number of neurons in the $l$-th hidden layer for $l\in[L]$ and in the input layer, respectively, as follows

\left\{\begin{aligned} \bm{s}^{(0)}&=\bm{x}\ ,\\ \bm{h}^{(l)}&=\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\ ,\quad l\in[L]\ ,\\ \bm{s}^{(l)}&=\phi(\bm{h}^{(l)})\ ,\quad l\in[L]\ ,\\ \bm{y}&=\bm{s}^{(L)}\ ,\end{aligned}\right.

(1)

in which $\bm{x}\in\mathbb{R}^{n_{0}}$ and $\bm{y}\in\mathbb{R}^{n_{L}}$ indicate the input and output variables, respectively, $\bm{h}^{(l)}\in\mathbb{R}^{n_{l}}$ and $\bm{s}^{(l)}\in\mathbb{R}^{n_{l}}$ denote the pre-synaptic and post-synaptic variables of the $l$-th hidden layer, respectively, $\mathbf{W}^{(l)}\in\mathbb{R}^{n_{l}\times n_{l-1}}$ and $\bm{b}^{(l)}\in\mathbb{R}^{n_{l}}$ are the parameter variables of connection weights and bias, respectively, and $\phi$ is an element-wise activation function. For convenience, we denote the parameter variables at the $t$-th epoch by $\Theta^{(l)}_{t}=[\mathbf{W}^{(l)},\bm{b}^{(l)}]$, and $\Theta^{(l)}_{0}$ denotes the initialized parameters, whose values obey the Gaussian distribution $\mathcal{N}(0,\sigma^{2}/n_{l})$.
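As an illustration, the forward pass in Eq. (1) can be sketched in a few lines of plain Python. The helper names `init_layer` and `forward`, the layer widths, the input vector, and the choice of ReLU for $\phi$ are assumptions made for this sketch only; the entries of the weights and biases are drawn from $\mathcal{N}(0,\sigma^{2}/n_{l})$ as stated above.

```python
import math
import random


def relu(v):
    """Element-wise ReLU, used here as an example of the activation phi."""
    return [max(0.0, a) for a in v]


def init_layer(n_out, n_in, sigma, rng):
    """Draw W (n_out x n_in) and b (n_out) with entries from N(0, sigma^2 / n_out)."""
    std = sigma / math.sqrt(n_out)
    W = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.gauss(0.0, std) for _ in range(n_out)]
    return W, b


def forward(x, layers, phi=relu):
    """Eq. (1): s^(0) = x, h^(l) = W^(l) s^(l-1) + b^(l), s^(l) = phi(h^(l)), y = s^(L)."""
    s = list(x)
    for W, b in layers:
        h = [sum(w * a for w, a in zip(row, s)) + b_i for row, b_i in zip(W, b)]
        s = phi(h)
    return s


rng = random.Random(0)
widths = [4, 16, 16, 3]  # hypothetical n_0, n_1, n_2, n_3 (so L = 3)
layers = [init_layer(widths[l], widths[l - 1], 1.0, rng) for l in range(1, len(widths))]
y = forward([0.5, -1.0, 2.0, 0.1], layers)
print(y)
```

The output has length $n_{L}$ and, because the last layer also applies ReLU in Eq. (1), all of its entries are non-negative.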

Neural Network Gaussian Process (NNGP). For any $l\in[L]$, the conditional variable $\bm{h}^{(l)}\mid\bm{s}^{(l-1)}$ obeys the Gaussian distribution. In detail, one has $\textrm{Var}(\bm{h}^{(l)}\mid\bm{s}^{(l-1)})=\textrm{Var}(\mathbf{W}^{(l)})\mathbb{E}(\bm{s}^{(l-1)})^{2}+\textrm{Var}(\bm{b}^{(l)})$, where the square $(\cdot)^{2}$ and the product are taken in the dot-product sense, and this equality holds according to $\mathbb{E}(\mathbf{W}^{(l)})=\mathbf{0}$, $\mathbb{E}(\bm{b}^{(l)})=\bm{0}$, and the mutual independence of the elements of $\mathbf{W}^{(l)}$ and $\bm{b}^{(l)}$. It is reasonable to conjecture that $\bm{s}^{(l-1)}\sim\mathcal{N}(\bm{0},\mathbf{I}_{n_{l-1}}/C_{\phi})$ according to the principle of mathematical induction and $\bm{x}\sim\mathcal{N}(\bm{0},\mathbf{I}_{n_{0}})$, where $C_{\phi}={1}/{\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left(\phi(z)\right)^{2}}$. Hence, one has

\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}}\left(\frac{1}{C_{\phi}}+1\right)\mathbf{I}_{n_{l}}\right)\ .

Moreover, the NNGP kernel is defined by

K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\sigma^{2}\,\mathbb{E}\left\langle\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right\rangle+\sigma^{2}

with

\lim\limits_{n_{l-1}\to\infty}\mathbb{E}\left\langle\bm{h}^{(l)}\mid\bm{s}^{(l-1)},\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\right\rangle=\sigma^{2}\left(\frac{1}{C_{\phi}}+1\right)\ .
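To give a concrete feel for the constant $C_{\phi}$, the sketch below evaluates $\mathbb{E}_{z\sim\mathcal{N}(0,1)}(\phi(z))^{2}$ by a simple midpoint rule and then forms the limiting value $\sigma^{2}(1/C_{\phi}+1)$. The choice of ReLU, the value $\sigma^{2}=1$, the integration range, and the helper name `gaussian_expectation` are assumptions for illustration. For ReLU the second moment is $1/2$ analytically, so $C_{\phi}$ should come out close to 2.

```python
import math


def gaussian_expectation(g, lo=-10.0, hi=10.0, steps=20_000):
    """Midpoint-rule approximation of E[g(z)] for z ~ N(0, 1)."""
    dz = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * dz
        total += g(z) * math.exp(-0.5 * z * z)
    return total * dz / math.sqrt(2.0 * math.pi)


def relu(z):
    return max(0.0, z)


second_moment = gaussian_expectation(lambda z: relu(z) ** 2)  # E[phi(z)^2]
C_phi = 1.0 / second_moment

sigma2 = 1.0  # hypothetical value of sigma^2
limit_variance = sigma2 * (1.0 / C_phi + 1.0)  # sigma^2 (1 / C_phi + 1)
print(C_phi, limit_variance)
```

Other Lipschitz activations can be plugged in by replacing `relu`; only the numerical value of $C_{\phi}$ changes.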

Neural Tangent Kernel (NTK). The training of the concerned neural networks consists in optimizing $\bm{y}=f(\bm{x};\Theta)$ in the function space, supervised by a functional loss $\hbar(\Theta)$, such as the square or cross-entropy functions, where we employ $\Theta$ to denote the variable of any parameter:

\frac{\mathop{}\!\mathrm{d}\Theta}{\mathop{}\!\mathrm{d}t}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta)}{\mathop{}\!\mathrm{d}\Theta}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta)}{\mathop{}\!\mathrm{d}f(\bm{x};\Theta)}\frac{\mathop{}\!\mathrm{d}f(\bm{x};\Theta)}{\mathop{}\!\mathrm{d}\Theta}\ .

For any $l\geq 2$, there is a claim that the gradient variable vector $\bm{h}^{(l)}\mid\bm{s}^{(l-1)}$ obeys the Gaussian distribution. Taking $\mathbf{W}^{(l-1)}$ as an example, one has $\textrm{Var}({\partial\bm{h}^{(l)}}/{\partial\mathbf{W}_{ij}^{(l-1)}})=\textrm{Var}(\mathbf{W}^{(l)})\mathbb{E}({\partial\bm{s}^{(l-1)}}/{\partial\bm{h}^{(l-1)}})^{2}\textrm{Var}(\bm{s}^{(l-2)})$ for $i,j\in\mathbb{N}^{+}$, where ${\partial\bm{s}^{(l-1)}}/{\partial\bm{h}^{(l-1)}}$ adopts the dot operation. Hence, one has

\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}C^{\prime}_{\phi}C_{\phi}}\mathbf{I}_{n_{l-1}}\right)\ ,

where $C^{\prime}_{\phi}={1}/{\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left(\phi^{\prime}(z)\right)^{2}}$. Moreover, the NTK kernel is defined by

\left\{\begin{aligned} K_{\textrm{NTK}}^{(1)}\left(\bm{x},\bm{x}^{\prime}\right)&=K_{\textrm{NNGP}}^{(1)}\left(\bm{x},\bm{x}^{\prime}\right)\ ,\quad\text{for}\quad l=1\ ,\\ K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)&=K_{\textrm{NTK}}^{(l-1)}\left(\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\right\rangle\\ &\quad+K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,\quad\text{for}\quad l\geq 2\ ,\end{aligned}\right.

with

\left\{\begin{aligned} &\lim\limits_{n_{l-1}\to\infty}\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}},\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right\rangle=\frac{\sigma^{2}}{C^{\prime}_{\phi}C_{\phi}}\ ,\\ &\lim\limits_{n_{l-1}\to\infty}\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}},\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}}\right\rangle=\frac{\sigma^{2}}{C^{\prime}_{\phi}}\ .\end{aligned}\right.
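The layer-wise NTK recursion above can be traced on scalar kernel entries for a single input pair. In the sketch below, the function name `ntk_recursion` and the numerical inputs are hypothetical placeholders rather than values from this paper: `nngp[i]` stands for $K_{\textrm{NNGP}}^{(i+1)}$ and `deriv[i]` for the expectation $\mathbb{E}\langle\partial\bm{s}^{(i+1)}/\partial\bm{h}^{(i+1)},\partial\bm{s}^{\prime(i+1)}/\partial\bm{h}^{\prime(i+1)}\rangle$.

```python
def ntk_recursion(nngp, deriv):
    """Return [K_NTK^(1), ..., K_NTK^(L)] for one input pair.

    nngp  : list of L scalars, nngp[i] = K_NNGP^(i+1)
    deriv : list of L-1 scalars, deriv[i] = E<ds^(i+1)/dh^(i+1), ds'^(i+1)/dh'^(i+1)>
    """
    ntk = [nngp[0]]  # K_NTK^(1) = K_NNGP^(1)
    for i in range(1, len(nngp)):
        # K_NTK^(l) = K_NTK^(l-1) * E<ds^(l-1)/dh^(l-1), ...> + K_NNGP^(l)
        ntk.append(ntk[-1] * deriv[i - 1] + nngp[i])
    return ntk


# Hypothetical scalar inputs for a three-layer example.
nngp = [1.0, 1.5, 1.5]
deriv = [0.5, 0.5]
ntk = ntk_recursion(nngp, deriv)
print(ntk)
```

With these placeholder inputs the recursion gives $1.0$, then $1.0\cdot 0.5+1.5=2.0$, then $2.0\cdot 0.5+1.5=2.5$.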

2.3 Related Studies

Past decades have witnessed a growing interest in the correspondence between neural network learning and Gaussian processes. Neal (neal1996:GP) presented the seminal work by showing that a one-hidden-layer network of infinite width turns into a Gaussian process. Cho et al. (cho2009:GP) linked multi-layer networks using rectified polynomial activation with compositional Gaussian kernels. Lee et al. (lee2018:NNGP) showed that infinitely wide fully connected neural networks with commonly used activation functions can converge to Gaussian processes. Recently, the NNGP has been scaled to many types of networks, including Bayesian networks (novak2018:GP), deep networks with convolution (garriga2019:GP), and recurrent networks (yang2019:GP).

NNGPs can provide a quantitative characterization of how likely certain outcomes are if some aspects of the system are not exactly known. In the experiments of (lee2018:NNGP), an explicit estimate in the form of variance prediction is given to each test sample. Besides, Pang et al. (pang2019:NNGP) showed that the NNGP is good at handling data with noise and is superior to discretizing differential operators in solving some linear or nonlinear partial differential equations. Park et al. (park2020:NNGP) employed the NNGP kernel in the performance measurement of network architectures for the purpose of speeding up the neural architecture search. Pleiss et al. (pleiss2022:NNGP) leveraged the effects of width on the capacity of neural networks by decoupling the generalization and width of the corresponding NNGP. Despite great progress, numerous studies about NNGP still rely on increasing width to induce the Gaussian processes. Recently, Zhang et al. (zhang2022:NNGP) proposed a depth paradigm that achieves an NNGP by increasing depth, providing complementary support for the existing theory of NNGP.

The NTK kernel, first proposed by Jacot et al. (jacot2018:NTK), relates a neural network trained by randomly initialized gradient descent with a Gaussian distribution. It has been proved that many types of networks, including graph neural networks on bioinformatics datasets (du2019:GNTK) and convolutional neural networks (arora2019:NTK) on medium-scale datasets such as the UCI database, can derive a corresponding kernel function. Some researchers applied NTK to various fields, such as federated learning (huang2021:NTK), mean-field analysis (mahankali2023:NTK), and natural language processing (malladi2023:NTK). Recently, Hron et al. (hron2020:attention) extended the NNGP and NTK from neural networks to multi-head attention architectures as the number of heads tends to infinity. Avidan et al. (avidan2023:connecting) provided a unified theoretical framework that connects NTK and NNGP using the Markov proximal learning model.

3 The Unified Kernel

This work considers a general form of supervised learning

\min_{\Theta}\quad\hbar(\Theta)+\lambda\mathcal{R}(\Theta)

(2)

where $\mathcal{R}(\Theta)$ is a regularizer and $\lambda$ is the corresponding multiplier. Based on gradient descent, Eq. (2) generally leads to a dynamical system with respect to parameter $\Theta$

\frac{\mathop{}\!\mathrm{d}\Theta}{\mathop{}\!\mathrm{d}t}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta)}{\mathop{}\!\mathrm{d}\Theta}-\lambda\frac{\mathop{}\!\mathrm{d}\mathcal{R}(\Theta)}{\mathop{}\!\mathrm{d}\Theta}\ ,

(3)

where we omit the learning rate for simplicity. From Eq. (3), the value of $\lambda$ can be regarded as a balance between the gradient and the regularizer. In the next subsections, we will employ the initialized and epoch-related parameters to implement ${\mathop{}\!\mathrm{d}\mathcal{R}(\Theta)}/{\mathop{}\!\mathrm{d}\Theta}$, where both regularization implementations induce the UNK kernel. Furthermore, Subsection 5.2 provides in-depth discussions about the effect of $\lambda$ on the performance of the UNK kernel.

3.1 Initialization Parameter $\Theta_{0}$

In this work, we first consider leveraging the effects of initialized parameters (for example, by employing the square regularizer in Eq. (3)), and thus Eq. (3) becomes

\frac{\mathop{}\!\mathrm{d}\Theta}{\mathop{}\!\mathrm{d}t}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta)}{\mathop{}\!\mathrm{d}\Theta}\Big|_{t}-\lambda\Theta_{0}\ ,

(4)

where $\Theta_{0}$ is the initialized parameter and $\lambda\in\mathbb{R}$ takes a tradeoff between parameter gradient and initialization.
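The dynamics in Eq. (4) can be sketched with an explicit Euler discretization on a scalar parameter. The quadratic loss $\hbar(\theta)=\tfrac{1}{2}(\theta-a)^{2}$, the step size `step`, and the function name `train` are assumptions chosen for illustration; the text above omits the learning rate. For this toy loss, the stationary point of Eq. (4) is $\theta^{*}=a-\lambda\theta_{0}$, so a non-zero $\lambda$ shifts the solution by an amount that depends on the initialization.

```python
def train(theta0, grad, lam, step=0.01, num_steps=5000):
    """Euler steps of d(theta)/dt = -grad(theta) - lam * theta0, cf. Eq. (4)."""
    theta = theta0
    for _ in range(num_steps):
        theta = theta + step * (-grad(theta) - lam * theta0)
    return theta


target = 2.0  # hypothetical minimizer a of the toy loss 0.5 * (theta - a)^2


def grad(theta):
    return theta - target


theta0 = 0.5  # hypothetical initial parameter
plain = train(theta0, grad, lam=0.0)    # pure gradient flow
shifted = train(theta0, grad, lam=0.3)  # gradient flow with the -lam * theta0 term
print(plain, shifted)
```

In this toy setting `plain` approaches $a=2$, while `shifted` approaches $a-\lambda\theta_{0}=1.85$.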

Now, we present our main conclusion as follows.

Theorem 1

For a network of depth $L$ with a Lipschitz activation $\phi$ and in the limit of the layer width $n_{1},\dots,n_{L-1}\to\infty$ , Eq. (4) induces a kernel with the following form, for $l\in[L]$ and $t\geq 0$ ,

K_{\textrm{UNK}}^{(l)}\left(t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\exp\left(\frac{-t\,|\lambda|}{\sqrt{1-\rho_{t}^{2}}\sigma_{0}\sigma_{t}}\right)\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}}{\partial\Theta_{t}}\right\rangle\ ,

(5)

where $\sigma_{0}^{2}$ and $\sigma_{t}^{2}$ denote the variances of the variables at training epochs 0 and $t$, respectively, and $\rho_{t}$ denotes the correlation coefficient between them. Furthermore, $K_{\textrm{UNK}}(t,\cdot,\cdot)$ has the following properties of limiting kernels

(i)

For the case of $\lambda=0$ or $t=0$, the unified kernel degenerates to the NTK kernel. Formally, for $l\in[L]$, the following hold

\begin{aligned} K_{\textrm{UNK}}^{(l)}\left(t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)};\lambda=0\right)&=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,\\ K_{\textrm{UNK}}^{(l)}\left(t=0,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)&=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .\end{aligned}

(ii)

For the case of $\lambda\neq 0$ and $t\to\infty$, the unified kernel converges to the NNGP kernel, i.e., the following holds for $l\in[L]$ as $t\to\infty$

K_{\textrm{UNK}}^{(l)}\left(t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\to K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .

Theorem 1 presents the existence and explicit formulation of the unified kernel $K_{\textrm{UNK}}(t,\cdot,\cdot)$ that corresponds to Eq. (4) for neural network learning. For the case of $t=0$ or $\lambda=0$, the proposed kernel degenerates to the NTK kernel, where the parameter updating obeys the Gaussian distribution. Correspondingly, for the case of $t\to\infty$ and $\lambda\neq 0$, the proposed kernel can approximate the NNGP kernel well, which implies that a neural network model trained by Eq. (4) can reach an equilibrium state in a long-time regime. The proof sketch is given in Subsection 3.3, and the full proof can be found in the Appendix.

Similar to the NNGP and NTK kernels, the unified kernel is also of a recursive form, that is,

\begin{aligned} K_{\textrm{UNK}}^{(l)}\left(t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)&=K_{\textrm{UNK}}^{(l-1)}\left(t,\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\right\rangle\\ &\quad+\exp\left(\frac{-t\,|\lambda|}{\sqrt{1-\rho_{t}^{2}}\sigma_{0}\sigma_{t}}\right)K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .\end{aligned}

(6)
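The recursive form in Eq. (6) differs from the NTK recursion only through the exponential prefactor of Eq. (5). The sketch below evaluates that prefactor and runs the recursion on scalar kernel entries. The function names, the numerical inputs, and in particular the base case $K_{\textrm{UNK}}^{(1)}=\text{factor}\cdot K_{\textrm{NNGP}}^{(1)}$ are assumptions made for this sketch; the base case is our reading of Eq. (5) at $l=1$ and is not stated explicitly in the text.

```python
import math


def unk_factor(t, lam, rho_t, sigma_0, sigma_t):
    """Prefactor exp(-t |lambda| / (sqrt(1 - rho_t^2) sigma_0 sigma_t)) of Eq. (5)."""
    return math.exp(-t * abs(lam) / (math.sqrt(1.0 - rho_t ** 2) * sigma_0 * sigma_t))


def unk_recursion(nngp, deriv, factor):
    """Eq. (6) on scalar entries: nngp[i] = K_NNGP^(i+1), deriv[i] = E<ds/dh, ds'/dh'> of layer i+1."""
    unk = [factor * nngp[0]]  # assumed base case for l = 1
    for i in range(1, len(nngp)):
        unk.append(unk[-1] * deriv[i - 1] + factor * nngp[i])
    return unk


# Hypothetical scalar inputs.
nngp = [1.0, 1.5, 1.5]
deriv = [0.5, 0.5]

ntk = unk_recursion(nngp, deriv, factor=1.0)  # factor 1 reproduces the NTK recursion
at_start = unk_recursion(nngp, deriv, unk_factor(0.0, 0.3, 0.5, 1.0, 1.0))  # t = 0
no_reg = unk_recursion(nngp, deriv, unk_factor(5.0, 0.0, 0.5, 1.0, 1.0))    # lambda = 0
later = unk_recursion(nngp, deriv, unk_factor(5.0, 0.3, 0.5, 1.0, 1.0))     # t > 0, lambda != 0
print(ntk, at_start, no_reg, later)
```

For $t=0$ or $\lambda=0$ the prefactor is exactly 1, so the sketch reproduces the NTK values, in line with property (i) of Theorem 1; for $t>0$ and $\lambda\neq 0$ the prefactor lies strictly between 0 and 1.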

3.2 Epoch-related Parameter $\Theta_{t^{\prime}}$

From Eq. (6), it is observed that the unified kernel of the $l$ -th hidden layer at epoch $t$ can be computed recursively from a combination of the unified kernel of the $(l-1)$ -th hidden layer at epoch $t$ and the NNGP kernel of the $l$ -th hidden layer at epoch $t$ . Inspired by this recognition, we extend the fundamental formula in Eq. (4) as

\frac{\mathop{}\!\mathrm{d}\Theta}{\mathop{}\!\mathrm{d}t}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta)}{\mathop{}\!\mathrm{d}\Theta}\Big|_{t}-\lambda\Theta_{t^{\prime}}

(7)

given $t^{\prime}<t$. Eq. (7) takes Eq. (4) as the special case of $t^{\prime}=0$ and leads to a more general updating paradigm. For example, $\Theta_{t^{\prime}}$ may indicate a collection of pre-given parameters from pre-training or meta-learning, so that Eq. (7) becomes an optimization computation for fine-tuning. Further, the derived kernel may support the theoretical analysis of fine-tuning after pre-training. The effectiveness of Eq. (7) will be demonstrated in Section 5.

We directly provide the theoretical framework of unified kernels relative to the parameter updating in Eq. (7).

Theorem 2

For a network of depth $L$ with a Lipschitz activation $\phi$ and in the limit of the layer width $n_{1},\dots,n_{L-1}\to\infty$ , Eq. (7) induces a kernel with the following form, for $l\in[L]$ and $t\geq t^{\prime}$ ,

K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\exp\left(\frac{(t^{\prime}-t)\,|\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}}{\partial\Theta_{t}}\right\rangle\ ,

(8)

where $\rho_{t,t^{\prime}}$ denotes the correlation coefficient of variables along training epochs $t$ and $t^{\prime}$, and $\sigma_{t}^{2}$ and $\sigma_{t^{\prime}}^{2}$ are the corresponding variances. Furthermore, the unified kernel $K_{\textrm{UNK}}(t,t^{\prime},\cdot,\cdot)$ has the following properties

(i)

For the case of $\lambda=0$ or $t=t^{\prime}$, the unified kernel degenerates to the NTK kernel, that is, for $l\in[L]$

\begin{aligned} K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)};\lambda=0\right)&=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,\\ K_{\textrm{UNK}}^{(l)}\left(t,t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)&=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .\end{aligned}

(ii)

For the case of $\lambda\neq 0$ and $t-t^{\prime}\to\infty$, the unified kernel converges to the NNGP kernel, i.e., the following holds for $l\in[L]$ as $t-t^{\prime}\to\infty$,

K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\to K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .

Theorem 2, a general extension of Theorem 1, presents a unified kernel $K_{\textrm{UNK}}(t,t^{\prime},\cdot,\cdot)$ for neural network learning with Eq. (7). For the case of $t=t^{\prime}$ or $\lambda=0$, the proposed kernel degenerates to the NTK kernel, where the parameter updating obeys the Gaussian distribution. Correspondingly, for the case of $t-t^{\prime}\to\infty$ and $\lambda\neq 0$, the proposed kernel can approximate the NNGP kernel well, which implies that a neural network model trained by Eq. (7) can reach an equilibrium state in a long-time regime. We provide a proof sketch in Subsection 3.3; the full proof can be found in the Appendix.
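As a quick consistency check between the two theorems, one can substitute $t^{\prime}=0$ into the prefactor of Eq. (8). The identification of $\rho_{t,0}$ with $\rho_{t}$ in the last step is our reading of the notation rather than a statement taken from the text.

```latex
\exp\left(\frac{(t^{\prime}-t)\,|\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\,\sigma_{t}\sigma_{t^{\prime}}}\right)\Bigg|_{t^{\prime}=0}
=\exp\left(\frac{-t\,|\lambda|}{\sqrt{1-\rho_{t,0}^{2}}\,\sigma_{t}\sigma_{0}}\right)
=\exp\left(\frac{-t\,|\lambda|}{\sqrt{1-\rho_{t}^{2}}\,\sigma_{0}\sigma_{t}}\right)
```

This is the prefactor of Eq. (5). Likewise, at $t=t^{\prime}$ or $\lambda=0$ the exponent vanishes and the prefactor equals 1, in line with property (i).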

It is observed that the unified kernel led by Eq. (7) can be re-written in a recursive form

\begin{aligned} K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)&=K_{\textrm{UNK}}^{(l-1)}\left(t,t^{\prime},\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\Big|_{\Theta_{t}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\Big|_{\Theta_{t^{\prime}}}\right\rangle\\ &\quad+\exp\left(\frac{(t^{\prime}-t)\,|\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t^{\prime}})\right)\ .\end{aligned}

(9)

3.3 Proof Sketch

It is obvious that Eq. (4) is a special case of Eq. (7) when one forces $t^{\prime}=0$. We start the proof by unfolding Eq. (7) in the following discrete form

\Theta_{t+\mathop{}\!\mathrm{d}t}=\Theta_{t}-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta)}{\mathop{}\!\mathrm{d}\Theta}\Big|_{t}-\lambda\Theta_{t^{\prime}}\ ,

where $t+\mathop{}\!\mathrm{d}t$ and $t$ represent the epoch stamps, in which $\mathop{}\!\mathrm{d}t$ denotes the epoch infinitesimal. By the induction hypothesis, we can take $\Theta_{t^{\prime}}$ to be drawn from the Gaussian distribution $\mathcal{N}(0,\sigma_{t^{\prime}}^{2})$. By direct computation, we have

\mathrm{Var}\left(\Theta_{t+\mathop{}\!\mathrm{d}t}\right)=\textrm{Var}\left(\Theta_{t}-\nabla_{t}\right)+\lambda^{2}\textrm{Var}\left(\Theta_{t^{\prime}}\right)+2\left[\mathbb{E}\left(\Theta_{t}-\nabla_{t}\right)\mathbb{E}\left(\lambda\Theta_{t^{\prime}}\right)-\mathbb{E}\left((\Theta_{t}-\nabla_{t})\lambda\Theta_{t^{\prime}}\right)\right]\ ,

where $\nabla_{t}={\mathop{}\!\mathrm{d}\hbar(\Theta_{t})}/{\mathop{}\!\mathrm{d}\Theta_{t}}$. Notice that $\Theta_{t}-\nabla_{t}$ is almost independent of $\Theta_{t^{\prime}}$ as $t\to\infty$. It is observed that $\mathrm{Var}(\Theta_{t+\mathop{}\!\mathrm{d}t})$ converges as $n\to\infty$ and $t\to\infty$. Thus, the sequence $\{\mathrm{Var}(\Theta_{t})\}_{t}$ is bounded. Here, we define $\sigma_{t}^{2}$ such that $\mathrm{Var}(\Theta_{t})\leq\sigma_{t}^{2}$, and $\sigma^{2}=\max_{t}\sigma_{t}^{2}$. Let $f_{\Theta_{t}}(\cdot)$ denote the probability density function of $\Theta_{t}$. Thus, we have

f_{\Theta_{t+\mathop{}\!\mathrm{d}t}}(u)=\iiint\delta(v)f_{\Theta_{t}}(x)f_{\nabla_{t}}(y)f_{\Theta_{0}}(z)\mathop{}\!\mathrm{d}x\!\mathop{}\!\mathrm{d}y\!\mathop{}\!\mathrm{d}z

(10)

with

\left\{\begin{aligned} f_{\Theta_{t}}(x)&=\frac{1}{\sigma_{x}\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2\sigma_{x}^{2}}\right)\\ f_{\nabla_{t}}(y)&=\frac{1}{\sigma_{y}\sqrt{2\pi}}\exp\left(-\frac{y^{2}}{2\sigma_{y}^{2}}\right)\\ f_{\Theta_{0}}(z)&=\frac{1}{\sigma_{z}\sqrt{2\pi}}\exp\left(-\frac{z^{2}}{2\sigma_{z}^{2}}\right)\end{aligned}\right.

where $v=u-x+y+\lambda z$ and $\delta(\cdot)$ indicates the Dirac-delta function. According to the independence, one has

f_{\Theta_{t+\mathop{}\!\mathrm{d}t}}(u)=\iint_{x,y}f_{\Theta_{t}}(x)f_{\nabla_{t}}(y)\mathop{}\!\mathrm{d}x\!\mathop{}\!\mathrm{d}y\int_{\Omega_{z}}f_{\Theta_{0}}(z)\mathop{}\!\mathrm{d}z\ ,

(11)

where $\Omega_{z}=\{(x,y)\mid(-u+x-y)/\lambda=0\}$ . Thus, we can claim that $\Theta_{t+\mathop{}\!\mathrm{d}t}$ obeys the Gaussian distribution with zero mean, which completes the mathematical induction.
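The induction step above can be checked numerically on samples. The sketch below draws independent zero-mean Gaussian samples standing in for $\Theta_{t}$, $\nabla_{t}$, and $\Theta_{t^{\prime}}$, forms $\Theta_{t+\mathrm{d}t}=\Theta_{t}-\nabla_{t}-\lambda\Theta_{t^{\prime}}$, and verifies the variance expansion used earlier in this proof sketch. The standard deviations, the value of $\lambda$, the sample size, and full independence of the three variables are assumptions for illustration; under independence the variance should be close to $\sigma_{x}^{2}+\sigma_{y}^{2}+\lambda^{2}\sigma_{z}^{2}$.

```python
import random


def mean(v):
    return sum(v) / len(v)


def var(v):
    """Population variance of a sample."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)


rng = random.Random(0)
n, lam = 50_000, 0.3          # hypothetical sample size and lambda
sx, sy, sz = 1.0, 0.5, 2.0    # hypothetical standard deviations

x = [rng.gauss(0.0, sx) for _ in range(n)]  # stands in for Theta_t
y = [rng.gauss(0.0, sy) for _ in range(n)]  # stands in for nabla_t
z = [rng.gauss(0.0, sz) for _ in range(n)]  # stands in for Theta_{t'}

a = [xi - yi for xi, yi in zip(x, y)]          # Theta_t - nabla_t
new = [ai - lam * zi for ai, zi in zip(a, z)]  # Theta_{t+dt}

lhs = var(new)
# Var(A) + lam^2 Var(B) + 2 [E(A) E(lam B) - E(A lam B)]
rhs = (var(a) + lam ** 2 * var(z)
       + 2.0 * (mean(a) * mean([lam * zi for zi in z])
                - mean([ai * lam * zi for ai, zi in zip(a, z)])))
predicted = sx ** 2 + sy ** 2 + lam ** 2 * sz ** 2  # value under independence
print(mean(new), lhs, rhs, predicted)
```

The expansion holds exactly on the sample (up to floating-point error), while the agreement with `predicted` and the near-zero sample mean are statistical and improve with the sample size.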

All statistics of post-synaptic variables $\bm{s}$ can be calculated via the moment generating function $\mathcal{M}_{\bm{s}}(a)=\int\mathop{}\!\mathrm{e}^{a\bm{s}}f(\bm{s})\mathop{}\!\mathrm{d}\bm{s}$. Here, we focus on the second moment of $s=\bm{s}^{(l)}_{i}$ for $l\in[L]$ and $i\in[n_{l}]$, that is,

m_{2}(s)=\int s^{2}\,f(s)\mathop{}\!\mathrm{d}s=\int s^{2}(\Theta)\,f_{\Theta}(\Theta)\,\frac{\mathop{}\!\mathrm{d}s(\Theta)}{\mathop{}\!\mathrm{d}\Theta}\mathop{}\!\mathrm{d}\Theta\ .

(12)

By substituting Eq. (10) into Eq. (12), we can obtain the concerned kernel

K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\exp\left(\frac{(t^{\prime}-t)\,|\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}}{\partial\Theta_{t}}\right\rangle\ ,

which is the desired kernel in Theorem 2; setting $t^{\prime}=0$ recovers the kernel of Theorem 1 in Eq. (5).

It is observed that Eq. (8) equals the NTK kernel in the case of $\lambda=0$ or $t=t^{\prime}$. Similarly, it is easily proved that

\lim\limits_{t\to\infty}\int_{t^{\prime}}^{t}K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\mathop{}\!\mathrm{d}t=\frac{\sigma^{2}\delta(t)}{|\lambda|}K_{\textrm{NNGP}}^{(l)}\ ,

where $\delta(t)\propto\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sim\mathbf{\Theta}((t-t^{\prime})^{-1})$. The above formula reveals that a smaller absolute value of $\lambda$ may lead to a larger convergence rate. Thus, we have

K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\to K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,

as $t-t^{\prime}\to\infty$. The detailed proof can be found in the Appendix. $\square$

4 Uniform Tightness and Convergence

Here, we provide two theorems to further show the theoretical properties of the proposed UNK kernel.

4.1 Uniform Tightness of the UNK Kernel

Now, we present the following theorem.

Theorem 3

For any $l\in[L]$ , the unified kernel $K_{\textrm{UNK}}^{(l)}$ , described in Theorem 2, is uniformly tight in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$ .

Theorem 3 delineates the asymptotic behavior of $K_{\textrm{UNK}}^{(l)}$ as $t-t^{\prime}\to\infty$ for $l\in[L]$, revealing an intrinsic characteristic of uniform tightness. Based on Theorem 3, one can obtain the properties of functional limit and continuity of $K_{\textrm{UNK}}^{(l)}$, in analogy to those of $K_{\textrm{NNGP}}^{(l)}$ (bracale2020:asymptotic).

Theorem 3 builds upon three useful lemmas from (zhang2022:NNGP).

Lemma 4.4

Let $\{\bm{s}_{1},\bm{s}_{2},\dots,\bm{s}_{t}\}$ denote a sequence of random variables in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$. This stochastic process is uniformly tight in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$ if the following two conditions hold: (1) $\bm{x}=\bm{0}$ is a uniformly tight point of $\bm{s}_{t}(\bm{x})$ ($t\in[T]$) in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$; (2) for any $\bm{x},\bm{x}^{\prime}\in\mathbb{R}^{n_{0}}$ and $t\in[T]$, there exist $\alpha,\beta,C>0$ such that

\mathbb{E}\left[|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}^{\prime})|^{\alpha}\right]\leq C\|\bm{x}-\bm{x}^{\prime}\|_{\beta+n_{0}}\ .

Lemma 4.4 provides the core criterion for proving Theorem 3.

Lemma 4.5

Based on the notations of Lemma 4.4, $\bm{x}=\bm{0}$ is a uniformly tight point of $\bm{s}_{t}(\bm{x})$ ( $t\in[T]$ ) in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$ .

The convergence in distribution from Lemma 4.5 paves the way for the convergence of expectations.

Lemma 4.6

Based on the notations of Lemma 4.4, for any $\bm{x},\bm{x}^{\prime}\in\mathbb{R}^{n_{0}}$ and $t\in[T]$, there exist $\alpha,\beta,C>0$ such that $\mathbb{E}\left[\|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}^{\prime})\|_{\alpha}^{\textrm{sup}}\right]\leq C\|\bm{x}-\bm{x}^{\prime}\|_{\beta+n_{0}}$.

The proofs of the lemmas above can be found in Appendix D. Notice that the above lemmas concern the stochastic process of hidden neuron vectors over increasing epochs regardless of the layer index, i.e., they hold for $\bm{s}^{(l)}$ $(l\in[L])$. For the case of two time stamps $t$ and $t^{\prime}$ with $t^{\prime}<t$, the concerned stochastic process becomes $\{\bm{s}_{t^{\prime}},\bm{s}_{t^{\prime}+1},\dots,\bm{s}_{t}\}$, and thus the above conclusions also hold. Therefore, Theorem 3 follows from Lemma 4.4, whose two conditions are verified by Lemmas 4.5 and 4.6.

4.2 Tight Bound for the Smallest Eigenvalue

In this subsection, we investigate the learning convergence of the UNK kernel. The key idea is to bound the smallest eigenvalue of $K_{\textrm{UNK}}^{(l)}$ for $l\in[L]$, since the learning convergence is related to the positive definiteness of the limiting neural kernels. Here, we consider neural networks equipped with the ReLU activation and draw the following conclusion.

Theorem 4.7

Let $\bm{x}_{1},\dots,\bm{x}_{N}$ be i.i.d. sampled from $P_{X}$, which satisfies $P_{X}=\mathcal{N}(0,\eta^{2})$, $\int\bm{x}\mathop{}\!\mathrm{d}P\left(\bm{x}\right)=0$, $\int\|\bm{x}\|_{2}\mathop{}\!\mathrm{d}P(\bm{x})=\mathbf{\Theta}(\sqrt{n_{0}})$, and $\int\|\bm{x}\|_{2}^{2}\mathop{}\!\mathrm{d}P(\bm{x})=\mathbf{\Theta}(n_{0})$. For an integer $r\geq 2$, with probability at least $1-\delta$, we have

\chi_{\min}\left(K_{\textrm{UNK}}^{(l)}\right)=\mathbf{\Theta}(n_{0})

for $l\in[L]$ , where $\chi_{\min}$ denotes the smallest eigenvalue and

\delta\leq N\mathop{}\!\mathrm{e}^{-\Omega(n_{0})}+N^{2}\mathop{}\!\mathrm{e}^{-\Omega(n_{0}N^{-2/(r-0.5)})}\ .

Theorem 4.7 provides a tight bound for the smallest eigenvalue of the UNK kernel $K_{\textrm{UNK}}^{(l)}$, which is closely related to the training convergence of neural networks. This nontrivial estimate reflects the characteristics of this kernel and is usually used as a key assumption in optimization and generalization analyses. The proof of Theorem 4.7 rests on the following inequalities about the smallest eigenvalue of real-valued symmetric square matrices. Given two symmetric matrices $\mathbf{A},\mathbf{B}\in\mathbb{R}^{m\times m}$, it is observed that

\left\{\begin{aligned} &\chi_{\min}(\mathbf{A}\mathbf{B})\geq\chi_{\min}(\mathbf{A})\cdot\min_{i\in[m]}\mathbf{B}(i,i)\ ,\\ &\chi_{\min}(\mathbf{A}+\mathbf{B})\geq\chi_{\min}(\mathbf{A})+\chi_{\min}(\mathbf{B})\ .\end{aligned}\right.

(13)
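
As a numerical sanity check of the inequalities in Eq. (13), the following minimal sketch draws random matrices and measures the slack of each bound. Two assumptions here are ours and are not stated in the text: the product in the first inequality is read as the entry-wise (Hadamard) product, and both matrices are taken positive semi-definite for that inequality. All helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_symmetric(m):
    """Return a random real symmetric m x m matrix."""
    g = rng.standard_normal((m, m))
    return (g + g.T) / 2.0


def random_psd(m):
    """Return a random positive semi-definite m x m matrix."""
    g = rng.standard_normal((m, m))
    return g @ g.T


def chi_min(mat):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return np.linalg.eigvalsh(mat)[0]


def sum_gap(a, b):
    """Slack of chi_min(A + B) >= chi_min(A) + chi_min(B) for symmetric A, B."""
    return chi_min(a + b) - (chi_min(a) + chi_min(b))


def product_gap(a, b):
    """Slack of chi_min(A o B) >= chi_min(A) * min_i B(i, i).

    Assumption of this sketch: 'o' is the Hadamard product and A, B are PSD.
    """
    return chi_min(a * b) - chi_min(a) * np.min(np.diag(b))


if __name__ == "__main__":
    print("sum gap:", sum_gap(random_symmetric(5), random_symmetric(5)))
    print("product gap:", product_gap(random_psd(5), random_psd(5)))
```

Both gaps should be non-negative up to floating-point error. The sum inequality needs only symmetry, whereas the product inequality in this reading relies on positive semi-definiteness.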

From Eq. (9), we can unfold $K_{\textrm{UNK}}^{(l)}$ as a sum of covariances of the sequence of random variables $\{\bm{s}^{(l-1)}\}$. Thus, we can bound $\chi_{\min}(K_{\textrm{UNK}}^{(l)})$ by $\mathrm{Cov}(\bm{s}^{(l-1)},\bm{s}^{(l-1)})$ via the chain of feedforward compositions in Eq. (1). For conciseness, we defer the proof of Theorem 4.7 to Appendix E.

Figure 1: The accuracy curves with various multipliers $\lambda\in\{0.001,0.01,0.1,0,1,10\}$, where the x- and y-axes denote the epoch and accuracy, respectively. Training accuracy curves of (a) Baseline $\Theta_{0}$, (b) Baseline $\Theta_{t^{\prime}}$, and (c) Grid Search. Testing accuracy curves of (e) Baseline $\Theta_{0}$, (f) Baseline $\Theta_{t^{\prime}}$, and (g) Grid Search. Comparison of (d) training and (h) testing accuracy curves between Baseline $\Theta_{0}$, Grid 0.001, and Grid 0.01.

5 Experiments

In this section, we conduct several experiments to evaluate the effectiveness of the proposed UNK kernel.

5.1 Datasets and Configurations

Following the experimental configurations of Lee et al. (lee2018:NNGP, ), we conduct the empirical evaluations on a two-hidden-layer MLP trained with various $\lambda$. The dataset is the MNIST handwritten digit data, which comprises a training set of 60,000 examples and a testing set of 10,000 examples in 10 classes, where each example is a digit centered in a $28\times 28$ image.

For the classification tasks, the class labels are encoded as regression targets, where the correct label is marked as 0.9 and each incorrect one is marked as 0.1 (zhang2022:NNGP, ). Here, we employ 5000 hidden neurons and the softmax activation function. Similar to (arora2019:NNGP, ), all weights are initialized from a Gaussian distribution with mean 0 and variance $0.3/n_{l}$ for $l\in[L]$. We also fix the batch size and the learning rate to 64 and 0.001, respectively. All experiments were conducted on an Intel Core i7-6500U.
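
The label encoding and the weight initialization described above can be sketched as follows. This is a minimal illustration and not our experimental code: the function names are hypothetical, and the width $n_{l}$ used in the variance $0.3/n_{l}$ is passed in explicitly by the caller.

```python
import math
import random


def encode_label(label, num_classes=10, hit=0.9, miss=0.1):
    """Regression-style target: `hit` at the true class, `miss` elsewhere."""
    return [hit if k == label else miss for k in range(num_classes)]


def init_weights(rows, cols, n_l, scale=0.3, seed=0):
    """Draw a rows x cols matrix with i.i.d. N(0, scale / n_l) entries."""
    rng = random.Random(seed)
    std = math.sqrt(scale / n_l)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    print(encode_label(3))
    weights = init_weights(rows=4, cols=100, n_l=100)
    print(len(weights), len(weights[0]))
```

Encoding the targets as 0.9 and 0.1 keeps the regression outputs away from the extremes 0 and 1.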

5.2 Experiments for Effects of Various Multipliers $\lambda$

The experiments aim to investigate the effects of various $\lambda$ on the performance of the UNK kernel. According to the recursive formulation of $K_{\textrm{UNK}}^{(l)}$, it is evident that $\lambda$ balances the gradient and the regularizer. From a theoretical perspective, the absolute value of $\lambda$ affects not only the limiting convergence rate of $K_{\textrm{UNK}}^{(l)}$ but also the optimal solution of Eq. (2). Provided $\Theta_{t^{\prime}}$, we can compute the optimal solution $\lambda^{*}_{t}$ at the current epoch stamp $t$ as follows

\lambda^{*}_{t}=\arg\min_{\lambda_{t^{\prime}}}\ \hbar(\Theta_{t+\mathop{}\!\mathrm{d}t})-\hbar(\Theta_{t})\ ,

(14)

where $\Theta_{t+\mathop{}\!\mathrm{d}t}=\Theta_{t}-{\mathop{}\!\mathrm{d}\hbar(\Theta_{t})}/{\mathop{}\!\mathrm{d}\Theta_{t}}-\lambda_{t^{\prime}}\Theta_{t^{\prime}}$. This optimization problem can be solved by mature algorithms, such as Bayesian optimization or grid search. Here, we conjecture that $\lambda^{*}_{t}$ is an effective indicator for identifying the optimal trajectory of the UNK kernel.
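
To make the grid search for $\lambda^{*}_{t}$ concrete, the sketch below performs one search step on a toy problem. Several choices are assumptions of this illustration and not part of our experiments: the quadratic loss standing in for $\hbar$, the learning rate `lr` multiplying both the gradient and the term $\lambda\Theta_{t^{\prime}}$, and a grid of `num_points` multiples of the granularity starting at zero.

```python
def loss(theta, target):
    """Toy quadratic loss 0.5 * ||theta - target||^2 (stand-in for hbar)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(theta, target))


def grad(theta, target):
    """Gradient of the toy loss with respect to theta."""
    return [a - b for a, b in zip(theta, target)]


def step(theta_t, theta_ref, lam, target, lr):
    """One update theta_t - lr * (grad + lam * theta_ref)."""
    g = grad(theta_t, target)
    return [a - lr * (gi + lam * r) for a, gi, r in zip(theta_t, g, theta_ref)]


def grid_search_lambda(theta_t, theta_ref, target, lr=0.1,
                       granularity=0.01, num_points=11):
    """Pick the grid value of lambda that minimizes the loss change."""
    base = loss(theta_t, target)
    best_lam, best_delta = None, None
    for k in range(num_points):
        lam = k * granularity
        delta = loss(step(theta_t, theta_ref, lam, target, lr), target) - base
        if best_delta is None or delta < best_delta:
            best_lam, best_delta = lam, delta
    return best_lam, best_delta


if __name__ == "__main__":
    lam, delta = grid_search_lambda([1.0, -2.0], [0.5, -1.0], [0.0, 0.0])
    print("selected lambda:", lam, "loss change:", delta)
```

Since $\lambda=0$ belongs to the grid, the selected multiplier never yields a worse one-step loss change than plain gradient descent on this toy problem.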

Here, we set the investigated values of the multiplier $\lambda$ to $\{0.001,0.01,0.1,0,1,10\}$ and employ three types of studied models as follows

\left\{\begin{aligned} \textrm{Baseline $\Theta_{0}$}:&\quad\frac{\mathop{}\!\mathrm{d}\Theta}{\mathop{}\!\mathrm{d}t}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta_{t})}{\mathop{}\!\mathrm{d}\Theta_{t}}-\lambda\Theta_{0}\ ,\\ \textrm{Baseline $\Theta_{t^{\prime}}$}:&\quad\frac{\mathop{}\!\mathrm{d}\Theta}{\mathop{}\!\mathrm{d}t}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta_{t})}{\mathop{}\!\mathrm{d}\Theta_{t}}-\lambda_{t^{\prime}}\Theta_{t^{\prime}}\ ,\\ \textrm{Grid Search}:&\quad\frac{\mathop{}\!\mathrm{d}\Theta}{\mathop{}\!\mathrm{d}t}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta_{t})}{\mathop{}\!\mathrm{d}\Theta_{t}}-\lambda^{*}_{t}\Theta_{t-\mathop{}\!\mathrm{d}t}\ ,\end{aligned}\right.

where the optimization problem in Eq. (14) is solved by grid search with granularities of 0.001 and 0.01, denoted as Grid 0.001 and Grid 0.01, respectively.

Figure 1 plots the accuracy curves for the various multipliers. There are several observations: (1) the performance of the training algorithms led by Eq. (2) is comparable to that of typical gradient descent in various configurations, (2) $\lambda=1$ and $\lambda=10$ are so large that they hamper the performance of the UNK kernel, and (3) Grid 0.01 provides a starting point with higher accuracy and achieves the fastest convergence speed and best accuracy. The above observations not only show the effectiveness of our proposed UNK kernel, but also coincide with our theoretical conclusions that the UNK kernel converges to the NNGP kernel as $t\to\infty$ and that a smaller value of $\lambda$ may lead to a larger convergence rate.

In detail, Table 1 lists the optimal trajectory and the corresponding training accuracy of Grid 0.001 and Grid 0.01 over the epoch. It is observed that (1) the optimal trajectory of the UNK kernel and the path of typical gradient descent are not completely consistent, and (2) both Grid 0.001 and Grid 0.01 achieve faster convergence speed and better accuracy than those of the baseline methods. These results further demonstrate the effectiveness of our proposed UNK kernel.

Epoch $t$	Baseline ACC.	Grid 0.001 $\lambda^{*}_{t}$	Grid 0.001 ACC.	Grid 0.01 $\lambda^{*}_{t}$	Grid 0.01 ACC.
1	0.1289	0.0100	0.9257	0.0800	0.9266
2	0.9256	0.0020	0.9506	0.0800	0.9521
3	0.9504	0.0040	0.9631	0.0900	0.9656
4	0.9629	0.0080	0.9708	0.0700	0.9737
5	0.9705	0.0070	0.9766	0.0900	0.9793
6	0.9763	0.0050	0.9802	0.1000	0.9839
7	0.9800	0.0060	0.9834	0.1000	0.9870
8	0.9831	0.0000	0.9858	0.0800	0.9899
9	0.9855	0.0080	0.9879	0.0500	0.9922
10	0.9875	0.0000	0.9898	0.0900	0.9939
11	0.9896	0.0000	0.9913	0.0600	0.9952
12	0.9910	0.0000	0.9923	0.0600	0.9963
13	0.9922	0.0040	0.9933	0.0700	0.9971
14	0.9931	0.0020	0.9943	0.0800	0.9977
15	0.9941	0.0020	0.9952	0.0500	0.9984
16	0.9949	0.0080	0.9959	0.0700	0.9987
17	0.9957	0.0060	0.9966	0.0900	0.9992
18	0.9963	0.0070	0.9972	0.0700	0.9995
19	0.9969	0.0070	0.9977	0.0000	0.9996
20	0.9974	0.0100	0.9981	0.0800	0.9998
21	0.9978	0.0070	0.9984	0.0100	0.9997
22	0.9982	0.0100	0.9986	0.0200	0.9999
23	0.9984	0.0050	0.9987	0.0000	0.9999
24	0.9986	0.0000	0.9989	0.0000	0.9999
25	0.9988	0.0050	0.9990	0.0000	0.9999
26	0.9989	0.0030	0.9992	0.0000	1.0000

Table 1: Illustration of $\lambda^{*}_{t}$ and the corresponding training accuracy (ACC.) of Grid 0.001 and Grid 0.01 over epoch $t$.

5.3 Experiments for the UNK kernel

This experiment investigates the representation ability of our proposed UNK kernel. The indicator is computed as

\gamma^{2}_{i}=\frac{K(T,0,\bm{x}_{i})}{K(0,0,\bm{x}_{i})K(T,T,\bm{x}_{i})}\ ,

where $\bm{x}_{i}$ indicates the $i$ -th instance, and $K(T,0,\bm{x}_{i})$ denotes the UNK kernel trained by solving Eq. (14)

K(t,t^{\prime},\bm{x}_{i})\triangleq K_{\textrm{UNK}}^{(L)}\left(t,t^{\prime},\bm{s}^{L-1}_{i}(t),\bm{s}^{L-1}_{i}(t^{\prime});\lambda^{*}_{t}\right)\ .

The value of $\gamma_{i}$ manifests the correlation between the outputs of the UNK kernels with initialized and optimized parameters. According to the theoretical results in Section 3, the UNK kernel is said to be valid if the kernel outputs brought by initialized and optimized parameters are markedly discriminative. In other words, a valid UNK is able to classify digits well in this experiment, and thus $\gamma_{i}$ should equal $0.1\times 1=0.1$, where 0.1 and 1 denote the accuracies of the UNK with initialized and optimized parameters, respectively. Ideally, the value of $\gamma_{i}$ in this experiment should tend towards 0.1, that is, $\mathbb{E}_{i}(\gamma_{i})=0.1$. If $|\gamma_{i}|$ comes near one, the kernel cannot recognize the difference between the kernel outputs brought by initialized and optimized parameters, and thus the kernel is invalid.

Figure 2 displays the (training and testing) correlation histograms and their averages for our proposed UNK kernel with grid search granularities of 0.001 and 0.01. It is observed that the average training correlation values of Grid 0.001 and Grid 0.01 are almost 0.13 as the training accuracy goes to 100%, which implies that the trained UNK kernel is valid for classifying MNIST. This is an encouraging result for the theory and development of neural kernel learning.

Notice that the average training correlation values for Grid 0.001 and Grid 0.01 are not precisely equal to 0.1, and the average testing correlation values for Grid 0.001 and Grid 0.01 are approximately 0.2 instead of the stated value of 0.1. These discrepancies could be attributed to several factors, including gaps between the softmax and labeled vectors and out-of-distribution errors. More detailed experimental results are listed in Appendix F.

6 Conclusions

In this paper, we proposed the UNK kernel, a unified framework for neural network learning that draws upon the learning dynamics associated with gradient descents and parameter initialization. Our investigation explores theoretical aspects, such as the existence, limiting properties, uniform tightness, and learning convergence of the proposed UNK kernel. Our main findings highlight that the UNK kernel exhibits behaviors akin to the NTK kernel with a finite learning step and converges to the NNGP kernel as the learning step approaches infinity. Experimental results further emphasize the effectiveness of our proposed method.

Impact Statements

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

(1) S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32, pages 8141–8150, 2019.
(2) S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 322–332, 2019.
(3) Y. Avidan, Q. Li, and H. Sompolinsky. Connecting NTK and NNGP: A unified theoretical framework for neural network learning dynamics in the kernel regime. arXiv:2309.04522, 2023.
(4) P. Billingsley. Convergence of Probability Measures. John Wiley & Sons, 2013.
(5) D. Bracale, S. Favaro, S. Fortini, and S. Peluchetti. Large-width functional asymptotics for deep gaussian neural networks. In Proceedings of the 8th International Conference on Learning Representations, 2020.
(6) Y. Cho and L. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems 22, pages 342–350, 2009.
(7) S. S. Du, K. Hou, R. R. Salakhutdinov, B. Poczos, R. Wang, and K. Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems 32, pages 5723–5733, 2019.
(8) A. Garriga-Alonso, C. Rasmussen, and L. Aitchison. Deep convolutional networks as shallow gaussian processes. In Proceedings of the 7th International Conference on Learning Representations, 2019.
(9) J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak. Infinite attention: NNGP and NTK for deep attention networks. In Proceedings of the 37th International Conference on Machine Learning, pages 4376–4386, 2020.
(10) B. Huang, X. Li, Z. Song, and X. Yang. FL-NTK: A neural tangent kernel-based framework for federated learning analysis. In Proceedings of the 38th International Conference on Machine Learning, pages 4423–4434, 2021.
(11) A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, pages 8580–8589, 2018.
(12) J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as gaussian processes. In Proceedings of the 6th International Conference on Learning Representations, 2018.
(13) J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl-Dickstein. Finite versus infinite neural networks: An empirical study. In Advances in Neural Information Processing Systems 33, pages 15156–15172, 2020.
(14) A. Mahankali, J. Z. Haochen, K. Dong, M. Glasgow, and T. Ma. Beyond NTK with vanilla gradient descent: A mean-field analysis of neural networks with polynomial width, samples, and time. arXiv:2306.16361, 2023.
(15) S. Malladi, A. Wettig, D. Yu, D. Chen, and S. Arora. A kernel-based view of language model fine-tuning. In Proceedings of the 40th International Conference on Machine Learning, pages 23610–23641, 2023.
(16) M. Mézard, G. Parisi, and M. A. Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications. World Scientific Publishing Company, 1987.
(17) R. M. Neal. Priors for infinite networks. Bayesian Learning for Neural Networks, pages 29–53, 1996.
(18) Q. Nguyen, M. Mondelli, and G. Montufar. Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep relu networks. In Proceedings of the 38th International Conference on Machine Learning, pages 8119–8129, 2021.
(19) R. Novak, L. Xiao, Y. Bahri, J. Lee, G. Yang, J. Hron, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. In Proceedings of the 6th International Conference on Learning Representations, 2018.
(20) G. Pang, L. Yang, and G. E. Karniadakis. Neural-net-induced gaussian process regression for function approximation and PDE solution. Journal of Computational Physics, 384:270–288, 2019.
(21) D. S. Park, J. Lee, D. Peng, Y. Cao, and J. Sohl-Dickstein. Towards NNGP-guided neural architecture search. arXiv:2011.06006, 2020.
(22) G. Pleiss and J. P. Cunningham. The limitations of large width in neural networks: A deep gaussian process perspective. In Advances in Neural Information Processing Systems 34, pages 3349–3363, 2021.
(23) T. Poggio, A. Banburski, and Q. Liao. Theoretical issues in deep networks. Proceedings of the National Academy of Sciences, 117(48):30039–30045, 2020.
(24) H. N. Salas. Gershgorin’s theorem for matrices of operators. Linear Algebra and its Applications, 291(1-3):15–36, 1999.
(25) D. Stroock and S. Varadhan. Multidimensional Diffusion Processes. Springer Science & Business Media, 1997.
(26) A. W. Van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
(27) G. Yang. Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are gaussian processes. In Advances in Neural Information Processing Systems 32, pages 9951–9960, 2019.
(28) S.-Q. Zhang, F. Wang, and F.-L. Fan. Neural network gaussian processes by increasing depth. IEEE Transactions on Neural Networks and Learning Systems, 2022.
(29) S.-Q. Zhang and Z.-H. Zhou. Arise: Aperiodic semi-parametric process for efficient markets without periodogram and gaussianity assumptions. arXiv:2111.06222, 2021.

Appendix

This appendix provides the supplementary materials for our work “A Unified Kernel for Neural Network Learning”, organized according to the corresponding sections therein. Before that, we first review the useful notations. Let $[N]=\{1,2,\dots,N\}$ be an integer set for $N\in\mathbb{N}^{+}$, and let $|\cdot|_{\#}$ denote the number of elements in a collection, e.g., $|[N]|_{\#}=N$. Given two functions $g,h\colon\mathbb{N}^{+}\rightarrow\mathbb{R}$, we write $h=\mathbf{\Theta}(g)$ if there exist positive constants $c_{1},c_{2}$, and $n_{0}$ such that $c_{1}g(n)\leq h(n)\leq c_{2}g(n)$ for every $n\geq n_{0}$; $h=\mathcal{O}(g)$ if there exist positive constants $c$ and $n_{0}$ such that $h(n)\leq cg(n)$ for every $n\geq n_{0}$; $h=\Omega(g)$ if there exist positive constants $c$ and $n_{0}$ such that $h(n)\geq cg(n)$ for every $n\geq n_{0}$. We define the ball $\mathcal{B}(r)=\{\bm{x}\mid\|\bm{x}\|_{2}\leq r\}$ for any $r\in\mathbb{R}^{+}$. Let $\mathbf{I}_{n}$ be the $n\times n$-dimensional identity matrix. Let $\|\cdot\|_{p}$ be the norm of a vector or matrix, in which we employ $p=2$ as the default. Given $\bm{x}=(x_{1},\dots,x_{n})$ and $\bm{y}=(y_{1},\dots,y_{n})$, we also define the sup-related measure as $\|\bm{x}-\bm{y}\|_{\alpha}^{\textrm{sup}}=\sup_{i\in[n]}\big|x_{i}-y_{i}\big|^{\alpha}$ for $\alpha>0$.

Let $\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})$ be the space of continuous functions where $n_{0},n\in\mathbb{N}$. Provided a linear and bounded functional $\mathcal{F}:\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})\to\mathbb{R}$ and a function $f\in\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})$ which satisfies $f(\bm{x})\overset{\mathrm{d}}{\to}f^{*}$, we have $\mathcal{F}(f(\bm{x}))\overset{\mathrm{d}}{\to}\mathcal{F}(f^{*})$ and $\mathbb{E}\left[\mathcal{F}(f(\bm{x}))\right]\to\mathbb{E}\left[\mathcal{F}(f^{*})\right]$ according to the General Transformation Theorem [26, Theorem 2.3] and Uniform Integrability [4], respectively.

Appendix A Theoretical Derivations of NNGP and NTK

A.1 NNGP and NTK

Here, we consider an $L$-hidden-layer fully-connected neural network, where $n_{l}$ and $n_{0}$ indicate the number of neurons in the $l$-th hidden layer for $l\in[L]$ and in the input, respectively, as follows

\left\{\ \begin{aligned} \bm{s}^{(0)}&=\bm{x}\ ,\\ \bm{h}^{(l)}&=\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\ ,\quad l\in[L]\ ,\\ \bm{s}^{(l)}&=\phi(\bm{h}^{(l)})\ ,\quad l\in[L]\ ,\\ \bm{y}&=\bm{s}^{(L)}\ ,\end{aligned}\right.

in which $\bm{x}\in\mathbb{R}^{n_{0}}$ and $\bm{y}\in\mathbb{R}^{n_{L}}$ indicate the variables of inputs and outputs, respectively, $\bm{h}^{(l)}\in\mathbb{R}^{n_{l}}$ and $\bm{s}^{(l)}\in\mathbb{R}^{n_{l}}$ denote the pre-synaptic and post-synaptic variables of the $l$-th hidden layer, respectively, $\mathbf{W}^{(l)}\in\mathbb{R}^{n_{l}\times n_{l-1}}$ and $\bm{b}^{(l)}\in\mathbb{R}^{n_{l}}$ are the parameter variables of connection weights and bias, respectively, and $\phi$ is an element-wise activation function. For convenience, we denote the parameter variables at the $t$-th epoch by $\Theta^{(l)}(t)=[\mathbf{W}^{(l)},\bm{b}^{(l)}]$, and $\Theta^{(l)}(0)$ denotes the initialized parameters, each element of which obeys the Gaussian distribution $\mathcal{N}(0,\sigma^{2}/n_{l})$.
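
The feedforward recursion above can be written out as the following minimal sketch. The function names and the choice of ReLU as the default activation are assumptions of this illustration, while the initialization follows the stated distribution $\mathcal{N}(0,\sigma^{2}/n_{l})$.

```python
import numpy as np


def init_params(widths, sigma=1.0, seed=0):
    """Initialize [(W^(l), b^(l))] for widths = [n_0, n_1, ..., n_L].

    Every entry of W^(l) and b^(l) is drawn from N(0, sigma^2 / n_l).
    """
    rng = np.random.default_rng(seed)
    params = []
    for n_prev, n_l in zip(widths[:-1], widths[1:]):
        std = sigma / np.sqrt(n_l)
        weight = rng.normal(0.0, std, size=(n_l, n_prev))
        bias = rng.normal(0.0, std, size=n_l)
        params.append((weight, bias))
    return params


def forward(x, params, phi=lambda h: np.maximum(h, 0.0)):
    """Compute y = s^(L) via h^(l) = W^(l) s^(l-1) + b^(l), s^(l) = phi(h^(l))."""
    s = np.asarray(x, dtype=float)  # s^(0) = x
    for weight, bias in params:
        h = weight @ s + bias       # pre-synaptic variable h^(l)
        s = phi(h)                  # post-synaptic variable s^(l)
    return s


if __name__ == "__main__":
    params = init_params([4, 8, 8, 3])
    print(forward(np.ones(4), params))
```

Passing a different element-wise function as `phi` changes the activation without touching the recursion.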

Neural Network Gaussian Process (NNGP). For any $l\in[L]$ , there is a claim that the conditional variable $\bm{h}^{(l)}\mid\bm{s}^{(l-1)}$ obeys the Gaussian distribution. In detail, one has

\begin{aligned}
\textrm{Var}\left(\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\right)&=\textrm{Var}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)^{2}-\left[\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)\right]^{2}\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}\right)^{2}+2\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}\cdot\bm{b}^{(l)}\right)+\mathbb{E}\left(\bm{b}^{(l)}\right)^{2}-\left[\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}\right)\right]^{2}\\
&\quad-2\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}\right)\cdot\mathbb{E}\left(\bm{b}^{(l)}\right)-\left[\mathbb{E}\left(\bm{b}^{(l)}\right)\right]^{2}\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\right)^{2}\mathbb{E}\left(\bm{s}^{(l-1)}\right)^{2}+\mathbb{E}\left(\bm{b}^{(l)}\right)^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\bm{s}^{(l-1)}\right)^{2}+\textrm{Var}\left(\bm{b}^{(l)}\right)\ ,
\end{aligned}

where $\cdot^{2}$ and $\cdot$ denote the dot product, and the fourth equality holds since $\mathbb{E}(\mathbf{W}^{(l)})=\mathbf{0}$, $\mathbb{E}(\bm{b}^{(l)})=\bm{0}$, and the elements of $\mathbf{W}^{(l)}$ and $\bm{b}^{(l)}$ are mutually independent. Given $\bm{x}\sim\mathcal{N}(\bm{0},\mathbf{I}_{n_{0}})$, it is reasonable to assume by induction that $\bm{s}^{(l-1)}\sim\mathcal{N}(\bm{0},\mathbf{I}_{n_{l-1}}/C_{\phi})$, where

C_{\phi}=\frac{1}{\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left(\phi(z)\right)^{2}}\ .

Hence, one has

\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}}\left(\frac{1}{C_{\phi}}+1\right)\mathbf{I}_{n_{l}}\right)\ .
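
The conditional variance derived above can be checked by simulation for a single pre-synaptic coordinate. The sketch below is only an illustration under our own assumptions: a fixed vector plays the role of $\bm{s}^{(l-1)}$, the weights and bias have freely chosen variances `var_w` and `var_b`, and the function names are hypothetical.

```python
import random


def conditional_variance_mc(s, var_w, var_b, num_samples=20000, seed=0):
    """Monte Carlo estimate of Var(h | s) for h = sum_j W_j * s_j + b."""
    rng = random.Random(seed)
    std_w, std_b = var_w ** 0.5, var_b ** 0.5
    draws = []
    for _ in range(num_samples):
        h = sum(rng.gauss(0.0, std_w) * s_j for s_j in s)
        h += rng.gauss(0.0, std_b)
        draws.append(h)
    mean = sum(draws) / num_samples
    return sum((h - mean) ** 2 for h in draws) / (num_samples - 1)


def conditional_variance_closed_form(s, var_w, var_b):
    """Closed form Var(W) * <s, s> + Var(b) from the last line of the derivation."""
    return var_w * sum(s_j ** 2 for s_j in s) + var_b


if __name__ == "__main__":
    s_prev = [1.0, -2.0, 0.5]
    print(conditional_variance_mc(s_prev, 0.25, 0.1))
    print(conditional_variance_closed_form(s_prev, 0.25, 0.1))
```

The two numbers should agree up to Monte Carlo error, which shrinks as `num_samples` grows.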

Moreover, the NNGP kernel is defined by

K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\mathbb{E}\left\langle\bm{h}^{(l)}\mid\bm{s}^{(l-1)},\bm{h}^{(l)}\mid\bm{s}^{\prime(l-1)}\right\rangle=\sigma^{2}\ \mathbb{E}\left\langle\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right\rangle+\sigma^{2}

with

\lim\limits_{n_{l-1}\to\infty}\mathbb{E}\left\langle\bm{h}^{(l)}\mid\bm{s}^{(l-1)},\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\right\rangle=\sigma^{2}\left(\frac{1}{C_{\phi}}+1\right)\ .

In summary, we conclude the recursive form of the NNGP kernel as follows

K_{\textrm{NNGP}}^{(l)}\left(\bm{s},\bm{s}^{\prime}\right)=\sigma^{2}\ \mathbb{E}_{\bm{s}\sim\mathcal{N}(\bm{0},K_{\textrm{NNGP}}^{(l-1)})}\left\langle\bm{s},\bm{s}^{\prime}\right\rangle+\sigma^{2}\ .

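Neural Tangent Kernel (NTK). Let $f(\bm{x};\Theta)$ denote the network output and $\hbar(\Theta)$ the training loss. Under gradient descent with an infinitesimal step, the parameters evolve as

\frac{\mathop{}\!\mathrm{d}\Theta}{\mathop{}\!\mathrm{d}t}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta)}{\mathop{}\!\mathrm{d}\Theta}=-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta)}{\mathop{}\!\mathrm{d}f(\bm{x};\Theta)}\frac{\mathop{}\!\mathrm{d}f(\bm{x};\Theta)}{\mathop{}\!\mathrm{d}\Theta}\ .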

The loss $\hbar(\Theta)$ is non-increasing in the training epoch $t$ since

\frac{\partial\hbar(\Theta)}{\partial t}=\frac{\partial\hbar(\Theta)}{\partial\Theta}\frac{\partial\Theta}{\partial t}=-\nabla_{\Theta}\hbar(\Theta)\cdot\nabla_{\Theta}\hbar(\Theta)=-\|\nabla_{\Theta}\hbar(\Theta)\|^{2}\leq 0\ .
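
The monotonicity of the loss along the gradient flow can be illustrated by a discretized simulation. The sketch below is a toy example under our own assumptions: a separable quadratic loss with hypothetical curvatures stands in for $\hbar$, and the flow is approximated by explicit Euler steps of size `dt`, chosen small enough for the discrete loss to decrease as well.

```python
CURVATURE = [1.0, 4.0, 0.25]  # hypothetical positive curvatures of the toy loss


def loss(theta):
    """Toy loss hbar(theta) = 0.5 * sum_k c_k * theta_k^2."""
    return 0.5 * sum(c * v * v for c, v in zip(CURVATURE, theta))


def grad(theta):
    """Gradient of the toy loss."""
    return [c * v for c, v in zip(CURVATURE, theta)]


def gradient_flow(theta0, dt=0.01, num_steps=200):
    """Euler discretization of d(theta)/dt = -grad(theta); records the loss."""
    theta = list(theta0)
    losses = [loss(theta)]
    for _ in range(num_steps):
        g = grad(theta)
        theta = [v - dt * gi for v, gi in zip(theta, g)]
        losses.append(loss(theta))
    return theta, losses


if __name__ == "__main__":
    _, history = gradient_flow([1.0, -1.0, 2.0])
    print(history[0], history[-1])
```

The recorded losses form a non-increasing sequence, which mirrors the sign of $-\|\nabla_{\Theta}\hbar(\Theta)\|^{2}$ in the continuous-time argument.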

For any $l\geq 2$, there is a claim that the gradient variables $\partial\bm{h}^{(l)}/\partial\mathbf{W}_{ij}^{(l-1)}$ and $\partial\bm{h}^{(l)}/\partial\bm{b}_{i}^{(l-1)}$ obey the Gaussian distribution. In detail, for $i,j\in\mathbb{N}^{+}$, one has

\begin{aligned}
\textrm{Var}\left(\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)&=\textrm{Var}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)^{2}-\left[\mathbb{E}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)\right]^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\frac{\partial\bm{s}^{(l-1)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\right)^{2}\textrm{Var}\left(\bm{s}^{(l-2)}\right)
\end{aligned}

and

\begin{aligned}
\textrm{Var}\left(\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}}\right)&=\textrm{Var}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{b}_{i}^{(l-1)}}\right)\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{b}_{i}^{(l-1)}}\right)^{2}-\left[\mathbb{E}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{b}_{i}^{(l-1)}}\right)\right]^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{b}_{i}^{(l-1)}}\right)^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\right)^{2}\ ,
\end{aligned}

where ${\partial\bm{s}^{(l-1)}}/{\partial\bm{h}^{(l-1)}}$ denotes the dot operation. Hence, one has

\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}C^{\prime}_{\phi}C_{\phi}}\mathbf{I}_{n_{l-1}}\right)\quad\text{and}\quad\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}C^{\prime}_{\phi}}\mathbf{I}_{n_{l-1}}\right)\ ,

where

C^{\prime}_{\phi}=\frac{1}{\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left[\phi^{\prime}(z)\right]^{2}}\ .
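
For a concrete activation, the constants $C_{\phi}$ and $C^{\prime}_{\phi}$ can be evaluated numerically. The sketch below assumes that the expectations are second moments, i.e., $\mathbb{E}[\phi(z)^{2}]$ and $\mathbb{E}[\phi^{\prime}(z)^{2}]$, and uses ReLU as the example. The midpoint quadrature and the function names are choices of this illustration.

```python
import math


def gaussian_expectation(func, half_width=10.0, num_cells=200000):
    """Midpoint-rule estimate of E[func(z)] for z ~ N(0, 1)."""
    step = 2.0 * half_width / num_cells
    total = 0.0
    for k in range(num_cells):
        z = -half_width + (k + 0.5) * step
        total += func(z) * math.exp(-0.5 * z * z)
    return total * step / math.sqrt(2.0 * math.pi)


def inverse_second_moment(func):
    """Return 1 / E[func(z)^2], the form shared by C_phi and C'_phi."""
    return 1.0 / gaussian_expectation(lambda z: func(z) ** 2)


def relu(z):
    return max(z, 0.0)


def relu_prime(z):
    return 1.0 if z > 0.0 else 0.0


if __name__ == "__main__":
    print("C_phi  (ReLU):", inverse_second_moment(relu))
    print("C'_phi (ReLU):", inverse_second_moment(relu_prime))
```

For ReLU, both second moments equal $1/2$ by the symmetry of the standard Gaussian, so both constants are close to 2 under this reading.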

Moreover, the NTK kernel is defined by

K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=K_{\textrm{NTK}}^{(l-1)}\left(\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\right\rangle+K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,

for $l\geq 2$ and

K_{\textrm{NTK}}^{(1)}\left(\bm{x},\bm{x}^{\prime}\right)=K_{\textrm{NNGP}}^{(1)}\left(\bm{x},\bm{x}^{\prime}\right)\ ,

provided

\lim\limits_{n_{l-1}\to\infty}\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}},\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right\rangle=\frac{\sigma^{2}}{C^{\prime}_{\phi}C_{\phi}}\quad\text{and}\quad\lim\limits_{n_{l-1}\to\infty}\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}},\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}}\right\rangle=\frac{\sigma^{2}}{C^{\prime}_{\phi}}\ .

Appendix B Full Proof of Theorem 1 and Theorem 2

All statistics of post-synaptic variables $\bm{s}$ can be calculated via the moment generating function

\mathcal{M}_{\bm{s}}(t)=\int\mathop{}\!\mathrm{e}^{t\bm{s}}f(\bm{s})\mathop{}\!\mathrm{d}\bm{s}\ .

Here, we focus on the second moment of $s=\bm{s}^{(l)}_{i}$ for $l\in[L]$ and $i\in[n_{l}]$, that is,

m_{2}(s,t)=\int\frac{t^{2}s^{2}}{2!}\ f(s)\mathop{}\!\mathrm{d}s=\int\frac{t^{2}s^{2}(\Theta)}{2!}\ f_{\Theta}(\Theta)\ \frac{\mathop{}\!\mathrm{d}s(\Theta)}{\mathop{}\!\mathrm{d}\Theta}\mathop{}\!\mathrm{d}\Theta\ .

In the above equations, $s$ and $\Theta$ denote the variables of hidden states and parameters, respectively. Let $f_{\Theta_{t}}(\cdot)$ denote the probability density function of $\Theta_{t}$ . According to the formulation of $m_{2}(s)$ , we should compute the probability density function $f_{\Theta}(\Theta)$ . For convenience, we abbreviate $\Theta(t)$ as $\Theta_{t}$ throughout this proof.

According to the introduction in Section 3, Eq. (7) has a general updating formulation, taking Eq. (4) as a special case of $t^{\prime}=0$ . Hence, we here take a general formula as follows

\Theta_{t+\mathop{}\!\mathrm{d}t}=\Theta_{t}-\frac{\mathop{}\!\mathrm{d}\hbar(\Theta_{t})}{\mathop{}\!\mathrm{d}\Theta_{t}}-\lambda\Theta_{t^{\prime}}\ ,

where $\mathop{}\!\mathrm{d}t$ denotes the epoch infinitesimal. Here, we omit the learning rate for simplicity. Thus, we have

f_{\Theta_{t+\mathop{}\!\mathrm{d}t}}(u)=\iiint\delta(v)f_{\Theta_{t}}(x)f_{\nabla_{t}}(y)f_{\Theta_{0}}(z)\mathop{}\!\mathrm{d}x\!\mathop{}\!\mathrm{d}y\!\mathop{}\!\mathrm{d}z

with

\left\{\ \begin{aligned} f_{\Theta_{t}}(x)&=\frac{1}{\sigma_{t}\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2\sigma_{t}^{2}}\right)\\ f_{\nabla_{t}}(y)&=\frac{1}{\sigma_{y}\sqrt{2\pi}}\exp\left(-\frac{y^{2}}{2\sigma_{y}^{2}}\right)\\ f_{\Theta_{0}}(z)&=\frac{1}{\sigma_{z}\sqrt{2\pi}}\exp\left(-\frac{z^{2}}{2\sigma_{z}^{2}}\right)\end{aligned}\right.

where $v=u-x+y+\lambda z$, $\nabla_{t}={\mathop{}\!\mathrm{d}\hbar(\Theta_{t})}/{\mathop{}\!\mathrm{d}\Theta_{t}}$, and $\delta(\cdot)$ indicates the Dirac-delta function. Besides, one has

\begin{aligned}
\mathrm{Var}\left(\Theta_{t+\mathop{}\!\mathrm{d}t}\right)&=\textrm{Var}\left(\Theta_{t}-\nabla_{t}-\lambda\Theta_{t^{\prime}}\right)\\
&=\mathbb{E}\left(\Theta_{t}-\nabla_{t}-\lambda\Theta_{t^{\prime}}\right)^{2}-\left[\mathbb{E}\left(\Theta_{t}-\nabla_{t}-\lambda\Theta_{t^{\prime}}\right)\right]^{2}\\
&=\textrm{Var}\left(\Theta_{t}-\nabla_{t}\right)+\lambda^{2}\textrm{Var}\left(\Theta_{t^{\prime}}\right)+2\left[\mathbb{E}\left(\Theta_{t}-\nabla_{t}\right)\mathbb{E}\left(\lambda\Theta_{t^{\prime}}\right)-\mathbb{E}\left((\Theta_{t}-\nabla_{t})\lambda\Theta_{t^{\prime}}\right)\right]\ .
\end{aligned}

Notice that $\Theta_{t}-\nabla_{t}$ is almost independent of $\Theta_{t^{\prime}}$ as $t\to\infty$. It is observed that $\mathrm{Var}(\Theta_{t+\mathop{}\!\mathrm{d}t})$ converges as $n\to\infty$ and $t\to\infty$. Thus, the variable sequence $\{\mathrm{Var}(\Theta_{t})\}_{t}$ is bounded. Here, we define that

\mathrm{Var}(\Theta_{t})\leq\sigma_{t}^{2}\ .

Throughout this proof, we make the mild assumption that $\sigma^{2}=\max_{t}\sigma_{t}^{2}=\min_{t}\sigma_{t}^{2}$ for simplicity; otherwise, we usually employ $\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}$ instead of the above assumption, where $\rho_{t,t^{\prime}}$ denotes the correlation coefficient between the parameter variables $\Theta_{t}$ and $\Theta_{t^{\prime}}$.

Moreover, we have

f_{\Theta_{t+\mathop{}\!\mathrm{d}t}}(u)=\iiint\delta(v)f_{\Theta_{t}}(x)f_{% \nabla_{t}}(y)f_{\Theta_{0}}(z)\mathop{}\!\mathrm{d}x\!\mathop{}\!\mathrm{d}y% \!\mathop{}\!\mathrm{d}z=\iint_{x,y}f_{\Theta_{t}}(x)f_{\nabla_{t}}(y)\mathop{% }\!\mathrm{d}x\!\mathop{}\!\mathrm{d}y\int_{\Omega_{z}}f_{\Theta_{0}}(z)% \mathop{}\!\mathrm{d}z\ ,

where $\Omega_{z}=\{(x,y)\mid(-u+x-y)/\lambda=0\}$ . Thus, we can conjecture that $\Theta_{t+\mathop{}\!\mathrm{d}t}$ obeys the Gaussian distribution with zero mean. Suppose that $\Theta_{t+\mathop{}\!\mathrm{d}t}\sim\mathcal{N}(0,\sigma_{t+\mathop{}\!% \mathrm{d}t}^{2})$ and

f_{\Theta_{t+\mathop{}\!\mathrm{d}t}}(x)=\frac{1}{\sigma_{t+\mathop{}\!\mathrm% {d}t}\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2\sigma_{t+\mathop{}\!\mathrm{d}t}^{2% }}\right)\ .
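The Gaussian form assumed above can be sanity-checked numerically. The following is a minimal sketch, assuming that $\Theta_{t}$, $\nabla_{t}$, and $\Theta_{0}$ are exactly independent zero-mean Gaussians (the proof only argues that they are almost independent); the function names and the sample size are illustrative choices and do not appear in the manuscript. Under this assumption, the update $\Theta_{t}-\nabla_{t}-\lambda\Theta_{0}$ is again zero-mean Gaussian with variance $\sigma_{t}^{2}+\sigma_{y}^{2}+\lambda^{2}\sigma_{z}^{2}$.

```python
import numpy as np

def simulate_update_variance(sigma_t, sigma_y, sigma_z, lam,
                             n_samples=200_000, seed=0):
    """Monte Carlo mean/variance of theta_next = theta_t - grad_t - lam * theta_0.

    Assumption (not from the manuscript): the three terms are independent
    zero-mean Gaussians with standard deviations sigma_t, sigma_y, sigma_z.
    """
    rng = np.random.default_rng(seed)
    theta_t = rng.normal(0.0, sigma_t, n_samples)
    grad_t = rng.normal(0.0, sigma_y, n_samples)
    theta_0 = rng.normal(0.0, sigma_z, n_samples)
    theta_next = theta_t - grad_t - lam * theta_0
    return float(theta_next.mean()), float(theta_next.var())

def closed_form_variance(sigma_t, sigma_y, sigma_z, lam):
    """Variance of a sum of independent Gaussians: variances add up."""
    return sigma_t ** 2 + sigma_y ** 2 + (lam * sigma_z) ** 2

if __name__ == "__main__":
    mean, var = simulate_update_variance(1.0, 0.5, 2.0, 0.1)
    print(mean, var, closed_form_variance(1.0, 0.5, 2.0, 0.1))
```

The empirical mean should stay near zero and the empirical variance near the closed-form value, which is the behavior the zero-mean Gaussian ansatz relies on.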

Thus, we have

	$\displaystyle m_{2}(\Theta,t)$	$\displaystyle=\int\frac{t^{2}s^{2}(\Theta)}{2!}\leavevmode\nobreak\ f_{\Theta}% (\Theta)\leavevmode\nobreak\ \frac{\mathop{}\!\mathrm{d}s(\Theta)}{\mathop{}\!% \mathrm{d}\Theta}\mathop{}\!\mathrm{d}\Theta$
		$\displaystyle=\int\frac{t^{2}s^{2}(\Theta)}{2!}\leavevmode\nobreak\ \frac{1}{% \sigma_{t+\mathop{}\!\mathrm{d}t}\sqrt{2\pi}}\exp\left(-\frac{\Theta^{2}}{2% \sigma_{t+\mathop{}\!\mathrm{d}t}^{2}}\right)\frac{\mathop{}\!\mathrm{d}s(% \Theta)}{\mathop{}\!\mathrm{d}\Theta}\mathop{}\!\mathrm{d}\Theta$
		$\displaystyle=\int\frac{t^{2}}{2!}\phi^{2}(h(\Theta))\leavevmode\nobreak\ % \frac{1}{\sigma_{t+\mathop{}\!\mathrm{d}t}\sqrt{2\pi}}\exp\left(-\frac{\Theta^% {2}}{2\sigma_{t+\mathop{}\!\mathrm{d}t}^{2}}\right)\frac{\mathop{}\!\mathrm{d}% \phi(h(\Theta))}{\mathop{}\!\mathrm{d}\Theta}\mathop{}\!\mathrm{d}\Theta\ ,$

where $h(\cdot)$ corresponds to $\bm{h}_{i}^{(l)}(\cdot)$ . The above equation can be extended to a vectorized formulation. Provided $s=\bm{s}^{(l)}$ and $h=\bm{h}^{(l)}$ , one has

m_{2}\left(\mathbf{W}^{(l)},t\right)=\int\frac{t^{2}}{2!}\phi^{2}\left(\mathbf% {W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)\leavevmode\nobreak\ \frac{1}{\sqrt% {2\pi|\mathbf{\Sigma}_{t}|}}\exp\left(-\frac{\mathbf{W}^{(l)}.^{2}\leavevmode% \nobreak\ \mathbf{\Sigma}_{t}^{-1}}{2}\right)\frac{\mathop{}\!\mathrm{d}\phi(% \bm{h}^{(l)})}{\mathop{}\!\mathrm{d}\bm{h}^{(l)}}\bm{s}^{(l-1)}\mathop{}\!% \mathrm{d}\mathbf{W}^{(l)}\ ,

m_{2}\left(\bm{b}^{(l)},t\right)=\int\frac{t^{2}}{2!}\phi^{2}\left(\mathbf{W}^% {(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)\leavevmode\nobreak\ \frac{1}{\sqrt{2% \pi|\mathbf{\Sigma}_{t}|}}\exp\left(-\frac{\bm{b}^{(l)}.^{2}\leavevmode% \nobreak\ \mathbf{\Sigma}_{t}^{-1}}{2}\right)\frac{\mathop{}\!\mathrm{d}\phi(% \bm{h}^{(l)})}{\mathop{}\!\mathrm{d}\bm{h}^{(l)}}\bm{1}_{n_{l}\times 1}\mathop% {}\!\mathrm{d}\bm{b}^{(l)}\ ,

and

m_{2}\left(\Theta,t\right)=\int\frac{t^{2}}{2!}\phi^{2}\left(\bm{h}^{(l)}(% \Theta)\right)\leavevmode\nobreak\ \frac{1}{\sigma_{t}\sqrt{2\pi}}\exp\left(-% \frac{\Theta^{2}}{2\sigma_{t}^{2}}\right)\frac{\mathop{}\!\mathrm{d}\phi(\bm{h% }^{(l)}(\Theta))}{\mathop{}\!\mathrm{d}\bm{h}^{(l)}(\Theta)}\mathbf{W}^{(l)}% \frac{\mathop{}\!\mathrm{d}\bm{s}^{(l-1)}(\Theta)}{\mathop{}\!\mathrm{d}\Theta% }\mathop{}\!\mathrm{d}\Theta\ ,\quad\textrm{otherwise}\ ,

where $\mathbf{\Sigma}_{t}$ indicates the corresponding variance matrix. Furthermore, provided two stamps $t$ and $t+\mathop{}\!\mathrm{d}t$ , we have

	$\displaystyle\mathbb{E}\left\langle\Theta_{t+\mathop{}\!\mathrm{d}t},\Theta_{t% }\right\rangle$	$\displaystyle=m_{2}(\Theta_{t+\mathop{}\!\mathrm{d}t},\Theta_{t},t+\mathop{}\!% \mathrm{d}t,t)$
		$\displaystyle=\iint\frac{t(t+\mathop{}\!\mathrm{d}t)}{2!}\Delta\left(\Theta_{t% +\mathop{}\!\mathrm{d}t},\Theta_{t},t+\mathop{}\!\mathrm{d}t,t\right)f_{\Theta% _{t+\mathop{}\!\mathrm{d}t},\Theta_{t}}\left(\Theta_{t+\mathop{}\!\mathrm{d}t}% ,\Theta_{t}\right)\mathop{}\!\mathrm{d}\Theta_{t+\mathop{}\!\mathrm{d}t}% \mathop{}\!\mathrm{d}\Theta_{t}\ ,$

where

\Delta\left(\Theta_{t+\mathop{}\!\mathrm{d}t},\Theta_{t},t+\mathop{}\!\mathrm{d}t,t\right)=\phi\left(\bm{h}^{(l)}(\Theta_{t+\mathop{}\!\mathrm{d}t})\right)\cdot\phi\left(\bm{h}^{\prime(l)}(\Theta_{t})\right)\cdot\frac{\mathop{}\!\mathrm{d}\phi\left(\bm{h}^{(l)}(\Theta_{t+\mathop{}\!\mathrm{d}t})\right)}{\mathop{}\!\mathrm{d}\Theta_{t+\mathop{}\!\mathrm{d}t}}\cdot\frac{\mathop{}\!\mathrm{d}\phi\left(\bm{h}^{\prime(l)}(\Theta_{t})\right)}{\mathop{}\!\mathrm{d}\Theta_{t}}

and

f_{\Theta_{t+\mathop{}\!\mathrm{d}t},\Theta_{t}}\left(\Theta_{t+\mathop{}\!% \mathrm{d}t},\Theta_{t}\right)=\frac{1}{2\pi\sqrt{1-\rho_{t+\mathop{}\!\mathrm% {d}t,t}^{2}}}\exp\left[\frac{-1}{2(1-\rho_{t+\mathop{}\!\mathrm{d}t,t}^{2})}% \left(\frac{\Theta_{t+\mathop{}\!\mathrm{d}t}}{\sigma_{t+\mathop{}\!\mathrm{d}% t}}-\rho_{t+\mathop{}\!\mathrm{d}t,t}\frac{\Theta_{t}}{\sigma_{t}}\right)^{2}% \right]\ ,

in which $\rho_{t+\mathop{}\!\mathrm{d}t,t}$ denotes the correlation coefficient between $\Theta_{t+\mathop{}\!\mathrm{d}t}$ and $\Theta_{t}$ . The estimation of the second moment has been written as a general formula, which can be solved by some mature statistical methods, such as the replica calculation [16].

By direct calculations, we can obtain the concerned kernel

K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}% \right)=\exp\left(\frac{(t^{\prime}-t)\leavevmode\nobreak\ |\lambda|}{\sigma^{% 2}}\right)\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}(\Theta_{t})}{% \partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}(\Theta_{t^{\prime}})}{% \partial\Theta_{t^{\prime}}}\right\rangle\ ,

K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}% \right)=\exp\left(\frac{(t^{\prime}-t)\leavevmode\nobreak\ |\lambda|}{\sqrt{1-% \rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)\mathbb{E}\left% \langle\frac{\partial\bm{h}^{(l)}(\Theta_{t})}{\partial\Theta_{t}},\frac{% \partial\bm{h}^{\prime(l)}(\Theta_{t^{\prime}})}{\partial\Theta_{t^{\prime}}}% \right\rangle\ ,

for $\sigma^{2}\neq\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}$ . Here, $\bm{s}^{(l-1)}$ and $\bm{s}^{\prime(l-1)}$ are variables led by $\Theta_{t}$ and $\Theta_{t^{\prime}}$ , respectively. Similar to the NNGP and NTK kernels, the unified kernel is also of a recursive form as follows:

	$\displaystyle K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=$	$\displaystyle\leavevmode\nobreak\ K_{\textrm{UNK}}^{(l-1)}\left(t,t^{\prime},\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\Big{|}_{\Theta_{t}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\Big{|}_{\Theta_{t^{\prime}}}\right\rangle$		(15)
		$\displaystyle+\exp\left(\frac{(t^{\prime}-t)\leavevmode\nobreak\ |\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t^{\prime}})\right)\ .$

Next, we will analyze the limiting properties of $K_{\textrm{UNK}}^{(l)}$ .

•

In the case of $\lambda=0$ , it is obvious that

\exp\left(\frac{(t^{\prime}-t)\leavevmode\nobreak\ |\lambda=0|}{\sqrt{1-\rho_{% t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)=1\ ,

and thus, Eq. (5) degenerates to the NTK kernel

K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)};% \lambda=0\right)=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1% )}\right)\ .

We provide another proof that originates from Eq. (4) with $\lambda=0$ in Appendix C.

•

In the case of $\lambda\neq 0$ and $t=t^{\prime}$ , one has

\exp\left(\frac{(t-t)\leavevmode\nobreak\ |\lambda|}{\sqrt{1-\rho_{t,t^{\prime% }}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)=1\ ,

and thus, Eq. (5) equals the NTK kernel

K_{\textrm{UNK}}^{(l)}\left(t,t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=K_{% \textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .

•

In the case of $\lambda\neq 0$ and $t-t^{\prime}\to\infty$ , we conjecture that

\lim\limits_{t-t^{\prime}\to\infty}K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .

According to Eq. (15), one has

	$\displaystyle\int_{t^{\prime}}^{t}K_{\textrm{UNK}}^{(l)}$	$\displaystyle\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)% \mathop{}\!\mathrm{d}t$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \int_{t^{\prime}}^{t}K_{\textrm{UNK}}^{(l-1)}\left(t,t^{\prime},\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\Big{|}_{\Theta_{t}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\Big{|}_{\Theta_{t^{\prime}}}\right\rangle\mathop{}\!\mathrm{d}t$
		$\displaystyle+\int_{t^{\prime}}^{t}\exp\left(\frac{(t^{\prime}-t)\leavevmode\nobreak\ |\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t^{\prime}})\right)\mathop{}\!\mathrm{d}t$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \int_{t^{\prime}}^{t}\left[\frac{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}{|\lambda|}\exp\left(\frac{(t^{\prime}-t)\leavevmode\nobreak\ |\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)\right]_{\partial t}K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t^{\prime}})\right)\mathop{}\!\mathrm{d}t$
		$\displaystyle+\int_{t^{\prime}}^{t}\exp\left(\frac{(t^{\prime}-t)\leavevmode\nobreak\ |\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)\left[\frac{\sigma^{2}}{|\lambda|}K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t^{\prime}})\right)\right]_{\partial t}\mathop{}\!\mathrm{d}t$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \frac{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}{|\lambda|}\int_{t^{\prime}}^{t}\left[\exp\left(\frac{(t^{\prime}-t)\leavevmode\nobreak\ |\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}}\right)K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t^{\prime}})\right)\right]_{\partial t}\mathop{}\!\mathrm{d}t\ ,$

where $[\cdot]_{\partial t}$ denotes the differential operation with respect to $t$ . Thus, for any $t^{\prime}$ , it is easy to prove that

\lim\limits_{t\to\infty}\int_{t^{\prime}}^{t}K_{\textrm{UNK}}^{(l)}\left(t,t^{% \prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\mathop{}\!\mathrm{d}t=\frac% {\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma^{2}}{|\lambda|}K_{\textrm{NNGP}}^{(l)}% \left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .

Here, we consider that the correlation coefficient $\rho_{t,t^{\prime}}$ is inversely proportional to $t-t^{\prime}$ since the correlation between variables becomes smaller as the stamp gap increases. Generally, we employ

\rho_{t,t^{\prime}}=\mathbf{\Theta}\left(\frac{1}{t-t^{\prime}}\right)\quad% \textrm{and}\quad\lim\limits_{t-t^{\prime}\to\infty}\frac{\rho_{t,t^{\prime}}}% {t-t^{\prime}}=C\in\mathbb{R}\ .

Thus, we can obtain

\lim\limits_{t-t^{\prime}\to\infty}K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},% \bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=K_{\textrm{NNGP}}^{(l)}\left(\bm{s}% ^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,

in which we omit the constant multiplier.

Under the mild assumption of $\sigma^{2}=\max_{t}\sigma_{t}^{2}=\min_{t}\sigma_{t}^{2}$ mentioned above, we can further simplify these conclusions using

\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}\to\sigma^{2}% \quad\textrm{as}\quad t-t^{\prime}\in\mathbb{R}^{+}\ .

This completes the proof. $\hfill\square$
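The three limiting cases above hinge on the exponential multiplier $\exp\left((t^{\prime}-t)|\lambda|/\sigma^{2}\right)$. The following is a minimal sketch of its behavior, assuming the simplified setting $\sqrt{1-\rho_{t,t^{\prime}}^{2}}\sigma_{t}\sigma_{t^{\prime}}=\sigma^{2}$ and treating the NNGP term as constant in $t$; the function names `time_factor` and `integrated_factor` are hypothetical and not part of the manuscript. The multiplier equals one when $\lambda=0$ or $t=t^{\prime}$, and its integral over $[t^{\prime},t]$ approaches $\sigma^{2}/|\lambda|$ as the gap grows, which is the constant multiplier omitted in the NNGP limit.

```python
import math

def time_factor(t, t_prime, lam, sigma2):
    """Multiplier exp((t' - t) * |lambda| / sigma^2) under the sigma^2 simplification."""
    return math.exp((t_prime - t) * abs(lam) / sigma2)

def integrated_factor(t, t_prime, lam, sigma2, steps=20_000):
    """Trapezoidal estimate of the integral of time_factor(tau, t') for tau in [t', t]."""
    h = (t - t_prime) / steps
    total = 0.5 * (time_factor(t_prime, t_prime, lam, sigma2)
                   + time_factor(t, t_prime, lam, sigma2))
    for k in range(1, steps):
        total += time_factor(t_prime + k * h, t_prime, lam, sigma2)
    return total * h

if __name__ == "__main__":
    print(time_factor(3.0, 1.0, 0.0, 2.0))   # lambda = 0 case
    print(time_factor(2.0, 2.0, 0.7, 2.0))   # t = t' case
    # Large gap: the integral is compared with sigma^2 / |lambda|.
    print(integrated_factor(200.0, 0.0, 0.5, 2.0), 2.0 / 0.5)
```

The closed form of the integral in this simplified setting is $(\sigma^{2}/|\lambda|)\left(1-\exp(-(t-t^{\prime})|\lambda|/\sigma^{2})\right)$, so the numerical estimate serves only as a cross-check.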

Appendix C For the case of $\lambda=0$

For the case of $\lambda=0$ , the update of $\Theta$ becomes

\Theta_{t+\mathop{}\!\mathrm{d}t}=\Theta_{t}-\frac{\mathop{}\!\mathrm{d}\hbar(% \Theta)}{\mathop{}\!\mathrm{d}\Theta}\Big{|}_{t}\ .

Here, we omit the learning rate for simplicity. For convenience, we abbreviate $\Theta(t)$ as $\Theta_{t}$ . It is observed that

\mathrm{Var}\left(\Theta_{t+\mathop{}\!\mathrm{d}t}\right)=\textrm{Var}\left(% \Theta_{t}-\nabla_{t}\right)=\mathbb{E}\left(\Theta_{t}-\nabla_{t}\right)^{2}-% \left[\mathbb{E}\left(\Theta_{t}-\nabla_{t}\right)\right]^{2}\ .

It is observed that $\mathrm{Var}(\Theta_{t+\mathop{}\!\mathrm{d}t})$ converges as $n\to\infty$ and $t\to\infty$ . Thus, the variable sequence $\{\mathrm{Var}(\Theta_{t})\}_{t}$ is bounded. Here, we define that

\mathrm{Var}(\Theta_{t})\leq\sigma_{t}^{2}\quad\text{and}\quad\sigma^{2}=\max_% {t}\leavevmode\nobreak\ \sigma_{t}^{2}\ .

Let $f_{\Theta_{t}}(\cdot)$ denote the probability density function of $\Theta(t)$ . Thus, we have

f_{\Theta_{t+\mathop{}\!\mathrm{d}t}}(u)=\iint\delta(v)f_{\Theta_{t}}(x)f_{\nabla_{t}}(y)\mathop{}\!\mathrm{d}x\!\mathop{}\!\mathrm{d}y

with

\left\{\leavevmode\nobreak\ \begin{aligned} f_{\Theta_{t}}(x)&=\frac{1}{\sigma% _{x}\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2\sigma_{x}^{2}}\right)\\ f_{\nabla_{t}}(y)&=\frac{1}{\sigma_{y}\sqrt{2\pi}}\exp\left(-\frac{y^{2}}{2% \sigma_{y}^{2}}\right)\\ \end{aligned}\right.

where $v=u-x+y$ , $\nabla_{t}={\mathop{}\!\mathrm{d}\hbar(\Theta_{t})}/{\mathop{}\!\mathrm{d}% \Theta_{t}}$ , and $\delta(\cdot)$ indicates the Dirac-delta function. Thus, it is feasible to conjecture that $\Theta_{t+\mathop{}\!\mathrm{d}t}$ obeys the Gaussian distribution with zero mean. We define $\Theta_{t+\mathop{}\!\mathrm{d}t}\sim\mathcal{N}(0,\sigma_{u}^{2})$ .

Thus, the second moment in $\mathcal{M}_{\bm{s}}(\cdot)$ becomes

m_{2}(s)=\int s^{2}\leavevmode\nobreak\ f(s)\mathop{}\!\mathrm{d}s=\int s^{2}(% \Theta)\leavevmode\nobreak\ \frac{1}{\sigma_{u}\sqrt{2\pi}}\exp\left(-\frac{s^% {2}(\Theta)}{2\sigma_{u}^{2}}\right)\frac{\mathop{}\!\mathrm{d}s(\Theta)}{% \mathop{}\!\mathrm{d}\Theta}\mathop{}\!\mathrm{d}\Theta\ ,

where $s=\bm{s}^{(l)}_{i}$ for $l\in[L]$ and $i\in[n_{l}]$ . Based on the above equations, we can obtain the concerned kernel

K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)};% \lambda=0\right)=\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}(\Theta_{t})}% {\partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}(\Theta_{t^{\prime}})}{% \partial\Theta_{t^{\prime}}}\right\rangle\ ,

which coincides with the theory of NTK and our proposed unified kernel. $\hfill\square$

Appendix D Uniform Tightness of $K_{\textrm{UNK}}^{(l)}$

Lemma 4.4 can be straightforwardly derived from the Kolmogorov Continuity Theorem [25], given the Polish space $(\mathbb{R},|\cdot|)$ .

D.1 Full Proof of Lemma 4.5

It suffices to prove that

1)

$\bm{x}=\bm{0}$ is a tight point of $\bm{s}_{t}(\bm{x})$ ( $t\in[T]$ ) in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$ . This claim is self-evident since every probability measure in $(\mathbb{R},|\cdot|)$ is tight [29].
2)

The statistic $(\bm{s}_{1}(\bm{0})+\dots+\bm{s}_{t}(\bm{0}))/t$ converges in distribution as $t\to\infty$ . This claim is proved by Theorem 2.

Therefore, we finish the proof of this lemma. $\hfill\square$

D.2 Full Proof of Lemma 4.6

This proof follows mathematical induction. Before that, we show the following preliminary result. Let $\theta$ be one element of the augmented matrix $(\mathbf{W}^{(l)},\bm{b}^{(l)})$ at the $l$ -th layer, then we can formulate its characteristic function as

\varphi(t)=\mathbb{E}\left[\mathop{}\!\mathrm{e}^{\mathrm{i}\theta t}\right]=% \mathop{}\!\mathrm{e}^{-\eta^{2}t^{2}/2}\quad\text{with}\quad\theta\sim% \mathcal{N}(0,\eta^{2})\ ,

where $\mathrm{i}$ denotes the imaginary unit with $\mathrm{i}=\sqrt{-1}$ . Thus, the variance of hidden random variables at the $l$ -th layer becomes

\sigma^{2}_{l}=\eta^{2}\left[1+\frac{1}{n_{l}}\big{\|}\phi\circ\bm{s}^{(l-1)}\big{\|}\right]\ .

(16)
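The Gaussian characteristic function used in this preliminary result can be checked by simulation. The following is a minimal sketch, assuming only that $\theta\sim\mathcal{N}(0,\eta^{2})$ as stated; the function names and the sample size are illustrative and not taken from the manuscript.

```python
import numpy as np

def empirical_char_fn(eta, t, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[exp(i * theta * t)] for theta ~ N(0, eta^2)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, eta, n_samples)
    return complex(np.mean(np.exp(1j * theta * t)))

def gaussian_char_fn(eta, t):
    """Closed form exp(-eta^2 * t^2 / 2) of the zero-mean Gaussian characteristic function."""
    return float(np.exp(-(eta ** 2) * (t ** 2) / 2.0))

if __name__ == "__main__":
    print(empirical_char_fn(0.8, 1.5), gaussian_char_fn(0.8, 1.5))
```

The imaginary part of the estimate should be close to zero, reflecting the symmetry of the zero-mean Gaussian.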

Next, we provide two useful definitions from [28].

Definition D.8

A function $\phi:\mathbb{R}\to\mathbb{R}$ is said to be well-posed, if $\phi$ is first-order differentiable, and its derivative is bounded by a certain constant $C_{\phi}$ . In particular, the commonly used activation functions like ReLU, tanh, and sigmoid are well-posed (see Table 2).

Table 2: Well-posedness of the commonly-used activation functions.

Activations $\phi$	Well-Posedness
ReLU	$\|\phi^{\prime}(\bm{x})\|\leq 1$
$\tanh$	$\|\phi^{\prime}(\bm{x})\|=\|1-\phi^{2}(\bm{x})\|\leq 1$
sigmoid	$\|\phi^{\prime}(\bm{x})\|=\|\phi(\bm{x})(1-\phi(\bm{x}))\|\leq 0.25$
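The derivative bounds in Table 2 can be verified on a grid. The following is a minimal sketch, assuming the usual elementwise definitions of the three activations and taking the ReLU derivative at zero to be zero; the helper names and the grid range are illustrative choices.

```python
import numpy as np

def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0, an assumption)."""
    return (x > 0).astype(float)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def max_abs_derivative(grad_fn, lo=-20.0, hi=20.0, num=40_001):
    """Largest absolute derivative on a uniform grid over [lo, hi]."""
    x = np.linspace(lo, hi, num)
    return float(np.max(np.abs(grad_fn(x))))

if __name__ == "__main__":
    for name, fn in [("ReLU", relu_grad), ("tanh", tanh_grad), ("sigmoid", sigmoid_grad)]:
        print(name, max_abs_derivative(fn))
```

The grid maxima give an empirical counterpart of the constant $C_{\phi}$ in Definition D.8 for each activation.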

Definition D.9

A matrix $\mathbf{W}$ is said to be stable-pertinent for a well-posed activation function $\phi$ , in short $\mathbf{W}\in SP(\phi)$ , if the inequality $C_{\phi}\|\mathbf{W}\|<1$ holds.
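Definition D.9 can be turned into a simple numerical check. The following is a minimal sketch, assuming that $\|\mathbf{W}\|$ denotes the spectral norm (the definition does not fix a particular norm); the function name `is_stable_pertinent` is hypothetical.

```python
import numpy as np

def is_stable_pertinent(W, c_phi):
    """Return True if c_phi * ||W|| < 1, with ||.|| taken as the spectral norm (assumption)."""
    spectral_norm = np.linalg.norm(W, 2)  # largest singular value of W
    return bool(c_phi * spectral_norm < 1.0)

if __name__ == "__main__":
    W_small = 0.5 * np.eye(3)
    W_large = 2.0 * np.eye(3)
    print(is_stable_pertinent(W_small, 1.0))    # tanh/ReLU-type bound C_phi = 1
    print(is_stable_pertinent(W_large, 1.0))
    print(is_stable_pertinent(W_large, 0.25))   # sigmoid-type bound C_phi = 0.25
```

A smaller derivative bound $C_{\phi}$, as for the sigmoid, admits weight matrices with a larger norm.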

Since the activation $\phi$ is a well-posed function and $(\mathbf{W}^{(l)},\bm{b}^{(l)})\in SP(\phi)$ , we affirm that $\phi$ is Lipschitz continuous (with Lipschitz constant $L_{\phi}$ ). Now, we start the mathematical induction. When $t=1$ , for any $\bm{x},\bm{x}^{\prime}\in\mathbb{R}^{n_{0}}$ , we have

\mathbb{E}\left[\leavevmode\nobreak\ \|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}^{% \prime})\|_{\alpha}^{\textrm{sup}}\leavevmode\nobreak\ \right]\leq C_{\eta,% \theta,\alpha}\|\bm{x}-\bm{x}^{\prime}\|_{\alpha}\ ,

where $C_{\eta,\theta,\alpha}=\eta^{\alpha}\leavevmode\nobreak\ \mathbb{E}[|\mathcal{% N}(0,1)|^{\alpha}]$ . Per mathematical induction, for $t\geq 1$ , we have

\mathbb{E}\left[\leavevmode\nobreak\ \|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}^{% \prime})\|_{\alpha}^{\textrm{sup}}\leavevmode\nobreak\ \right]\leq C_{\eta,% \theta,\alpha}\|\bm{x}-\bm{x}^{\prime}\|_{\alpha}\ .

Thus, one has

\mathbb{E}\left[\leavevmode\nobreak\ \|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}^{% \prime})\|_{\alpha}^{\textrm{sup}}\leavevmode\nobreak\ \right]\leq\frac{(C_{% \phi})^{\alpha}}{n_{l}}\leavevmode\nobreak\ \mathbb{E}[\leavevmode\nobreak\ |% \mathcal{N}(0,1)|^{\alpha}\leavevmode\nobreak\ ]\leavevmode\nobreak\ \Big{\|}% \bm{s}_{t-1}(\bm{x})-\bm{s}_{t-1}(\bm{x}^{\prime})\Big{\|}_{\alpha}\ ,

(17)

where

	$\displaystyle C_{\phi}$	$\displaystyle=\sigma^{2}_{0}(\bm{x})-2\Sigma_{\bm{x},\bm{x}^{\prime}}+\sigma^{% 2}_{0}(\bm{x}^{\prime})$
		$\displaystyle=\frac{\eta^{2}}{n_{l}}\leavevmode\nobreak\ \Big{\|}\phi\circ\bm{s}_{t-1}(\bm{x})-\phi\circ\bm{s}_{t-1}(\bm{x}^{\prime})\Big{\|}_{2}\qquad\text{(from Eq. (16))}$
		$\displaystyle\leq\frac{\eta^{2}L_{\phi}^{2}}{n_{l}}\leavevmode\nobreak\ \big{\|}\bm{s}_{t-1}(\bm{x})-\bm{s}_{t-1}(\bm{x}^{\prime})\big{\|}_{2}\ .$

Thus, Eq. (17) becomes

\mathbb{E}\left[\leavevmode\nobreak\ \|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}^{% \prime})\|_{\alpha}^{\textrm{sup}}\leavevmode\nobreak\ \right]\leq C^{\prime}_% {\eta,\theta,\alpha}\|\bm{x}-\bm{x}^{\prime}\|_{\alpha}\ ,

where

C_{\eta,\theta,\alpha}^{\prime}=\frac{(\eta L_{\phi})^{\alpha}}{n_{l}}\big{\|}% \bm{s}_{t-1}(\bm{x})-\bm{s}_{t-1}(\bm{x}^{\prime})\big{\|}_{\alpha}\leavevmode% \nobreak\ \mathbb{E}[\leavevmode\nobreak\ |\mathcal{N}(0,1)|^{\alpha}% \leavevmode\nobreak\ ]\ .

Iterating this argument, we obtain

\mathbb{E}\left[\leavevmode\nobreak\ \|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}^{% \prime})\|_{\alpha}^{\textrm{sup}}\leavevmode\nobreak\ \right]\leq C_{\eta,% \theta,\alpha}\|\bm{x}-\bm{x}^{\prime}\|_{\alpha}\ ,

where

C_{\eta,\theta,\alpha}=\eta^{\alpha(t+1)}L_{\phi}^{\alpha t}\leavevmode% \nobreak\ \mathbb{E}[\leavevmode\nobreak\ |\mathcal{N}(0,1)|^{\alpha}% \leavevmode\nobreak\ ]^{t+1}\ .

The above induction holds for any positive even $\alpha$ . Let $\beta=\alpha-n_{0}>0$ , then this lemma is proved as desired. $\hfill\square$

Appendix E Tight Bound for Convergence

We begin this proof with the following lemmas.

Lemma E.10

Let $f:\mathbb{R}^{n_{0}}\to\mathbb{R}$ be a Lipschitz continuous function with constant $C_{n_{0}}$ and $P_{X}$ denote the Gaussian distribution $\mathcal{N}(0,\eta^{2})$ , then for $\forall\leavevmode\nobreak\ \delta>0$ , there exists $c>0$ , s.t.

\mathbb{P}\left(\left|f(\bm{x})-\int f\left(\bm{x}^{\prime}\right)\mathop{}\!% \mathrm{d}P_{X}\left(\bm{x}^{\prime}\right)\right|>\delta\right)\leq 2\mathop{% }\!\mathrm{e}^{\frac{-c\delta^{2}}{C_{n_{0}}^{2}}}\ .

(18)

Lemma E.10 shows that the Gaussian distribution corresponding to our samples satisfies the log-Sobolev inequality, i.e., Eq. (18), with some constants unrelated to dimension $n_{0}$ . This result also holds for the uniform distributions on the sphere or unit hypercube [18].

Lemma E.11

Suppose that $\bm{x}_{1},\dots,\bm{x}_{N}$ are i.i.d. sampled from $\mathcal{N}(0,\eta^{2})$ , then with probability $1-\delta>0$ , we have

\|\bm{x}_{i}\|_{2}=\mathbf{\Theta}(\sqrt{n_{0}})\quad\text{and}\quad|\langle% \bm{x}_{i},\bm{x}_{j}\rangle|^{r}\leq n_{0}N^{-1/(r-0.5)}\ ,

for $i\neq j$ , where

\delta\leq N\mathop{}\!\mathrm{e}^{-\Omega(n_{0})}+N^{2}\mathop{}\!\mathrm{e}^% {-\Omega\left(n_{0}N^{-2/(r-0.5)}\right)}\ .

From Definition 1 of the manuscript, we have

\int\|\bm{x}\|_{2}^{2}\mathop{}\!\mathrm{d}P_{X}(\bm{x})=\mathbf{\Theta}(n_{0}% )\ .

Since $\bm{x}_{1},\dots,\bm{x}_{N}$ are i.i.d. sampled from $P_{X}=\mathcal{N}(0,\eta^{2})$ , for $\forall$ $i\in[N]$ , we have $\|\bm{x}_{i}\|_{2}^{2}=\mathbf{\Theta}(n_{0})$ with probability at least $1-N\mathop{}\!\mathrm{e}^{-\Omega(n_{0})}$ . Provided $\bm{x}_{i}$ , the single-sided inner product $\langle\bm{x}_{i},\cdot\rangle$ is Lipschitz continuous with the constant $C_{n_{0}}=\mathcal{O}(\sqrt{n_{0}})$ . As such, from Lemma E.10, for $\forall\leavevmode\nobreak\ j\neq i$ , we have

\mathbb{P}\left(|\langle\bm{x}_{i},\bm{x}_{j}\rangle|>\delta^{*}\right)\leq 2\mathop{}\!\mathrm{e}^{-{\delta^{*}}^{2}/C_{n_{0}}^{2}}\ .

Then, for $r\geq 2$ , we have

\mathbb{P}\left(\max_{j\neq i}|\langle\bm{x}_{i},\bm{x}_{j}\rangle|^{r}>\delta% ^{*}\right)\leq N^{2}\mathop{}\!\mathrm{e}^{-\Omega\left({\delta^{*}}^{2}% \right)}.

We complete the proof by setting $\delta^{*}\leq n_{0}N^{-1/(r-0.5)}$ . $\hfill\square$
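The qualitative content of Lemma E.11 can be illustrated by simulation. The following is a minimal sketch, assuming isotropic Gaussian samples with one sample per row; it only reports the scale of the norms and of the pairwise inner products and does not verify the specific rate $n_{0}N^{-1/(r-0.5)}$. The function name `sample_geometry` and the chosen sizes are illustrative.

```python
import numpy as np

def sample_geometry(n0, N, eta=1.0, seed=0):
    """Draw N samples from N(0, eta^2 I_{n0}) and summarize their geometry.

    Returns the norms divided by sqrt(n0) (expected to be close to eta) and
    the largest off-diagonal |<x_i, x_j>| divided by n0 (expected to be small).
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, eta, size=(N, n0))
    norm_ratios = np.linalg.norm(X, axis=1) / np.sqrt(n0)
    gram = X @ X.T
    off_diag_mask = ~np.eye(N, dtype=bool)
    max_off_ratio = float(np.abs(gram[off_diag_mask]).max() / n0)
    return norm_ratios, max_off_ratio

if __name__ == "__main__":
    ratios, max_off = sample_geometry(n0=4096, N=50)
    print(ratios.min(), ratios.max(), max_off)
```

For large $n_{0}$, the norm ratios concentrate around $\eta$ while the normalized inner products remain much smaller than one, which is the near-orthogonality exploited in the eigenvalue bounds below.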

E.1 Full proof of Theorem 4.7

We start this proof with some notations. For convenience, we set $n=|\bm{s}^{(1)}|_{\#}=|\bm{s}^{(2)}|_{\#}=\dots=|\bm{s}^{(L)}|_{\#}$ , or equivalently, $n=n_{1}=\dots=n_{L}$ . We also abbreviate the covariance $\mathrm{Cov}(\bm{s}^{(l)},\bm{s}^{(l)})$ as $\mathbf{C}_{l}$ throughout this proof.

Unfolding the $K_{\textrm{UNK}}^{(l)}$ kernel equation that omits the epoch stamp

K_{\textrm{UNK}}^{(l)}(\bm{x}_{i},\bm{x}_{j})=\mathbb{E}[\langle f(\bm{x}_{i};% \bm{\theta}),f(\bm{x}_{j};\bm{\theta})\rangle],\quad\text{for}\quad\bm{x}_{i},% \bm{x}_{j}\in\mathcal{D}\ ,

(19)

we have

K_{\textrm{UNK}}^{(l)}(\bm{x}_{i},\bm{x}_{j})=\frac{1}{M_{\bm{z}}}\left[\sum_{l}\varphi_{l}+\sum_{l_{1}\neq l_{2}}\psi_{l_{1}l_{2}}\right]\ ,

(20)

where

\left\{\begin{aligned} &\varphi_{l}=\mathbb{E}\left[\langle\bm{s}^{(l)},\bm{s}^{(l)}\rangle\right]\ ,\\ &\psi_{l_{1}l_{2}}=\sum\nolimits_{p,q}\mathbb{E}\left[\bm{s}_{p}^{(l_{1})}\bm{s}_{q}^{(l_{2})}\right],\quad\text{for}\quad l_{1}\neq l_{2}\ ,\end{aligned}\right.

in which the subscript $p$ indicates the $p$ -th element of vector $\bm{s}^{(l)}$ . From Theorem 1 of the manuscript, the sequence of random variables $\bm{s}^{(l)}$ is weakly dependent with $\beta(t)\to 0$ as $t\to\infty$ . Thus, $\psi_{l_{1}l_{2}}$ is an infinitesimal with respect to $|l_{2}-l_{1}|$ when $l_{1}\neq l_{2}$ .

Invoking the following equations

\left\{\leavevmode\nobreak\ \begin{aligned} &\chi_{\min}(\mathbf{P}\mathbf{Q})% \geq\chi_{\min}(\mathbf{P})\min_{i\in[m]}\mathbf{Q}(i,i)\\ &\chi_{\min}(\mathbf{P}+\mathbf{Q})\geq\chi_{\min}(\mathbf{P})+\chi_{\min}(% \mathbf{Q})\end{aligned}\right.

into Eq. (20), we have

\chi_{\min}(K_{\textrm{UNK}}^{(l)})\geq\sum\nolimits_{l}\chi_{\min}\left(% \mathbf{C}_{l}\right)\ ,

(21)

and

\chi_{\min}\left(\mathbf{C}_{l}\right)\geq\chi_{\min}\left(\mathbf{C}_{l-1}\right),\quad\text{for}\quad l\in[L]\ .

(22)

Iterating Eq. (22) and then invoking it into Eq. (21), we have

\chi_{\min}\left(K_{\textrm{UNK}}^{(l)}\right)\geq\sum_{l}\chi_{\min}\left(% \mathbf{C}_{1}\right)\ .

(23)
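The two eigenvalue inequalities invoked above can be checked numerically. The following is a minimal sketch, assuming that $\mathbf{P}$ and $\mathbf{Q}$ are symmetric positive semi-definite and that $\mathbf{P}\mathbf{Q}$ denotes the elementwise (Hadamard) product; this reading of the product is an assumption, and the helper names are illustrative.

```python
import numpy as np

def min_eig(A):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh returns ascending order)."""
    return float(np.linalg.eigvalsh(A)[0])

def eigen_inequality_slacks(m=6, seed=0):
    """Slacks (left-hand side minus right-hand side) of the two inequalities on random PSD matrices."""
    rng = np.random.default_rng(seed)
    B1 = rng.normal(size=(m, m))
    B2 = rng.normal(size=(m, m))
    P = B1 @ B1.T  # symmetric positive semi-definite by construction
    Q = B2 @ B2.T
    # chi_min(P + Q) >= chi_min(P) + chi_min(Q)  (Weyl's inequality)
    sum_slack = min_eig(P + Q) - (min_eig(P) + min_eig(Q))
    # chi_min(P o Q) >= chi_min(P) * min_i Q(i, i), with "o" read as Hadamard product
    hadamard_slack = min_eig(P * Q) - min_eig(P) * float(np.min(np.diag(Q)))
    return sum_slack, hadamard_slack

if __name__ == "__main__":
    for seed in range(3):
        print(eigen_inequality_slacks(seed=seed))
```

Both slacks should be non-negative up to floating-point error for every seed.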

From the Hermite expansion [29] of the ReLU function, we have

\mu_{r}(\psi)=(-1)^{\frac{r-2}{2}}(r-3)!!/\sqrt{2\pi r!}\ ,

(24)

where $r\geq 2$ indicates the expansion order. Thus, we have

$\displaystyle\chi_{\min}\left(\mathbf{C}_{1}\right)$	$\displaystyle=\chi_{\min}\left(\psi(\mathbf{W}^{(1)}\mathbf{X})\psi(\mathbf{W}% ^{(1)}\mathbf{X})^{\top}\right)$	(25)
	$\displaystyle\geq\mu_{r}(\psi)^{2}\chi_{\min}\left(\mathbf{X}^{(r)}\left(\mathbf{X}^{(r)}\right)^{\top}\right)$
	$\displaystyle\geq\mu_{r}(\psi)^{2}\left(\min_{i\in[N]}\|\bm{x}_{i}\|_{2}^{2r}-(N-1)\max_{j\neq i}|\langle\bm{x}_{i},\bm{x}_{j}\rangle|^{r}\right)$
	$\displaystyle\geq\mu_{r}(\psi)^{2}\leavevmode\nobreak\ \Omega(n_{0})\ ,$

where the superscript $(r)$ denotes the $r$ -th Khatri-Rao power of the matrix $\mathbf{X}=[\bm{x}_{1},\dots,\bm{x}_{N}]$ , the first inequality follows from Eq. (24), the second one follows from the Gershgorin Circle Theorem [24], and the third one follows from Lemma E.11. Therefore, we can obtain the lower bound of the smallest eigenvalue by plugging Eq. (25) into Eq. (23).
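The Gershgorin step in Eq. (25) can be illustrated on a small example. The following is a minimal sketch, assuming one sample per row of `X` and using the fact that the Gram matrix of the $r$ -th Khatri-Rao power has entries $\langle\bm{x}_{i},\bm{x}_{j}\rangle^{r}$; the function name and the test sizes are illustrative and not taken from the manuscript.

```python
import numpy as np

def gershgorin_lower_bound(X, r):
    """Compare the smallest eigenvalue of G_ij = <x_i, x_j>^r with Gershgorin-type bounds.

    Returns (lambda_min, row_bound, coarse_bound), where
      row_bound    = min_i ( G_ii - sum_{j != i} |G_ij| ),
      coarse_bound = min_i G_ii - (N - 1) * max_{i != j} |G_ij|.
    """
    gram_r = (X @ X.T) ** r          # elementwise r-th power of the Gram matrix
    diag = np.diag(gram_r)
    off = np.abs(gram_r - np.diag(diag))
    lambda_min = float(np.linalg.eigvalsh(gram_r)[0])
    row_bound = float(np.min(diag - off.sum(axis=1)))
    coarse_bound = float(diag.min() - (X.shape[0] - 1) * off.max())
    return lambda_min, row_bound, coarse_bound

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(8, 64))
    print(gershgorin_lower_bound(X, r=2))
```

The ordering `lambda_min >= row_bound >= coarse_bound` always holds; the bounds are informative only when the samples are nearly orthogonal, as guaranteed with high probability by Lemma E.11.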

On the other hand, it is observed from Lemma 4.4 that for $l\in[L]$ ,

\left\{\leavevmode\nobreak\ \begin{aligned} &\|\bm{s}_{p}^{(l)}\|^{2}_{2}=\mathbb{E}_{\mathbf{W}^{(l)}_{p}}\left[\psi(\mathbf{W}^{(l)}_{p}\bm{s}^{(l-1)})^{2}\right]=\|\bm{s}_{q}^{(l)}\|^{2}_{2},\quad\text{for}\quad\forall q\neq p,\\ &\|\bm{s}^{(l)}\|_{2}^{2}=\mathbb{E}_{\mathbf{W}^{(l)}}\left[\psi(\mathbf{W}^{(l)}\bm{s}^{(l-1)})^{2}\right]\leq\|\bm{s}^{(l-1)}\|_{2}^{2}\ .\end{aligned}\right.

(26)

Thus, we have

	$\displaystyle\chi_{\min}(K_{\textrm{UNK}}^{(l)})$	$\displaystyle\leq\frac{\mathop{}\!\mathrm{tr}(K_{\textrm{UNK}}^{(l)})}{N}=% \frac{1}{N}\sum_{i}^{N}K_{\textrm{UNK}}^{(l)}(\bm{x}_{i},\bm{x}_{i})$
		$\displaystyle\leq\frac{1}{N}\sum_{i}^{N}\frac{1}{M_{\bm{z}}}\left[\sum_{l}% \varphi_{l}+\sum_{l_{1}\neq l_{2}}\psi_{l_{1}l_{2}}\right]$
		$\displaystyle\leq\frac{1}{N}\sum_{i}^{N}\left(\frac{1}{l}\sum_{l}\max_{j\in[N]}\|\bm{x}_{j}\|_{2}^{2}+\Omega(n_{0})\right)$
		$\displaystyle\leq\mathbf{\Theta}(n_{0})\ ,$

where the second inequality follows from Eq. (20), the third one follows from Eq. (26), and the fourth one holds from Lemma E.11. This completes the proof. $\hfill\square$

Appendix F Supplementary Experimental Results

This section provides the detailed experimental results. Table 3 lists the optimal trajectory and the corresponding testing accuracy of Grid 0.001 and Grid 0.01 over the epoch. Figure 3 draws the training correlation histograms and the averages for our proposed UNK kernel with the grid search granularity of $\{0.001,0.01,0.1,0,1,10\}$ . Figure 4 draws the testing correlation histograms and the averages for our proposed UNK kernel with the grid search granularity of $\{0.001,0.01,0.1,0,1,10\}$ .

Epoch	Baseline		Grid 0.001			Grid 0.01
$t$	Testing ACC.	Training ACC.	$\lambda^{*}_{t}$	Testing ACC.	Training ACC.	$\lambda^{*}_{t}$	Testing ACC.	Training ACC.
1	0.1325	0.1289	0.0100	0.9287	0.9257	0.0800	0.9291	0.9266
2	0.9284	0.9256	0.0020	0.9515	0.9506	0.0800	0.9527	0.9521
3	0.9514	0.9504	0.0040	0.9607	0.9631	0.0900	0.9631	0.9656
4	0.9603	0.9629	0.0080	0.9665	0.9708	0.0700	0.9693	0.9737
5	0.9658	0.9705	0.0070	0.9709	0.9766	0.0900	0.9729	0.9793
6	0.9705	0.9763	0.0050	0.9738	0.9802	0.1000	0.9757	0.9839
7	0.9733	0.9800	0.0060	0.9756	0.9834	0.1000	0.9785	0.9870
8	0.9753	0.9831	0.0000	0.9772	0.9858	0.0800	0.9795	0.9899
9	0.9769	0.9855	0.0080	0.9789	0.9879	0.0500	0.9805	0.9922
10	0.9788	0.9875	0.0000	0.9798	0.9898	0.0900	0.9818	0.9939
11	0.9800	0.9896	0.0000	0.9809	0.9913	0.0600	0.9826	0.9952
12	0.9809	0.9910	0.0000	0.9814	0.9923	0.0600	0.9833	0.9963
13	0.9813	0.9922	0.0040	0.9814	0.9933	0.0700	0.9833	0.9971
14	0.9814	0.9931	0.0020	0.9815	0.9943	0.0800	0.9837	0.9977
15	0.9815	0.9941	0.0020	0.9815	0.9952	0.0500	0.9841	0.9984
16	0.9814	0.9949	0.0080	0.9819	0.9959	0.0700	0.9848	0.9987
17	0.9816	0.9957	0.0060	0.9824	0.9966	0.0900	0.9847	0.9992
18	0.9818	0.9963	0.0070	0.9827	0.9972	0.0700	0.9851	0.9995
19	0.9825	0.9969	0.0070	0.9830	0.9977	0.0000	0.9850	0.9996
20	0.9824	0.9974	0.0100	0.9833	0.9981	0.0800	0.9857	0.9998
21	0.9831	0.9978	0.0070	0.9834	0.9984	0.0100	0.9847	0.9997
22	0.9830	0.9982	0.0100	0.9838	0.9986	0.0200	0.9850	0.9999
23	0.9831	0.9984	0.0050	0.9835	0.9987	0.0000	0.9847	0.9999
24	0.9834	0.9986	0.0000	0.9836	0.9989	0.0000	0.9843	0.9999
25	0.9835	0.9988	0.0050	0.9830	0.9990	0.0000	0.9848	0.9999
26	0.9837	0.9989	0.0030	0.9838	0.9992	0.0000	0.9845	1.0000
27	0.9834	0.9990	0.0000	0.9834	0.9992	0.0000	0.9852	1.0000
28	0.9833	0.9991	0.0000	0.9839	0.9994	0.0000	0.9848	1.0000
29	0.9834	0.9993	0.0000	0.9834	0.9994	0.0000	0.9848	1.0000
30	0.9836	0.9993	0.0020	0.9838	0.9995	0.0000	0.9850	1.0000

Table 3: Illustration of $\lambda^{*}_{t}$ and the corresponding (both training and testing) accuracy (ACC.) of Grid 0.001 and Grid 0.01 over epoch $t$ .
