A Unified Kernel for Neural Network Learning

Shao-Qun Zhang (a, b; corresponding author, email: [email protected]), Zong-Yi Chen (b), Yong-Ming Tian (b), Xun Lu (b). Other authors made equal contributions.
a National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
b School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
(May 2, 2024)
Abstract

Past decades have witnessed great interest in the distinction and connection between neural network learning and kernel learning. Recent work has made theoretical progress in connecting infinitely wide neural networks and Gaussian processes. Two predominant approaches have emerged: the Neural Network Gaussian Process (NNGP) and the Neural Tangent Kernel (NTK). The former, rooted in Bayesian inference, represents a zero-order kernel, while the latter, grounded in the tangent space of gradient descent, is a first-order kernel. In this paper, we present the Unified Neural Kernel (UNK), which characterizes the learning dynamics of neural networks under gradient descent and parameter initialization. The proposed UNK kernel maintains the limiting properties of both NNGP and NTK, exhibiting behaviors akin to NTK with a finite learning step and converging to NNGP as the learning step approaches infinity. We also theoretically characterize the uniform tightness and learning convergence of the UNK kernel, providing comprehensive insights into this unified kernel. Experimental results underscore the effectiveness of our proposed method.

Keywords: Neural Network Learning, Unified Neural Kernel, Neural Network Gaussian Process, Neural Tangent Kernel, Gradient Descent, Uniform Tightness, Convergence, Optimal Trajectory

1 Introduction

While neural network learning is successful in a number of applications, it is not yet well understood theoretically (poggio2020theoretical). Recently, a growing body of literature has explored the correspondence between infinitely wide neural networks and Gaussian processes (neal1996:GP). Researchers have identified an equivalence between the two in various architectures (garriga2019:GP; novak2018:GP; yang2019:GP). This equivalence facilitates precise approximations of the behavior of infinitely wide Bayesian neural networks without resorting to variational inference. Relatedly, it also allows for the characterization of the distribution of randomly initialized neural networks optimized by gradient descent, eliminating the need to actually run an optimizer for such analyses.

The standard investigation in this field encompasses the Neural Network Gaussian Process (NNGP) (lee2018:NNGP), which establishes that a neural network converges statistically to a Gaussian process as its width approaches infinity. The NNGP kernel inherently induces a posterior distribution that aligns with the feed-forward inference of infinitely wide Bayesian neural networks employing an i.i.d. Gaussian prior. Another typical work is the Neural Tangent Kernel (NTK) (jacot2018:NTK), where the function of a neural network trained through gradient descent converges to the kernel gradient of the functional cost as the width of the neural network tends to infinity. The NTK kernel captures the learning dynamics wherein learned parameters are closely tied to their initialization, resembling an i.i.d. Gaussian prior. These two kernels, derived from neural networks, exhibit distinct characteristics based on different initializations and regularization. A notable contrast lies in the fact that the NNGP, rooted in Bayesian inference, represents a zero-order kernel that is more suitable for describing the overall characteristics of neural network learning. In contrast, the NTK, rooted in the tangent space of gradient descent, is a first-order kernel that is adept at capturing local characteristics of neural network learning. Empirical evidence provided by Lee et al. (lee2020finite) demonstrates the divergent generalization performances of these two kernels across various datasets.

In this paper, we unify the NNGP and NTK kernels and present the Unified Neural Kernel (UNK) as a cohesive framework for neural network learning. By leveraging the learning dynamics associated with gradient descent and parameter initialization, we provide theoretical characterizations, including but not limited to the existence, limiting properties, uniform tightness, and learning convergence of the proposed UNK kernel. Our theoretical investigations reveal that the UNK kernel exhibits behaviors reminiscent of the NTK kernel with a finite learning step and converges to the NNGP kernel as the learning step approaches infinity. This contribution extends the existing theory connecting kernel learning and neural network learning and represents a step toward a better understanding of deep learning.

Our main contributions can be summarized as follows:

  • We propose the UNK kernel, built upon the learning dynamics associated with gradient descent and parameter initialization, which unifies the limiting properties of both the NTK and NNGP kernels.

  • We theoretically investigate the asymptotic behaviors of the proposed UNK kernel, in which the UNK kernel is uniformly tight on the space of continuous functions and maintains a tight bound for the smallest eigenvalue.

  • We conduct experiments on benchmark datasets using various configurations. The numerical results further underscore the effectiveness of our proposed method.

The rest of this paper is organized as follows. Section 2 introduces useful notations, terminologies, and related studies. Section 3 presents the UNK kernel with in-depth discussions and proof sketches. Section 4 shows the uniform tightness and convergence of the UNK kernel. Section 5 conducts numerical experiments. Section 6 concludes our work.

2 Preliminary

This section will introduce useful notations, terminologies, and related studies.

2.1 Notations

Let $[N]=\{1,2,\dots,N\}$ be an integer set for $N\in\mathbb{N}^{+}$, and let $|\cdot|_{\#}$ denote the number of elements in a collection, e.g., $|[N]|_{\#}=N$. Given two functions $g,h\colon\mathbb{N}^{+}\rightarrow\mathbb{R}$, we write $h=\mathbf{\Theta}(g)$ if there exist positive constants $c_{1},c_{2}$, and $n_{0}$ such that $c_{1}g(n)\leq h(n)\leq c_{2}g(n)$ for every $n\geq n_{0}$; $h=\mathcal{O}(g)$ if there exist positive constants $c$ and $n_{0}$ such that $h(n)\leq cg(n)$ for every $n\geq n_{0}$; $h=\Omega(g)$ if there exist positive constants $c$ and $n_{0}$ such that $h(n)\geq cg(n)$ for every $n\geq n_{0}$. We define the closed ball $\mathcal{B}(r)=\{\bm{x}\mid\|\bm{x}\|_{2}\leq r\}$ for any $r\in\mathbb{R}^{+}$. Let $\mathbf{I}_{n}$ be the $n\times n$ identity matrix. Let $\|\cdot\|_{p}$ be the norm of a vector or matrix, in which we employ $p=2$ as the default. Given $\bm{x}=(x_{1},\dots,x_{n})$ and $\bm{y}=(y_{1},\dots,y_{n})$, we also define the sup-related measure as $\|\bm{x}-\bm{y}\|_{\alpha}^{\textrm{sup}}=\sup_{i\in[n]}\big|x_{i}-y_{i}\big|^{\alpha}$ for $\alpha>0$.
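As a concrete reading of two of the notations above, the following is a minimal Python sketch of the sup-related measure and of membership in the set $\mathcal{B}(r)$. The function names `sup_measure` and `in_ball` are illustrative only and do not come from the paper.

```python
import numpy as np


def sup_measure(x, y, alpha):
    """Sup-related measure: max over i of |x_i - y_i|^alpha, for alpha > 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.max(np.abs(x - y) ** alpha))


def in_ball(x, r):
    """Membership test for B(r) = {x : ||x||_2 <= r}."""
    return bool(np.linalg.norm(np.asarray(x, dtype=float)) <= r)
```

For finite-dimensional vectors the supremum is attained, so taking the maximum over coordinates is sufficient.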

Let $\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})$ be the space of continuous functions where $n_{0},n\in\mathbb{N}$. Provided a linear and bounded functional $\mathcal{F}:\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})\to\mathbb{R}$ and a function $f\in\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})$ which satisfies $f(\bm{x})\overset{\mathrm{d}}{\to}f^{*}$, then we have $\mathcal{F}(f(\bm{x}))\overset{\mathrm{d}}{\to}\mathcal{F}(f^{*})$ and $\mathbb{E}\left[\mathcal{F}(f(\bm{x}))\right]\to\mathbb{E}\left[\mathcal{F}(f^{*})\right]$ according to General Transformation Theorem (van2000asymptotic, Theorem 2.3) and Uniform Integrability (billingsley2013convergence), respectively.

Throughout this paper, we use the specific symbol $K$ to denote the concerned kernel for neural network learning. The superscript $(l)$ and stamp $t$ are used for recording the indexes of hidden layers and training epochs, respectively. We denote the Gaussian distribution by $\mathcal{N}(\mu_{x},\sigma_{x}^{2})$, where $\mu_{x}$ and $\sigma_{x}^{2}$ indicate the mean and variance, respectively. In general, we employ $\mathbb{E}(\cdot)$ and $\mathrm{Var}(\cdot)$ to denote the expectation and variance, respectively.

2.2 NNGP and NTK

We start this work with an $L$-hidden-layer fully-connected neural network, where $n_{l}$ and $n_{0}$ indicate the number of neurons in the $l$-th hidden layer for $l\in[L]$ and in the input layer, respectively, as follows

\[
\left\{\begin{aligned}
\bm{s}^{(0)} &= \bm{x}\ ,\\
\bm{h}^{(l)} &= \mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\ ,\quad l\in[L]\ ,\\
\bm{s}^{(l)} &= \phi(\bm{h}^{(l)})\ ,\quad l\in[L]\ ,\\
\bm{y} &= \bm{s}^{(L)}\ ,
\end{aligned}\right.
\tag{1}
\]

in which $\bm{x}\in\mathbb{R}^{n_{0}}$ and $\bm{y}\in\mathbb{R}^{n_{L}}$ indicate the variables of inputs and outputs respectively, $\bm{h}^{(l)}\in\mathbb{R}^{n_{l}}$ and $\bm{s}^{(l)}\in\mathbb{R}^{n_{l}}$ denote the pre-synaptic and post-synaptic variables of the $l$-th hidden layer respectively, $\mathbf{W}^{(l)}\in\mathbb{R}^{n_{l}\times n_{l-1}}$ and $\bm{b}^{(l)}\in\mathbb{R}^{n_{l}}$ are the parameter variables of connection weights and bias respectively, and $\phi$ is an element-wise activation function. For convenience, we here note the parameter variables at the $t$-th epoch as $\Theta^{(l)}_{t}=[\mathbf{W}^{(l)},\bm{b}^{(l)}]$, and $\Theta^{(l)}_{0}$ denotes the initialized parameters, of which the value obeys the Gaussian distribution $\mathcal{N}(0,\sigma^{2}/n_{l})$.
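The recursion in Eq. (1), together with the stated Gaussian initialization, can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the helper names `init_params` and `forward`, the default choice of `tanh` for $\phi$, and the layer widths in any call are hypothetical, and the code follows the text literally in drawing every weight and bias entry of layer $l$ from $\mathcal{N}(0,\sigma^{2}/n_{l})$.

```python
import numpy as np


def init_params(widths, sigma=1.0, seed=0):
    """Draw (W^(l), b^(l)) for l = 1..L with i.i.d. N(0, sigma^2 / n_l) entries.

    widths = [n_0, n_1, ..., n_L], where n_0 is the input dimension.
    """
    rng = np.random.default_rng(seed)
    params = []
    for n_prev, n_l in zip(widths[:-1], widths[1:]):
        std = sigma / np.sqrt(n_l)  # standard deviation of N(0, sigma^2 / n_l)
        W = rng.normal(0.0, std, size=(n_l, n_prev))
        b = rng.normal(0.0, std, size=n_l)
        params.append((W, b))
    return params


def forward(x, params, phi=np.tanh):
    """Apply Eq. (1): s^(0) = x, h^(l) = W^(l) s^(l-1) + b^(l), s^(l) = phi(h^(l))."""
    s = np.asarray(x, dtype=float)
    for W, b in params:
        h = W @ s + b  # pre-synaptic variable of layer l
        s = phi(h)     # post-synaptic variable of layer l
    return s           # y = s^(L)
```

A call such as `forward([0.5, -1.0, 2.0], init_params([3, 16, 16, 2]))` returns a vector in $\mathbb{R}^{n_{L}}$ with $n_{L}=2$; fixing `seed` makes the draw reproducible.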

Neural Network Gaussian Process (NNGP). For any $l\in[L]$, there is a claim that the conditional variable $\bm{h}^{(l)}\mid\bm{s}^{(l-1)}$ obeys the Gaussian distribution. In detail, one has $\textrm{Var}(\bm{h}^{(l)}\mid\bm{s}^{(l-1)})=\textrm{Var}(\mathbf{W}^{(l)})\,\mathbb{E}(\bm{s}^{(l-1)})^{2}+\textrm{Var}(\bm{b}^{(l)})$, where $\cdot^{2}$ and $\cdot$ denote the dot product and this equality holds according to $\mathbb{E}(\mathbf{W}^{(l)})=\mathbf{0}$, $\mathbb{E}(\bm{b}^{(l)})=\bm{0}$, and the mutual independence of elements $\mathbf{W}^{(l)}$ and $\bm{b}^{(l)}$. It is reasonable to conjecture that $\bm{s}^{(l-1)}\sim\mathcal{N}(\bm{0},\mathbf{I}_{n_{l-1}}/C_{\phi})$ according to the principle of mathematical induction and $\bm{x}\sim\mathcal{N}(\bm{0},\mathbf{I}_{n_{0}})$, where $C_{\phi}=1/\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left(\phi(z)\right)^{2}$. Hence, one has

\[
\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}}\left(\frac{1}{C_{\phi}}+1\right)\mathbf{I}_{n_{l}}\right)\ .
\]

Moreover, the NNGP kernel is defined by

\[
K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\sigma^{2}\,\mathbb{E}\left\langle\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right\rangle+\sigma^{2}
\]

with

\[
\lim_{n_{l-1}\to\infty}\mathbb{E}\left\langle\bm{h}^{(l)}\mid\bm{s}^{(l-1)},\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\right\rangle=\sigma^{2}\left(\frac{1}{C_{\phi}}+1\right)\ .
\]
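The constant $C_{\phi}$ and the limiting value $\sigma^{2}(1/C_{\phi}+1)$ can be evaluated numerically for a given activation. The sketch below is illustrative only: the helper names, the use of Gauss-Hermite quadrature, and the choice of ReLU as $\phi$ are assumptions made for this example and are not prescribed by the text.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def gaussian_expectation(f, n_nodes=64):
    """Approximate E[f(z)] for z ~ N(0, 1) by Gauss-Hermite quadrature."""
    # hermegauss uses the weight exp(-z^2 / 2); its weights sum to sqrt(2 * pi).
    nodes, weights = hermegauss(n_nodes)
    return float(np.sum(weights * f(nodes)) / np.sqrt(2.0 * np.pi))


def c_phi(phi, n_nodes=64):
    """C_phi = 1 / E_{z ~ N(0,1)}[phi(z)^2]."""
    return 1.0 / gaussian_expectation(lambda z: phi(z) ** 2, n_nodes)


def nngp_limit(sigma, c):
    """Limiting value sigma^2 * (1 / C_phi + 1) stated above."""
    return sigma ** 2 * (1.0 / c + 1.0)


def relu(z):
    return np.maximum(z, 0.0)


c_relu = c_phi(relu)                  # ReLU is an assumed example activation
limit_relu = nngp_limit(1.0, c_relu)  # sigma = 1 is an assumed example value
```

For ReLU, $\mathbb{E}(\phi(z))^{2}=1/2$ by symmetry, so $C_{\phi}=2$ and the limit with $\sigma=1$ equals $3/2$; the quadrature reproduces these values up to floating-point error.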

Neural Tangent Kernel (NTK). The training of the concerned ANNs consists in optimizing $\bm{y}=f(\bm{x};\Theta)$ in the function space, supervised by a functional loss $\hbar(\Theta)$, such as the square or cross-entropy functions, where we employ $\Theta$ to denote the variable of any parameter. Under gradient descent in continuous time, the parameters evolve as

\[
\frac{\mathrm{d}\Theta}{\mathrm{d}t}=-\frac{\mathrm{d}\hbar(\Theta)}{\mathrm{d}\Theta}=-\frac{\mathrm{d}\hbar(\Theta)}{\mathrm{d}f(\bm{x};\Theta)}\frac{\mathrm{d}f(\bm{x};\Theta)}{\mathrm{d}\Theta}\ .
\]
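The dynamics above can be illustrated with an explicit Euler discretization, $\Theta_{t+1}=\Theta_{t}-\eta\,\mathrm{d}\hbar/\mathrm{d}\Theta$. The following sketch is a toy example under stated assumptions: a linear model $f(\bm{x};\Theta)=\bm{x}^{\top}\Theta$, a squared loss, and a step size `eta` are chosen purely for illustration, and the function name `gradient_descent` is hypothetical.

```python
import numpy as np


def gradient_descent(theta0, x, y, eta=0.1, steps=300):
    """Euler steps of dTheta/dt = -(d loss / d f) (d f / d Theta) on a toy problem.

    Assumed setting: f(x; Theta) = x @ Theta and loss = 0.5 * ||f - y||^2.
    """
    theta = np.array(theta0, dtype=float)
    losses = []
    for _ in range(steps):
        f = x @ theta                    # model output f(x; Theta)
        losses.append(0.5 * float(np.sum((f - y) ** 2)))
        dloss_df = f - y                 # derivative of the loss w.r.t. f
        df_dtheta = x                    # Jacobian of f w.r.t. Theta (linear model)
        grad = df_dtheta.T @ dloss_df    # chain rule: d loss / d Theta
        theta = theta - eta * grad       # one explicit Euler step
    return theta, losses
```

The factorization of the gradient into `dloss_df` and `df_dtheta` mirrors the two factors in the chain rule; for a neural network the Jacobian would depend on $\Theta$ rather than being constant.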

For any $l\geq 2$, there is a claim that the gradient variable vector $\bm{h}^{(l)}\mid\bm{s}^{(l-1)}$ obeys the Gaussian distribution. Taking $\mathbf{W}^{(l-1)}$ as an example, one has $\textrm{Var}({\partial\bm{h}^{(l)}}/{\partial\mathbf{W}_{ij}^{(l-1)}})=\textrm{Var}(\mathbf{W}^{(l)})\,\mathbb{E}({\partial\bm{s}^{(l-1)}}/{\partial\bm{h}^{(l-1)}})^{2}\,\textrm{Var}(\bm{s}^{(l-2)})$ for $i,j\in\mathbb{N}^{+}$, where ${\partial\bm{s}^{(l-1)}}/{\partial\bm{h}^{(l-1)}}$ adopts the dot operation. Hence, one has

\[
\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}C^{\prime}_{\phi}C_{\phi}}\mathbf{I}_{n_{l-1}}\right)\ ,
\]

where $C^{\prime}_{\phi}=1/\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left(\phi^{\prime}(z)\right)^{2}$. Moreover, the NTK kernel is defined by

\[
\left\{\begin{aligned}
K_{\textrm{NTK}}^{(1)}\left(\bm{x},\bm{x}^{\prime}\right) &= K_{\textrm{NNGP}}^{(1)}\left(\bm{x},\bm{x}^{\prime}\right)\ ,\quad\text{for}\quad l=1\ ,\\
K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right) &= K_{\textrm{NTK}}^{(l-1)}\left(\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\right\rangle\\
&\quad+K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,\quad\text{for}\quad l\geq 2\ ,
\end{aligned}\right.
\]

with

\[
\left\{\begin{aligned}
&\lim_{n_{l-1}\to\infty}\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}},\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right\rangle=\frac{\sigma^{2}}{C^{\prime}_{\phi}C_{\phi}}\ ,\\
&\lim_{n_{l-1}\to\infty}\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}},\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}}\right\rangle=\frac{\sigma^{2}}{C^{\prime}_{\phi}}\ .
\end{aligned}\right.
\]

2.3 Related Studies

Past decades have witnessed a growing interest in the correspondence between neural network learning and Gaussian processes. Neal (neal1996:GP, ) presented the seminal work by showing that a one-hidden-layer network of infinite width turns into a Gaussian process. Cho et al. (cho2009:GP, ) linked multi-layer networks using rectified polynomial activations with compositional Gaussian kernels. Lee et al. (lee2018:NNGP, ) showed that infinitely wide fully connected neural networks with commonly used activation functions converge to Gaussian processes. Recently, the NNGP has been extended to many types of networks, including Bayesian networks (novak2018:GP, ), deep networks with convolution (garriga2019:GP, ), and recurrent networks (yang2019:GP, ).

NNGPs can provide a quantitative characterization of how likely certain outcomes are if some aspects of the system are not exactly known. In the experiments of (lee2018:NNGP, ), an explicit estimate in the form of variance prediction is given to each test sample. Besides, Pang et al. (pang2019:NNGP, ) showed that the NNGP is good at handling data with noise and is superior to discretizing differential operators in solving some linear or nonlinear partial differential equations. Park et al. (park2020:NNGP, ) employed the NNGP kernel in the performance measurement of network architectures for the purpose of speeding up the neural architecture search. Pleiss et al. (pleiss2022:NNGP, ) leveraged the effects of width on the capacity of neural networks by decoupling the generalization and width of the corresponding NNGP. Despite great progress, numerous studies about NNGP still rely on increasing width to induce the Gaussian processes. Recently, Zhang et al. (zhang2022:NNGP, ) proposed a depth paradigm that achieves an NNGP by increasing depth, providing complementary support for the existing theory of NNGP.

The NTK kernel, first proposed by Jacot et al. (jacot2018:NTK, ), relates a neural network trained by randomly initialized gradient descent to a Gaussian distribution. It has been proved that many types of networks, including graph neural networks on bioinformatics datasets (du2019:GNTK, ) and convolutional neural networks (arora2019:NTK, ) on medium-scale datasets such as the UCI database, admit a corresponding kernel function. Some researchers have applied the NTK to various fields, such as federated learning (huang2021:NTK, ), mean-field analysis (mahankali2023:NTK, ), and natural language processing (malladi2023:NTK, ). Recently, Hron et al. (hron2020:attention, ) extended the NNGP and NTK to multi-head attention architectures as the number of heads tends to infinity. Avidan et al. (avidan2023:connecting, ) provided a unified theoretical framework that connects the NTK and NNGP using the Markov proximal learning model.

3 The Unified Kernel

This work considers a general form of supervised learning

$$
\min_{\Theta}\quad\hbar(\Theta)+\lambda\mathcal{R}(\Theta) \tag{2}
$$

where $\mathcal{R}(\Theta)$ is a regularizer and $\lambda$ is the corresponding multiplier. Based on gradient descent, Eq. (2) generally leads to a dynamical system with respect to the parameter $\Theta$

$$
\frac{\mathrm{d}\Theta}{\mathrm{d}t}=-\frac{\mathrm{d}\hbar(\Theta)}{\mathrm{d}\Theta}-\lambda\frac{\mathrm{d}\mathcal{R}(\Theta)}{\mathrm{d}\Theta}\ , \tag{3}
$$

where we omit the learning rate for simplicity. From Eq. (3), the value of $\lambda$ can be regarded as a balance between the gradient and the regularizer. In the next subsections, we employ the initialized parameter and an epoch-related parameter, respectively, to implement $\mathrm{d}\mathcal{R}(\Theta)/\mathrm{d}\Theta$; both regularization implementations induce the UNK kernel. Furthermore, Subsection 5.2 provides in-depth discussions about the effect of $\lambda$ on the performance of the UNK kernel.

3.1 Initialization Parameter $\Theta_{0}$

In this work, we first consider leveraging the effects of initialized parameters (for example, one simply employs the square regularizer in Eq. (3)), and thus Eq. (3) becomes

$$
\frac{\mathrm{d}\Theta}{\mathrm{d}t}=-\frac{\mathrm{d}\hbar(\Theta)}{\mathrm{d}\Theta}\Big|_{t}-\lambda\Theta_{0}\ , \tag{4}
$$

where $\Theta_{0}$ is the initialized parameter and $\lambda\in\mathbb{R}$ controls the trade-off between the parameter gradient and the initialization.

Now, we present our main conclusion as follows.

Theorem 1

For a network of depth $L$ with a Lipschitz activation $\phi$ and in the limit of the layer width $n_{1},\dots,n_{L-1}\to\infty$, Eq. (4) induces a kernel with the following form, for $l\in[L]$ and $t\geq 0$,

$$
K_{\textrm{UNK}}^{(l)}\left(t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\exp\left(\frac{-t\,|\lambda|}{\sqrt{1-\rho_{t}^{2}}\,\sigma_{0}\sigma_{t}}\right)\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}}{\partial\Theta_{t}}\right\rangle\ , \tag{5}
$$

where $\sigma_{0}^{2}$ and $\sigma_{t}^{2}$ denote the variances of the variables at training epochs $0$ and $t$, respectively, and $\rho_{t}$ denotes the corresponding correlation coefficient. Furthermore, $K_{\textrm{UNK}}(t,\cdot,\cdot)$ has the following properties of limiting kernels

  • (i)

    For the case of $\lambda=0$ or $t=0$, the unified kernel degenerates to the NTK kernel. Formally, for $l\in[L]$, the following hold

$$
\begin{aligned}
K_{\textrm{UNK}}^{(l)}\left(t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)};\lambda=0\right)&=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,\\
K_{\textrm{UNK}}^{(l)}\left(t=0,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)&=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .
\end{aligned}
$$
  • (ii)

    For the case of $\lambda\neq 0$ and $t\to\infty$, the unified kernel converges to the NNGP kernel, i.e., the following holds for $l\in[L]$ as $t\to\infty$

$$
K_{\textrm{UNK}}^{(l)}\left(t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\to K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .
$$

Theorem 1 presents the existence and explicit formulation of the unified kernel $K_{\textrm{UNK}}(t,\cdot,\cdot)$ that corresponds to Eq. (4) for neural network learning. For the case of $t=0$ or $\lambda=0$, the proposed kernel degenerates to the NTK kernel, in which the parameter updates follow a Gaussian distribution. Correspondingly, for the case of $t\to\infty$ and $\lambda\neq 0$, the proposed kernel approaches the NNGP kernel, which implies that a neural network model trained by Eq. (4) can reach an equilibrium state in the long-time regime. The proof sketch is given in Subsection 3.3, and the full proof can be found in the Appendix.
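As a small numerical illustration of the two regimes in Theorem 1, the sketch below evaluates only the exponential prefactor of Eq. (5). The function name `unk_decay` is hypothetical, and the quantities $\rho_t$, $\sigma_0$, and $\sigma_t$ are treated as user-supplied constants, which is a simplifying assumption since they depend on the epoch $t$ in the theorem.

```python
import math


def unk_decay(t, lam, rho_t, sigma_0, sigma_t):
    """Exponential prefactor of Eq. (5).

    Computes exp(-t * |lam| / (sqrt(1 - rho_t^2) * sigma_0 * sigma_t)).
    rho_t, sigma_0 and sigma_t are assumed constants here (illustration only).
    """
    if not -1.0 < rho_t < 1.0:
        raise ValueError("rho_t must lie strictly inside (-1, 1)")
    if sigma_0 <= 0.0 or sigma_t <= 0.0:
        raise ValueError("sigma_0 and sigma_t must be positive")
    scale = math.sqrt(1.0 - rho_t ** 2) * sigma_0 * sigma_t
    return math.exp(-t * abs(lam) / scale)


if __name__ == "__main__":
    # The prefactor is 1 when t = 0 or lam = 0 and shrinks as t grows.
    for t in (0.0, 1.0, 10.0, 100.0):
        print(t, unk_decay(t, lam=0.5, rho_t=0.3, sigma_0=1.0, sigma_t=1.0))
```

With $t=0$ or $\lambda=0$ the prefactor equals one, which matches case (i); for $\lambda\neq 0$ it decays monotonically toward zero as $t$ grows, which is the regime addressed by case (ii).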

Similar to the NNGP and NTK kernels, the unified kernel is also of a recursive form, that is,

$$
\begin{aligned}
K_{\textrm{UNK}}^{(l)}\left(t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)
&=K_{\textrm{UNK}}^{(l-1)}\left(t,\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\right\rangle\\
&\quad+\exp\left(\frac{-t\,|\lambda|}{\sqrt{1-\rho_{t}^{2}}\,\sigma_{0}\sigma_{t}}\right)K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .
\end{aligned} \tag{6}
$$
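The recursion in Eq. (6) can be sketched numerically for one concrete setting. The code below is an illustration under several assumptions that are not taken from the paper: a ReLU activation with its closed-form arc-cosine expectations, weight and bias variances `sigma_w2` and `sigma_b2`, the derivative expectation instantiated as `sigma_w2 * E[relu'(u) relu'(v)]`, a zero kernel below the first layer, and a single user-supplied scalar `decay` standing in for the exponential prefactor. All function names are hypothetical.

```python
import numpy as np


def relu_expectations(k11, k12, k22):
    """Closed-form Gaussian expectations for ReLU.

    For (u, v) ~ N(0, [[k11, k12], [k12, k22]]) returns
    (E[relu(u) relu(v)], E[relu'(u) relu'(v)]).
    """
    norm = np.sqrt(k11 * k22)
    cos_theta = np.clip(k12 / norm, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    e_phi = norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)
    e_dphi = (np.pi - theta) / (2.0 * np.pi)
    return e_phi, e_dphi


def unk_recursion(x, xp, depth, decay, sigma_w2=2.0, sigma_b2=0.0):
    """Layer-wise recursion in the shape of Eq. (6) for nonzero inputs x, xp.

    Returns (unified kernel, NNGP kernel) at layer `depth`.
    `decay` is the assumed value of the exponential prefactor.
    """
    d = x.shape[0]

    def first_layer(a, b):
        return sigma_b2 + sigma_w2 * float(np.dot(a, b)) / d

    k11, k12, k22 = first_layer(x, x), first_layer(x, xp), first_layer(xp, xp)
    unk = decay * k12  # layer 1: Eq. (6) with a zero previous-layer kernel (assumption)
    for _ in range(2, depth + 1):
        # Expectations use the covariance of the previous layer's pre-activations.
        e_phi, e_dphi = relu_expectations(k11, k12, k22)
        # E[relu(u)^2] = k / 2 gives the diagonal NNGP updates.
        k11, k22 = sigma_b2 + sigma_w2 * k11 / 2.0, sigma_b2 + sigma_w2 * k22 / 2.0
        k12 = sigma_b2 + sigma_w2 * e_phi
        unk = unk * sigma_w2 * e_dphi + decay * k12
    return unk, k12
```

Setting `decay=1.0` reduces the loop to the familiar NTK-style recursion, in line with case (i) of Theorem 1, while the NNGP value returned alongside does not depend on `decay`.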

3.2 Epoch-related Parameter $\Theta_{t^{\prime}}$

From Eq. (6), it is observed that the unified kernel of the $l$-th hidden layer at epoch $t$ can be computed recursively from a combination of the unified kernel of the $(l-1)$-th hidden layer at epoch $t$ and the NNGP kernel of the $l$-th hidden layer at epoch $t$. Inspired by this observation, we extend the fundamental formula in Eq. (4) as

$$
\frac{\mathrm{d}\Theta}{\mathrm{d}t}=-\frac{\mathrm{d}\hbar(\Theta)}{\mathrm{d}\Theta}\Big|_{t}-\lambda\Theta_{t^{\prime}} \tag{7}
$$

given $t^{\prime}<t$. Eq. (7) is a more general updating formulation that contains Eq. (4) as the special case $t^{\prime}=0$. For example, $\Theta_{t^{\prime}}$ may indicate a collection of pre-given parameters from pre-training or meta-learning, so that Eq. (7) becomes an optimization computation for fine-tuning. Further, the derived kernel may support the theoretical analysis of fine-tuning after pre-training. The effectiveness of Eq. (7) will be demonstrated in Section 5.
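To make the update rule concrete, the following is a minimal sketch of an explicit Euler discretization of Eq. (7) on a toy problem. It is not the training procedure used in the experiments: the explicit learning rate `lr`, the function names, and the quadratic loss in the usage note are assumptions introduced for illustration. Passing no anchor uses the initialization as $\Theta_{t^{\prime}}$, which corresponds to Eq. (4).

```python
import numpy as np


def anchored_gradient_step(theta, grad_fn, theta_anchor, lam, lr=0.1):
    """One explicit Euler step of Eq. (7).

    theta <- theta - lr * (grad_fn(theta) + lam * theta_anchor),
    where theta_anchor plays the role of Theta_{t'} and lr is an assumed step size.
    """
    return theta - lr * (grad_fn(theta) + lam * theta_anchor)


def train(theta_init, grad_fn, lam, steps, lr=0.1, theta_anchor=None):
    """Iterate the anchored update; the anchor defaults to the initialization (Eq. (4))."""
    anchor = theta_init.copy() if theta_anchor is None else theta_anchor
    theta = theta_init.copy()
    for _ in range(steps):
        theta = anchored_gradient_step(theta, grad_fn, anchor, lam, lr)
    return theta


if __name__ == "__main__":
    # Toy quadratic loss 0.5 * ||theta - target||^2 with gradient theta - target.
    target = np.array([1.0, -2.0])
    theta_init = np.array([0.5, 0.5])
    print(train(theta_init, lambda th: th - target, lam=0.2, steps=300))
```

For this toy loss the iteration settles where the gradient balances the constant term, that is at `target - lam * theta_anchor`, so a nonzero $\lambda$ shifts the stationary point by an amount determined by the anchor (the initialization or a pre-trained parameter).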

We directly provide the theoretical framework of unified kernels relative to the parameter updating in Eq. (7).

Theorem 2

For a network of depth $L$ with a Lipschitz activation $\phi$ and in the limit of the layer width $n_{1},\dots,n_{L-1}\to\infty$, Eq. (7) induces a kernel with the following form, for $l\in[L]$ and $t\geq t^{\prime}$,

$$
K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\exp\left(\frac{(t^{\prime}-t)\,|\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\,\sigma_{t}\sigma_{t^{\prime}}}\right)\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}}{\partial\Theta_{t}}\right\rangle\ , \tag{8}
$$

where $\rho_{t,t^{\prime}}$ denotes the correlation coefficient of variables along training epochs $t$ and $t^{\prime}$, and $\sigma_{t}^{2}$ and $\sigma_{t^{\prime}}^{2}$ are the corresponding variances. Furthermore, the unified kernel $K_{\textrm{UNK}}(t,t^{\prime},\cdot,\cdot)$ has the following properties

  • (i)

    For the case of $\lambda=0$ or $t=t^{\prime}$, the unified kernel degenerates to the NTK kernel, that is, for $l\in[L]$

$$
\begin{aligned}
K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)};\lambda=0\right)&=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,\\
K_{\textrm{UNK}}^{(l)}\left(t,t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)&=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .
\end{aligned}
$$
  • (ii)

    For the case of $\lambda\neq 0$ and $t-t^{\prime}\to\infty$, the unified kernel converges to the NNGP kernel, i.e., the following holds for $l\in[L]$ as $t-t^{\prime}\to\infty$,

$$
K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\to K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .
$$

Theorem 2, a general extension of Theorem 1, presents a unified kernel $K_{\textrm{UNK}}(t,t^{\prime},\cdot,\cdot)$ for neural network learning with Eq. (7). For the case of $t=t^{\prime}$ or $\lambda=0$, the proposed kernel degenerates to the NTK kernel, in which the parameter updates follow a Gaussian distribution. Correspondingly, for the case of $t-t^{\prime}\to\infty$ and $\lambda\neq 0$, the proposed kernel approaches the NNGP kernel, which implies that a neural network model trained by Eq. (7) can reach an equilibrium state in the long-time regime. We provide a proof sketch in Subsection 3.3; the full proof can be found in the Appendix.

The unified kernel induced by Eq. (7) can also be rewritten in a recursive form

$$
\begin{aligned}
K_{\textrm{UNK}}^{(l)}\left(t,t^{\prime},\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)
&=K_{\textrm{UNK}}^{(l-1)}\left(t,t^{\prime},\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\Big|_{\Theta_{t}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\Big|_{\Theta_{t^{\prime}}}\right\rangle\\
&\quad+\exp\left(\frac{(t^{\prime}-t)\,|\lambda|}{\sqrt{1-\rho_{t,t^{\prime}}^{2}}\,\sigma_{t}\sigma_{t^{\prime}}}\right)K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t^{\prime}})\right)\ .
\end{aligned} \tag{9}
$$

3.3 Proof Sketch

Since Eq. (4) is the special case of Eq. (7) with $t^{\prime}=0$, it suffices to treat Eq. (7). We start the proof by unfolding Eq. (7) into the following discrete form

$$
\Theta_{t+\mathrm{d}t}=\Theta_{t}-\frac{\mathrm{d}\hbar(\Theta)}{\mathrm{d}\Theta}\Big|_{t}-\lambda\Theta_{t^{\prime}}\ ,
$$

where $t+\mathrm{d}t$ and $t$ represent epoch stamps and $\mathrm{d}t$ denotes an infinitesimal epoch increment. By mathematical induction, we may assume that $\Theta_{t^{\prime}}$ is drawn from the Gaussian distribution $\mathcal{N}(0,\sigma_{t^{\prime}}^{2})$. By direct computation, we have

$$
\mathrm{Var}\left(\Theta_{t+\mathrm{d}t}\right)=\mathrm{Var}\left(\Theta_{t}-\nabla_{t}\right)+\lambda^{2}\mathrm{Var}\left(\Theta_{t^{\prime}}\right)+2\left[\mathbb{E}\left(\Theta_{t}-\nabla_{t}\right)\mathbb{E}\left(\lambda\Theta_{t^{\prime}}\right)-\mathbb{E}\left((\Theta_{t}-\nabla_{t})\lambda\Theta_{t^{\prime}}\right)\right]\ ,
$$

where t=d(Θt)/dΘtsubscript𝑡dPlanck-constant-over-2-pisubscriptΘ𝑡dsubscriptΘ𝑡\nabla_{t}={\mathop{}\!\mathrm{d}\hbar(\Theta_{t})}/{\mathop{}\!\mathrm{d}% \Theta_{t}}∇ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = roman_d roman_ℏ ( roman_Θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) / roman_d roman_Θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Notice that ΘttsubscriptΘ𝑡subscript𝑡\Theta_{t}-\nabla_{t}roman_Θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - ∇ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is almost independent to ΘtsubscriptΘsuperscript𝑡\Theta_{t^{\prime}}roman_Θ start_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT as t𝑡t\to\inftyitalic_t → ∞. It is observed that Var(Θt+dt)VarsubscriptΘ𝑡d𝑡\mathrm{Var}(\Theta_{t+\mathop{}\!\mathrm{d}t})roman_Var ( roman_Θ start_POSTSUBSCRIPT italic_t + roman_d italic_t end_POSTSUBSCRIPT ) converges as n𝑛n\to\inftyitalic_n → ∞ and t𝑡t\to\inftyitalic_t → ∞. Thus, the variable sequence {Var(Θt)}tsubscriptVarsubscriptΘ𝑡𝑡\{\mathrm{Var}(\Theta_{t})\}_{t}{ roman_Var ( roman_Θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is bounded. Here, we define that Var(Θt)σt2VarsubscriptΘ𝑡superscriptsubscript𝜎𝑡2\mathrm{Var}(\Theta_{t})\leq\sigma_{t}^{2}roman_Var ( roman_Θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ≤ italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and σ2=maxtσt2superscript𝜎2subscript𝑡superscriptsubscript𝜎𝑡2\sigma^{2}=\max_{t}\sigma_{t}^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = roman_max start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. 
Let $f_{\Theta_{t}}(\cdot)$ denote the probability density function of $\Theta_{t}$. Thus, we have

$$f_{\Theta_{t+\mathrm{d}t}}(u)=\iiint\delta(v)\,f_{\Theta_{t}}(x)\,f_{\nabla_{t}}(y)\,f_{\Theta_{0}}(z)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}z \tag{10}$$

with

$$\left\{\begin{aligned}
f_{\Theta_{t}}(x)&=\frac{1}{\sigma_{x}\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2\sigma_{x}^{2}}\right)\\
f_{\nabla_{t}}(y)&=\frac{1}{\sigma_{y}\sqrt{2\pi}}\exp\left(-\frac{y^{2}}{2\sigma_{y}^{2}}\right)\\
f_{\Theta_{0}}(z)&=\frac{1}{\sigma_{z}\sqrt{2\pi}}\exp\left(-\frac{z^{2}}{2\sigma_{z}^{2}}\right)
\end{aligned}\right.$$

where $v=u-x+y+\lambda z$ and $\delta(\cdot)$ indicates the Dirac-delta function. According to the independence, one has

$$f_{\Theta_{t+\mathrm{d}t}}(u)=\iint_{x,y}f_{\Theta_{t}}(x)\,f_{\nabla_{t}}(y)\,\mathrm{d}x\,\mathrm{d}y\int_{\Omega_{z}}f_{\Theta_{0}}(z)\,\mathrm{d}z, \tag{11}$$

where $\Omega_{z}=\{(x,y)\mid(-u+x-y)/\lambda=0\}$. Thus, we can claim that $\Theta_{t+\mathrm{d}t}$ obeys a Gaussian distribution with zero mean, which completes the mathematical induction.
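As a worked illustration of this induction step, suppose (more strongly than the text, which only claims approximate independence) that $x$, $y$, and $z$ are exactly independent zero-mean Gaussians with variances $\sigma_{x}^{2}$, $\sigma_{y}^{2}$, and $\sigma_{z}^{2}$. The Dirac constraint $v=0$ forces $u=x-y-\lambda z$, so the characteristic function of $u$ factorizes:

```latex
\begin{aligned}
\mathbb{E}\left[\mathrm{e}^{\mathrm{i}\omega u}\right]
&= \mathbb{E}\left[\mathrm{e}^{\mathrm{i}\omega x}\right]\,
   \mathbb{E}\left[\mathrm{e}^{-\mathrm{i}\omega y}\right]\,
   \mathbb{E}\left[\mathrm{e}^{-\mathrm{i}\omega\lambda z}\right] \\
&= \exp\left(-\tfrac{1}{2}\,\omega^{2}
   \left(\sigma_{x}^{2}+\sigma_{y}^{2}+\lambda^{2}\sigma_{z}^{2}\right)\right).
\end{aligned}
```

This is the characteristic function of $\mathcal{N}(0,\sigma_{x}^{2}+\sigma_{y}^{2}+\lambda^{2}\sigma_{z}^{2})$. Under the stated assumption, the updated parameter is therefore again zero-mean Gaussian, in line with the claim above.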

All statistics of post-synaptic variables $\bm{s}$ can be calculated via the moment generating function $\mathcal{M}_{\bm{s}}(a)=\int\mathrm{e}^{a\bm{s}}f(\bm{s})\,\mathrm{d}\bm{s}$. Here, we focus on the second moment of $s=\bm{s}^{(l)}_{i}$ for $l\in[L]$ and $i\in[n_{l}]$, that is,

$$m_{2}(s)=\int s^{2}\,f(s)\,\mathrm{d}s=\int s^{2}(\Theta)\,f_{\Theta}(\Theta)\,\frac{\mathrm{d}s(\Theta)}{\mathrm{d}\Theta}\,\mathrm{d}\Theta. \tag{12}$$
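For intuition on how the moment generating function yields the second moment, consider the simplest case of a scalar $s\sim\mathcal{N}(0,\sigma^{2})$. This is an illustrative special case, not the general post-synaptic distribution treated in the paper:

```latex
\mathcal{M}_{s}(a)=\exp\left(\tfrac{1}{2}\sigma^{2}a^{2}\right),
\qquad
m_{2}(s)
=\left.\frac{\mathrm{d}^{2}\mathcal{M}_{s}(a)}{\mathrm{d}a^{2}}\right|_{a=0}
=\left.\left(\sigma^{2}+\sigma^{4}a^{2}\right)
  \exp\left(\tfrac{1}{2}\sigma^{2}a^{2}\right)\right|_{a=0}
=\sigma^{2}.
```

The second derivative at $a=0$ recovers $\int s^{2}f(s)\,\mathrm{d}s$, which is the quantity on the left-hand side of the equation above.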

By substituting Eq. (10) into Eq. (12), we can obtain the concerned kernel

$$K_{\mathrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\exp\left(\frac{(t'-t)\,|\lambda|}{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}\right)\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\Theta_{t}},\frac{\partial\bm{h}^{(l)}}{\partial\Theta_{t}}\right\rangle,$$

which is the desired kernel in Theorem 1.

It is observed that Eq. (5) equals the NTK kernel in the case of $\lambda\neq 0$ and $t=t'$. Similarly, it is easily proved that

$$\lim_{t\to\infty}\int_{t'}^{t}K_{\mathrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\mathrm{d}t=\frac{\sigma^{2}\delta(t)}{|\lambda|}K_{\mathrm{NNGP}}^{(l)},$$

where $\delta(t)\propto\sqrt{1-\rho_{t,t'}^{2}}\sim\mathbf{\Theta}((t-t')^{-1})$. The above formula reveals that a smaller absolute value of $\lambda$ may lead to a larger convergence rate. Thus, we have

$$K_{\mathrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\to K_{\mathrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right),$$

as $t-t'\to\infty$. The detailed proof can be accessed in the Appendix. $\square$
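To make the role of the time-dependent prefactor concrete, the following minimal sketch evaluates only the exponential factor of the UNK kernel. The function name `unk_time_factor` is hypothetical, and $\rho_{t,t'}$, $\sigma_{t}$, and $\sigma_{t'}$ are passed in as fixed numbers, which is a simplifying assumption because in the paper they depend on $t$ and $t'$.

```python
import math

def unk_time_factor(t, t_prime, lam, rho, sigma_t, sigma_t_prime):
    """Evaluate exp((t' - t) * |lam| / (sqrt(1 - rho^2) * sigma_t * sigma_t')).

    rho, sigma_t, and sigma_t_prime are treated as given constants here
    (an assumption made for illustration only).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    denom = math.sqrt(1.0 - rho**2) * sigma_t * sigma_t_prime
    return math.exp((t_prime - t) * abs(lam) / denom)

# At t = t' the factor is exactly 1, so only the gradient inner product remains.
same = unk_time_factor(t=5.0, t_prime=5.0, lam=0.1, rho=0.5,
                       sigma_t=1.0, sigma_t_prime=1.0)
# For t > t' the factor decays as the gap t - t' grows.
later = unk_time_factor(t=50.0, t_prime=5.0, lam=0.1, rho=0.5,
                        sigma_t=1.0, sigma_t_prime=1.0)
print(same, later)
```

With these constants held fixed, the factor equals one when $t=t'$ and shrinks monotonically as $t-t'$ increases, at a rate governed by $|\lambda|$.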

4 Uniform Tightness and Convergence

Here, we provide two theorems to further show the theoretical properties of the proposed UNK kernel.

4.1 Uniform Tightness of NNGP(d)

Now, we present the following theorem.

Theorem 3

For any $l\in[L]$, the unified kernel $K_{\mathrm{UNK}}^{(l)}$, described in Theorem 2, is uniformly tight in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$.

Theorem 3 delineates the asymptotic behavior of $K_{\mathrm{UNK}}^{(l)}$ as $t-t'\to\infty$ for $l\in[L]$, revealing an intrinsic characteristic of uniform tightness. Based on Theorem 3, one can obtain the properties of functional limit and continuity of $K_{\mathrm{UNK}}^{(l)}$, in analogy to those of $K_{\mathrm{NNGP}}^{(l)}$ (bracale2020:asymptotic, ).

Theorem 3 builds upon three useful lemmas from (zhang2022:NNGP, ).

Lemma 4.4

Let $\{\bm{s}_{1},\bm{s}_{2},\dots,\bm{s}_{t}\}$ denote a sequence of random variables in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$. This stochastic process is uniformly tight in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$ if the following two conditions hold: (1) $\bm{x}=\bm{0}$ is a uniformly tight point of $\bm{s}_{t}(\bm{x})$ ($t\in[T]$) in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$; (2) for any $\bm{x},\bm{x}'\in\mathbb{R}^{n_{0}}$ and $t\in[T]$, there exist $\alpha,\beta,C>0$ such that

$$\mathbb{E}\left[|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}')|^{\alpha}\right]\leq C\|\bm{x}-\bm{x}'\|_{\beta+n_{0}}.$$

Lemma 4.4 provides the core guidance for proving Theorem 3.

Lemma 4.5

Based on the notations of Lemma 4.4, $\bm{x}=\bm{0}$ is a uniformly tight point of $\bm{s}_{t}(\bm{x})$ ($t\in[T]$) in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$.

The convergence in distribution from Lemma 4.5 paves the way for the convergence of expectations.

Lemma 4.6

Based on the notations of Lemma 4.4, for any $\bm{x},\bm{x}'\in\mathbb{R}^{n_{0}}$ and $t\in[T]$, there exist $\alpha,\beta,C>0$ such that $\mathbb{E}\left[\|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}')\|_{\alpha}^{\mathrm{sup}}\right]\leq C\|\bm{x}-\bm{x}'\|_{\beta+n_{0}}$.

The proofs of the lemmas above can be found in Appendix D. Notice that these lemmas concern the stochastic process of hidden neuron vectors over increasing epochs regardless of the layer index, i.e., they hold for $\bm{s}^{(l)}$ $(l\in[L])$. For the case of two time stamps $t$ and $t'$ with $t'<t$, the concerned stochastic process becomes $\{\bm{s}_{t'},\bm{s}_{2},\dots,\bm{s}_{t}\}$, and thus the above conclusions also hold. Therefore, Theorem 3 follows by substituting Lemmas 4.5 and 4.6 into Lemma 4.4.

4.2 Tight Bound for the Smallest Eigenvalue

In this subsection, we investigate the learning convergence of the UNK kernel. The key idea is to bound the smallest eigenvalue of $K_{\mathrm{UNK}}^{(l)}$ for $l\in[L]$, since the learning convergence is related to the positive definiteness of the limiting neural kernels. Here, we consider neural networks equipped with the ReLU activation and draw the following conclusion.

Theorem 4.7

Let $\bm{x}_{1},\dots,\bm{x}_{N}$ be i.i.d. sampled from $P_{X}$, which satisfies that $P_{X}=\mathcal{N}(0,\eta^{2})$, $\int\bm{x}\,\mathrm{d}P(\bm{x})=0$, $\int\|\bm{x}\|_{2}\,\mathrm{d}P(\bm{x})=\mathbf{\Theta}(\sqrt{n_{0}})$, and $\int\|\bm{x}\|_{2}^{2}\,\mathrm{d}P(\bm{x})=\mathbf{\Theta}(n_{0})$. For an integer $r\geq 2$, with probability $1-\delta>0$, we have

$$\chi_{\min}\left(K_{\mathrm{UNK}}^{(l)}\right)=\mathbf{\Theta}(n_{0})$$

for $l\in[L]$, where $\chi_{\min}$ denotes the smallest eigenvalue and

$$\delta\leq N\mathrm{e}^{-\Omega(n_{0})}+N^{2}\mathrm{e}^{-\Omega(n_{0}N^{-2/(r-0.5)})}.$$

Theorem 4.7 provides a tight bound for the smallest eigenvalue of the UNK kernel $K_{\mathrm{UNK}}^{(l)}$, which is closely related to the training convergence of neural networks. This nontrivial estimate reflects the characteristics of the kernel and is usually used as a key assumption for optimization and generalization. The proof of Theorem 4.7 is based on the following inequalities about the smallest eigenvalue of real-valued symmetric square matrices. Given two symmetric matrices $\mathbf{A},\mathbf{B}\in\mathbb{R}^{m\times m}$, it is observed that

$$\left\{\begin{aligned}
&\chi_{\min}(\mathbf{A}\mathbf{B})\geq\chi_{\min}(\mathbf{A})\cdot\min_{i\in[m]}\mathbf{B}(i,i),\\
&\chi_{\min}(\mathbf{A}+\mathbf{B})\geq\chi_{\min}(\mathbf{A})+\chi_{\min}(\mathbf{B}).
\end{aligned}\right. \tag{13}$$

From Eq. (9), we can unfold $K_{\mathrm{UNK}}^{(l)}$ as a sum of covariances of the sequence of random variables $\{\bm{s}^{(l-1)}\}$. Thus, we can bound $\chi_{\min}(K_{\mathrm{UNK}}^{(l)})$ by $\mathrm{Cov}(\bm{s}^{(l-1)},\bm{s}^{(l-1)})$ via a chain of feedforward compositions in Eq. (1). For conciseness, we put the proof of Theorem 4.7 into Appendix E.
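The two eigenvalue inequalities used in this argument can be sanity-checked numerically. The sketch below rests on two assumptions that go beyond the text: both matrices are taken to be positive semi-definite, and the product in the first inequality is read as the elementwise (Hadamard) product, for which the bound is a classical result due to Schur. The second inequality is Weyl's inequality and holds for all symmetric matrices. The helper names `random_psd` and `min_eig` are hypothetical.

```python
import numpy as np

def random_psd(m, rng):
    """Draw a random symmetric positive definite m x m matrix."""
    g = rng.normal(size=(m, m))
    return g @ g.T + 1e-3 * np.eye(m)

def min_eig(mat):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return float(np.linalg.eigvalsh(mat)[0])

rng = np.random.default_rng(0)
A = random_psd(6, rng)
B = random_psd(6, rng)

# Weyl: chi_min(A + B) >= chi_min(A) + chi_min(B)
weyl_gap = min_eig(A + B) - (min_eig(A) + min_eig(B))

# Schur (Hadamard product, PSD case): chi_min(A o B) >= chi_min(A) * min_i B(i, i)
schur_gap = min_eig(A * B) - min_eig(A) * float(np.min(np.diag(B)))

print(f"Weyl gap:  {weyl_gap:.4f}")
print(f"Schur gap: {schur_gap:.4f}")
```

Both gaps are non-negative up to round-off, which is the property the proof relies on when it propagates a lower bound on the smallest eigenvalue through sums and products of covariance matrices.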

Figure 1: The accuracy curves with various multipliers $\lambda\in\{0.001,0.01,0.1,0,1,10\}$, where the x- and y-axes denote the epoch and accuracy, respectively. Training accuracy curves of (a) Baseline $\Theta_{0}$, (b) Baseline $\Theta_{t'}$, and (c) Grid Search. Testing accuracy curves of (e) Baseline $\Theta_{0}$, (f) Baseline $\Theta_{t'}$, and (g) Grid Search. Comparison of (d) training and (h) testing accuracy curves among Baseline $\Theta_{0}$, Grid 0.001, and Grid 0.01.

5 Experiments

In this section, we conduct several experiments to evaluate the effectiveness of the proposed UNK kernel.

5.1 Datasets and Configurations

Following the experimental configurations of Lee et al. (lee2018:NNGP, ), we conduct the empirical evaluations on a two-hidden-layer MLP trained with various $\lambda$. The dataset is the MNIST handwritten digit data, which comprises a training set of 60,000 examples and a testing set of 10,000 examples in 10 classes, where each example is centered in a $28\times 28$ image.

For the classification tasks, the class labels are re-encoded as regression targets, where the correct label is marked as 0.9 and each incorrect one is marked as 0.1 (zhang2022:NNGP, ). Here, we employ 5000 hidden neurons and the softmax activation function. Similar to (arora2019:NNGP, ), all weights are initialized from a Gaussian distribution with mean 0 and variance $0.3/n_{l}$ for $l\in[L]$. We also fix the batch size and the learning rate to 64 and 0.001, respectively. All experiments were conducted on an Intel Core i7-6500U.
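The label encoding and weight initialization described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names `encode_labels` and `init_weights` are hypothetical, $n_{l}$ is interpreted as the fan-in of layer $l$ (an assumption, since the text does not say which width is meant), and small layer widths are used in place of the 5000 hidden neurons to keep the example light.

```python
import numpy as np

def encode_labels(labels, num_classes=10, on_value=0.9, off_value=0.1):
    """Map integer class labels to regression targets: 0.9 for the
    correct class and 0.1 for every other class."""
    labels = np.asarray(labels, dtype=np.int64)
    targets = np.full((labels.shape[0], num_classes), off_value, dtype=np.float64)
    targets[np.arange(labels.shape[0]), labels] = on_value
    return targets

def init_weights(layer_sizes, variance_scale=0.3, seed=0):
    """Draw each weight matrix from N(0, variance_scale / n_l),
    with n_l taken to be the layer's fan-in (assumption)."""
    rng = np.random.default_rng(seed)
    weights = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        std = np.sqrt(variance_scale / fan_in)
        weights.append(rng.normal(0.0, std, size=(fan_in, fan_out)))
    return weights

targets = encode_labels([3, 0, 9])
# Two hidden layers; widths reduced from 5000 for illustration.
weights = init_weights([784, 64, 64, 10])
print(targets[0])
print([w.shape for w in weights])
```

The batch size of 64 and the learning rate of 0.001 would then be supplied to whatever training loop is used. They are omitted here because the text does not specify the optimizer implementation.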

5.2 Experiments for Effects of Various Multipliers $\lambda$

The experiments aim to investigate the effects of various $\lambda$ on the performance of the UNK kernel. According to the recursive formulation of $K_{\mathrm{UNK}}^{(l)}$, it is evident that $\lambda$ balances the gradient and the regularizer. From a theoretical perspective, the absolute value of $\lambda$ indicates not only the limiting convergence rate of $K_{\mathrm{UNK}}^{(l)}$ but also the optimal solution of Eq. (2). Provided $\Theta_{t'}$, we can compute the optimal solution $\lambda^{*}_{t}$ at the current epoch stamp $t$ as follows

$$\lambda^{*}_{t}=\arg\min_{t'}\ \hbar(\Theta_{t+\mathrm{d}t})-\hbar(\Theta_{t}), \tag{14}$$

where $\Theta_{t+\mathrm{d}t}=\Theta_{t}-\mathrm{d}\hbar(\Theta_{t})/\mathrm{d}\Theta_{t}-\lambda_{t'}\Theta_{t'}$. This optimization problem can be solved by mature algorithms, such as Bayesian optimization or grid search. Here, we conjecture that $\lambda^{*}_{t}$ is an effective indicator for identifying the optimal trajectory of the UNK kernel.

Here, we set the investigated values of the multiplier $\lambda$ to $\{0.001,0.01,0.1,0,1,10\}$ and employ three types of studied models as follows

$$\left\{\begin{aligned}
\text{Baseline $\Theta_{0}$}:&\quad\frac{\mathrm{d}\Theta}{\mathrm{d}t}=-\frac{\mathrm{d}\hbar(\Theta_{t})}{\mathrm{d}\Theta_{t}}-\lambda\Theta_{0},\\
\text{Baseline $\Theta_{t'}$}:&\quad\frac{\mathrm{d}\Theta}{\mathrm{d}t}=-\frac{\mathrm{d}\hbar(\Theta_{t})}{\mathrm{d}\Theta_{t}}-\lambda_{t'}\Theta_{t'},\\
\text{Grid Search}:&\quad\frac{\mathrm{d}\Theta}{\mathrm{d}t}=-\frac{\mathrm{d}\hbar(\Theta_{t})}{\mathrm{d}\Theta_{t}}-\lambda^{*}_{t}\Theta_{t-\mathrm{d}t},
\end{aligned}\right.$$

where the optimization problem in Eq. (14) is solved by grid search with granularities of 0.001 and 0.01, denoted as Grid 0.001 and Grid 0.01, respectively.
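The update rules above can be illustrated with an explicit Euler discretization. The following is a minimal sketch under stated assumptions, not the implementation used in the experiments: the quadratic `loss` is a toy stand-in for $\hbar(\Theta)$, the candidate set $\{0, g, 2g, \dots, 10g\}$ for granularity $g$ is our guess (it matches the range of $\lambda^{*}_{t}$ values reported in Table 1), and selecting $\lambda^{*}_{t}$ as the candidate with the lowest loss after one step is a placeholder for the objective of Eq. (14). All function names are hypothetical.

```python
import numpy as np

def loss(theta, A, b):
    """Toy quadratic loss standing in for hbar(Theta) (an assumption)."""
    r = A @ theta - b
    return 0.5 * float(r @ r)

def grad(theta, A, b):
    """Gradient of the toy loss with respect to theta."""
    return A.T @ (A @ theta - b)

def euler_step(theta, anchor, lam, lr, A, b):
    """One Euler step of dTheta/dt = -grad(Theta_t) - lam * anchor."""
    return theta - lr * (grad(theta, A, b) + lam * anchor)

def grid_search_step(theta, theta_prev, granularity, lr, A, b, n_candidates=11):
    """Try lam in {0, g, 2g, ...} with anchor Theta_{t-dt}; keep the lowest loss."""
    candidates = granularity * np.arange(n_candidates)
    trials = [euler_step(theta, theta_prev, lam, lr, A, b) for lam in candidates]
    best = int(np.argmin([loss(t, A, b) for t in trials]))
    return trials[best], float(candidates[best])

def run(rule, steps=50, lr=0.05, lam=0.01, granularity=0.01, seed=0):
    """Return (initial loss, final loss) for rule in {'gd', 'theta0', 'grid'}."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(20, 5)) / np.sqrt(20)
    b = rng.normal(size=20)
    theta0 = rng.normal(size=5)
    theta_prev, theta = theta0.copy(), theta0.copy()
    for _ in range(steps):
        if rule == "gd":          # plain gradient descent (lam = 0)
            new = euler_step(theta, theta, 0.0, lr, A, b)
        elif rule == "theta0":    # Baseline Theta_0: anchor at the initialization
            new = euler_step(theta, theta0, lam, lr, A, b)
        elif rule == "grid":      # Grid Search: anchor at the previous iterate
            new, _ = grid_search_step(theta, theta_prev, granularity, lr, A, b)
        else:
            raise ValueError(f"unknown rule: {rule}")
        theta_prev, theta = theta, new
    return loss(theta0, A, b), loss(theta, A, b)
```

Because $\lambda = 0$ is always among the candidates in this sketch, each grid-search step is at least as good, in terms of the toy loss, as a plain gradient step taken from the same iterate.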

Figure 1 plots the accuracy curves corresponding to various multipliers. We observe that (1) the training algorithms led by Eq. (2) perform comparably to typical gradient descent in various configurations, (2) $\lambda=1$ and $\lambda=10$ are so large that they hamper the performance of the UNK kernel, and (3) Grid 0.01 provides a starting point for higher accuracy and achieves the fastest convergence speed and best accuracy. These observations not only show the effectiveness of our proposed UNK kernel, but also coincide with our theoretical conclusions that the UNK kernel converges to the NNGP kernel as $t\to\infty$ and that a smaller value of $\lambda$ may lead to a larger convergence rate.

In detail, Table 1 lists the optimal trajectory and the corresponding training accuracy of Grid 0.001 and Grid 0.01 over the epochs. It is observed that (1) the optimal trajectory of the UNK kernel and the path of typical gradient descent are not completely consistent, and (2) both Grid 0.001 and Grid 0.01 achieve faster convergence and better accuracy than the baseline methods. These results further demonstrate the effectiveness of our proposed UNK kernel.

| Epoch $t$ | Baseline ACC. | Grid 0.001 $\lambda^{*}_{t}$ | Grid 0.001 ACC. | Grid 0.01 $\lambda^{*}_{t}$ | Grid 0.01 ACC. |
|---|---|---|---|---|---|
| 1 | 0.1289 | 0.0100 | 0.9257 | 0.0800 | 0.9266 |
| 2 | 0.9256 | 0.0020 | 0.9506 | 0.0800 | 0.9521 |
| 3 | 0.9504 | 0.0040 | 0.9631 | 0.0900 | 0.9656 |
| 4 | 0.9629 | 0.0080 | 0.9708 | 0.0700 | 0.9737 |
| 5 | 0.9705 | 0.0070 | 0.9766 | 0.0900 | 0.9793 |
| 6 | 0.9763 | 0.0050 | 0.9802 | 0.1000 | 0.9839 |
| 7 | 0.9800 | 0.0060 | 0.9834 | 0.1000 | 0.9870 |
| 8 | 0.9831 | 0.0000 | 0.9858 | 0.0800 | 0.9899 |
| 9 | 0.9855 | 0.0080 | 0.9879 | 0.0500 | 0.9922 |
| 10 | 0.9875 | 0.0000 | 0.9898 | 0.0900 | 0.9939 |
| 11 | 0.9896 | 0.0000 | 0.9913 | 0.0600 | 0.9952 |
| 12 | 0.9910 | 0.0000 | 0.9923 | 0.0600 | 0.9963 |
| 13 | 0.9922 | 0.0040 | 0.9933 | 0.0700 | 0.9971 |
| 14 | 0.9931 | 0.0020 | 0.9943 | 0.0800 | 0.9977 |
| 15 | 0.9941 | 0.0020 | 0.9952 | 0.0500 | 0.9984 |
| 16 | 0.9949 | 0.0080 | 0.9959 | 0.0700 | 0.9987 |
| 17 | 0.9957 | 0.0060 | 0.9966 | 0.0900 | 0.9992 |
| 18 | 0.9963 | 0.0070 | 0.9972 | 0.0700 | 0.9995 |
| 19 | 0.9969 | 0.0070 | 0.9977 | 0.0000 | 0.9996 |
| 20 | 0.9974 | 0.0100 | 0.9981 | 0.0800 | 0.9998 |
| 21 | 0.9978 | 0.0070 | 0.9984 | 0.0100 | 0.9997 |
| 22 | 0.9982 | 0.0100 | 0.9986 | 0.0200 | 0.9999 |
| 23 | 0.9984 | 0.0050 | 0.9987 | 0.0000 | 0.9999 |
| 24 | 0.9986 | 0.0000 | 0.9989 | 0.0000 | 0.9999 |
| 25 | 0.9988 | 0.0050 | 0.9990 | 0.0000 | 0.9999 |
| 26 | 0.9989 | 0.0030 | 0.9992 | 0.0000 | 1.0000 |

Table 1: Illustration of $\lambda^{*}_{t}$ and the corresponding training accuracy (ACC.) of Grid 0.001 and Grid 0.01 over epoch $t$.
Figure 2: Histograms of training correlation of (a) Grid 0.001 and (c) Grid 0.01, testing correlation of (b) Grid 0.001 and (d) Grid 0.01, where x- and y-axes denote the number of instances and the corresponding correlation, respectively.

5.3 Experiments for the UNK kernel

This experiment investigates the representation ability of our proposed UNK kernel. The indicator is computed as

$$
\gamma^{2}_{i}=\frac{K(T,0,\bm{x}_{i})}{K(0,0,\bm{x}_{i})K(T,T,\bm{x}_{i})}\ ,
$$

where $\bm{x}_{i}$ indicates the $i$-th instance, and $K(T,0,\bm{x}_{i})$ denotes the UNK kernel trained by solving Eq. (14), i.e.,

$$
K(t,t^{\prime},\bm{x}_{i})\triangleq K_{\textrm{UNK}}^{(L)}\left(t,t^{\prime},\bm{s}^{L-1}_{i}(t),\bm{s}^{L-1}_{i}(t^{\prime});\lambda^{*}_{t}\right)\ .
$$

The value of $\gamma_{i}$ measures the correlation between the outputs of the UNK kernel with initialized and optimized parameters. According to the theoretical results in Section 3, the UNK kernel is said to be valid if the kernel outputs brought by initialized and optimized parameters are markedly discriminative. In other words, a valid UNK kernel is able to classify digits well in this experiment, and thus $\gamma_{i}$ should equal $0.1\times 1=0.1$, where the factors 0.1 and 1 denote the accuracy of the UNK kernel with initialized and optimized parameters, respectively. Ideally, the value of $\gamma_{i}$ in this experiment should tend towards 0.1, that is, $\mathbb{E}_{i}(\gamma_{i})=0.1$. If $|\gamma_{i}|$ approaches one, the kernel cannot distinguish between the kernel outputs brought by initialized and optimized parameters, and thus the kernel is invalid.
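For concreteness, the indicator can be evaluated per instance once the three kernel values are available. The sketch below is only illustrative: the arrays `k_t0`, `k_00`, and `k_tt` are hypothetical placeholders for $K(T,0,\bm{x}_{i})$, $K(0,0,\bm{x}_{i})$, and $K(T,T,\bm{x}_{i})$, and how these values are obtained from a trained network is not shown. Taking the non-negative root of $\gamma^{2}_{i}$ (and clipping negative values to zero) is our assumption for reporting an average of $|\gamma_{i}|$.

```python
import numpy as np

def correlation_indicator(k_t0, k_00, k_tt):
    """Return gamma_i^2 = K(T,0,x_i) / (K(0,0,x_i) * K(T,T,x_i)) per instance."""
    k_t0 = np.asarray(k_t0, dtype=float)
    k_00 = np.asarray(k_00, dtype=float)
    k_tt = np.asarray(k_tt, dtype=float)
    return k_t0 / (k_00 * k_tt)

def mean_abs_gamma(gamma_sq):
    """Average |gamma_i|; negative gamma_i^2 values are clipped to 0 (assumption)."""
    return float(np.mean(np.sqrt(np.clip(gamma_sq, 0.0, None))))

# Hypothetical kernel values for two instances, for illustration only.
gamma_sq = correlation_indicator([0.02, 0.08], [1.0, 2.0], [2.0, 4.0])
average = mean_abs_gamma(gamma_sq)
```

An average close to 0.1 would correspond to the ideal case described above, whereas an average close to one would indicate an invalid kernel.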

Figure 2 displays the (training and testing) correlation histograms and their averages for our proposed UNK kernel with grid search granularities of 0.001 and 0.01. The average training correlation values of Grid 0.001 and Grid 0.01 are approximately 0.13 as training accuracy approaches 100%, which implies that the trained UNK kernel is valid for classifying MNIST. This is an encouraging result for the theory and development of neural kernel learning.

Notice that the average training correlation values for Grid 0.001 and Grid 0.01 are not precisely equal to 0.1, and the average testing correlation values for Grid 0.001 and Grid 0.01 are approximately 0.2 instead of the stated value of 0.1. These discrepancies could be attributed to several factors, including gaps between the softmax and labeled vectors and out-of-distribution errors. More detailed experimental results are listed in Appendix F.

6 Conclusions

In this paper, we proposed the UNK kernel, a unified framework for neural network learning that draws upon the learning dynamics associated with gradient descents and parameter initialization. Our investigation explores theoretical aspects, such as the existence, limiting properties, uniform tightness, and learning convergence of the proposed UNK kernel. Our main findings highlight that the UNK kernel exhibits behaviors akin to the NTK kernel with a finite learning step and converges to the NNGP kernel as the learning step approaches infinity. Experimental results further emphasize the effectiveness of our proposed method.

Impact Statements

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • (1) S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32, pages 8141–8150, 2019.
  • (2) S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 322–332, 2019.
  • (3) Y. Avidan, Q. Li, and H. Sompolinsky. Connecting NTK and NNGP: A unified theoretical framework for neural network learning dynamics in the kernel regime. arXiv:2309.04522, 2023.
  • (4) P. Billingsley. Convergence of Probability Measures. John Wiley & Sons, 2013.
  • (5) D. Bracale, S. Favaro, S. Fortini, and S. Peluchetti. Large-width functional asymptotics for deep Gaussian neural networks. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • (6) Y. Cho and L. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems 22, pages 342–350, 2009.
  • (7) S. S. Du, K. Hou, R. R. Salakhutdinov, B. Poczos, R. Wang, and K. Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems 32, pages 5723–5733, 2019.
  • (8) A. Garriga-Alonso, C. Rasmussen, and L. Aitchison. Deep convolutional networks as shallow Gaussian processes. In Proceedings of the 7th International Conference on Learning Representations, 2019.
  • (9) J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak. Infinite attention: NNGP and NTK for deep attention networks. In Proceedings of the 37th International Conference on Machine Learning, pages 4376–4386, 2020.
  • (10) B. Huang, X. Li, Z. Song, and X. Yang. FL-NTK: A neural tangent kernel-based framework for federated learning analysis. In Proceedings of the 38th International Conference on Machine Learning, pages 4423–4434, 2021.
  • (11) A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, pages 8580–8589, 2018.
  • (12) J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as Gaussian processes. In Proceedings of the 6th International Conference on Learning Representations, 2018.
  • (13) J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl-Dickstein. Finite versus infinite neural networks: An empirical study. In Advances in Neural Information Processing Systems 33, pages 15156–15172, 2020.
  • (14) A. Mahankali, J. Z. Haochen, K. Dong, M. Glasgow, and T. Ma. Beyond NTK with vanilla gradient descent: A mean-field analysis of neural networks with polynomial width, samples, and time. arXiv:2306.16361, 2023.
  • (15) S. Malladi, A. Wettig, D. Yu, D. Chen, and S. Arora. A kernel-based view of language model fine-tuning. In Proceedings of the 40th International Conference on Machine Learning, pages 23610–23641, 2023.
  • (16) M. Mézard, G. Parisi, and M. A. Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications. World Scientific Publishing Company, 1987.
  • (17) R. M. Neal. Priors for infinite networks. Bayesian Learning for Neural Networks, pages 29–53, 1996.
  • (18) Q. Nguyen, M. Mondelli, and G. Montufar. Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep ReLU networks. In Proceedings of the 38th International Conference on Machine Learning, pages 8119–8129, 2021.
  • (19) R. Novak, L. Xiao, Y. Bahri, J. Lee, G. Yang, J. Hron, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In Proceedings of the 6th International Conference on Learning Representations, 2018.
  • (20) G. Pang, L. Yang, and G. E. Karniadakis. Neural-net-induced Gaussian process regression for function approximation and PDE solution. Journal of Computational Physics, 384:270–288, 2019.
  • (21) D. S. Park, J. Lee, D. Peng, Y. Cao, and J. Sohl-Dickstein. Towards NNGP-guided neural architecture search. arXiv:2011.06006, 2020.
  • (22) G. Pleiss and J. P. Cunningham. The limitations of large width in neural networks: A deep Gaussian process perspective. In Advances in Neural Information Processing Systems 34, pages 3349–3363, 2021.
  • (23) T. Poggio, A. Banburski, and Q. Liao. Theoretical issues in deep networks. Proceedings of the National Academy of Sciences, 117(48):30039–30045, 2020.
  • (24) H. N. Salas. Gershgorin’s theorem for matrices of operators. Linear Algebra and its Applications, 291(1-3):15–36, 1999.
  • (25) D. Stroock and S. Varadhan. Multidimensional Diffusion Processes. Springer Science & Business Media, 1997.
  • (26) A. W. Van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
  • (27) G. Yang. Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. In Advances in Neural Information Processing Systems 32, pages 9951–9960, 2019.
  • (28) S.-Q. Zhang, F. Wang, and F.-L. Fan. Neural network Gaussian processes by increasing depth. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • (29) S.-Q. Zhang and Z.-H. Zhou. Arise: Aperiodic semi-parametric process for efficient markets without periodogram and Gaussianity assumptions. arXiv:2111.06222, 2021.

Appendix
 
This appendix provides the supplementary materials for our work “A Unified Kernel for Neural Network Learning”, organized according to the corresponding sections therein. We first review the notation. Let $[N]=\{1,2,\dots,N\}$ be an integer set for $N\in\mathbb{N}^{+}$, and let $|\cdot|_{\#}$ denote the number of elements in a collection, e.g., $|[N]|_{\#}=N$. Given two functions $g,h\colon\mathbb{N}^{+}\rightarrow\mathbb{R}$, we write $h=\mathbf{\Theta}(g)$ if there exist positive constants $c_{1},c_{2}$, and $n_{0}$ such that $c_{1}g(n)\leq h(n)\leq c_{2}g(n)$ for every $n\geq n_{0}$; $h=\mathcal{O}(g)$ if there exist positive constants $c$ and $n_{0}$ such that $h(n)\leq cg(n)$ for every $n\geq n_{0}$; $h=\Omega(g)$ if there exist positive constants $c$ and $n_{0}$ such that $h(n)\geq cg(n)$ for every $n\geq n_{0}$. We define the ball $\mathcal{B}(r)=\{\bm{x}\mid\|\bm{x}\|_{2}\leq r\}$ for any $r\in\mathbb{R}^{+}$. Let $\mathbf{I}_{n}$ be the $n\times n$ identity matrix. Let $\|\cdot\|_{p}$ be the $p$-norm of a vector or matrix, with $p=2$ as the default. Given $\bm{x}=(x_{1},\dots,x_{n})$ and $\bm{y}=(y_{1},\dots,y_{n})$, we also define the sup-related measure as $\|\bm{x}-\bm{y}\|_{\alpha}^{\textrm{sup}}=\sup_{i\in[n]}\big|x_{i}-y_{i}\big|^{\alpha}$ for $\alpha>0$.
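As a small illustration of two of the definitions above, the sup-related measure and the ball $\mathcal{B}(r)$ can be evaluated numerically. The function names below are ours and serve only as a sketch.

```python
import numpy as np

def sup_measure(x, y, alpha):
    """Sup-related measure: max over i of |x_i - y_i|**alpha, for alpha > 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.max(np.abs(x - y) ** alpha))

def in_ball(x, r):
    """Membership test for B(r) = {x : ||x||_2 <= r}."""
    return bool(np.linalg.norm(np.asarray(x, dtype=float), ord=2) <= r)
```

For finite-dimensional vectors the supremum is attained, so it is computed here as a maximum.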

Let $\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})$ be the space of continuous functions, where $n_{0},n\in\mathbb{N}$. Given a linear and bounded functional $\mathcal{F}:\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})\to\mathbb{R}$ and a function $f\in\mathcal{C}(\mathbb{R}^{n_{0}};\mathbb{R}^{n})$ which satisfies $f(\bm{x})\overset{\mathrm{d}}{\to}f^{*}$, we have $\mathcal{F}(f(\bm{x}))\overset{\mathrm{d}}{\to}\mathcal{F}(f^{*})$ and $\mathbb{E}\left[\mathcal{F}(f(\bm{x}))\right]\to\mathbb{E}\left[\mathcal{F}(f^{*})\right]$ according to the General Transformation Theorem [26, Theorem 2.3] and Uniform Integrability [4], respectively.

Throughout this paper, we use the specific symbol $K$ to denote the concerned kernel for neural network learning. The superscript $(l)$ and stamp $t$ are used for recording the indexes of hidden layers and training epochs, respectively. We denote the Gaussian distribution by $\mathcal{N}(\mu_{x},\sigma_{x}^{2})$, where $\mu_{x}$ and $\sigma_{x}^{2}$ indicate the mean and variance, respectively. In general, we employ $\mathbb{E}(\cdot)$ and $\mathrm{Var}(\cdot)$ to denote the expectation and variance, respectively.

Appendix A Theoretical Derivations of NNGP and NTK

A.1 NNGP and NTK

Here, we consider an $L$-hidden-layer fully-connected neural network, where $n_{l}$ and $n_{0}$ indicate the number of neurons in the $l$-th hidden layer for $l\in[L]$ and in the input layer, respectively, as follows

$$
\left\{\begin{aligned}
\bm{s}^{(0)}&=\bm{x}\ ,\\
\bm{h}^{(l)}&=\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\ ,\quad l\in[L]\ ,\\
\bm{s}^{(l)}&=\phi(\bm{h}^{(l)})\ ,\quad l\in[L]\ ,\\
\bm{y}&=\bm{s}^{(L)}\ ,
\end{aligned}\right.
$$

in which $\bm{x}\in\mathbb{R}^{n_{0}}$ and $\bm{y}\in\mathbb{R}^{n_{L}}$ indicate the input and output variables, respectively, $\bm{h}^{(l)}\in\mathbb{R}^{n_{l}}$ and $\bm{s}^{(l)}\in\mathbb{R}^{n_{l}}$ denote the pre-synaptic and post-synaptic variables of the $l$-th hidden layer, respectively, $\mathbf{W}^{(l)}\in\mathbb{R}^{n_{l}\times n_{l-1}}$ and $\bm{b}^{(l)}\in\mathbb{R}^{n_{l}}$ are the connection weights and bias, respectively, and $\phi$ is an element-wise activation function. For convenience, we denote the parameter variables at the $t$-th epoch by $\Theta^{(l)}(t)=[\mathbf{W}^{(l)},\bm{b}^{(l)}]$, and $\Theta^{(l)}(0)$ denotes the initialized parameters, each element of which obeys the Gaussian distribution $\mathcal{N}(0,\sigma^{2}/n_{l})$.
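The network definition and initialization above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `init_params` and `forward` are ours, $\tanh$ is an arbitrary choice for the activation $\phi$, and the activation is applied at every layer including the last, as in the recursion above.

```python
import numpy as np

def init_params(widths, sigma, rng):
    """Sample Theta^(l)(0) for l = 1..L, given widths = [n_0, n_1, ..., n_L].

    Every entry of W^(l) and b^(l) is drawn from N(0, sigma^2 / n_l).
    """
    params = []
    for l in range(1, len(widths)):
        std = sigma / np.sqrt(widths[l])
        W = rng.normal(0.0, std, size=(widths[l], widths[l - 1]))
        b = rng.normal(0.0, std, size=widths[l])
        params.append((W, b))
    return params

def forward(x, params, phi=np.tanh):
    """Compute y = s^(L) via h^(l) = W^(l) s^(l-1) + b^(l) and s^(l) = phi(h^(l))."""
    s = np.asarray(x, dtype=float)
    for W, b in params:
        h = W @ s + b   # pre-synaptic variable h^(l)
        s = phi(h)      # post-synaptic variable s^(l)
    return s

rng = np.random.default_rng(0)
params = init_params([4, 16, 16, 3], sigma=1.0, rng=rng)
y = forward(rng.normal(size=4), params)
```

Here the input is drawn from a standard Gaussian purely for demonstration; any vector in $\mathbb{R}^{n_{0}}$ can be passed to `forward`.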

Neural Network Gaussian Process (NNGP). For any $l\in[L]$, there is a claim that the conditional variable $\bm{h}^{(l)}\mid\bm{s}^{(l-1)}$ obeys the Gaussian distribution. In detail, one has

$$
\begin{aligned}
\textrm{Var}\left(\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\right)&=\textrm{Var}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)^{2}-\left[\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)\right]^{2}\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}\right)^{2}+2\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}\cdot\bm{b}^{(l)}\right)+\mathbb{E}\left(\bm{b}^{(l)}\right)^{2}-\left[\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}\right)\right]^{2}\\
&\quad-2\mathbb{E}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}\right)\cdot\mathbb{E}\left(\bm{b}^{(l)}\right)-\left[\mathbb{E}\left(\bm{b}^{(l)}\right)\right]^{2}\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\right)^{2}\mathbb{E}\left(\bm{s}^{(l-1)}\right)^{2}+\mathbb{E}\left(\bm{b}^{(l)}\right)^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\bm{s}^{(l-1)}\right)^{2}+\textrm{Var}\left(\bm{b}^{(l)}\right)\ ,
\end{aligned}
$$

where $\cdot^{2}$ and $\cdot$ denote the dot product, and the fourth equality holds because $\mathbb{E}(\mathbf{W}^{(l)})=\mathbf{0}$, $\mathbb{E}(\bm{b}^{(l)})=\bm{0}$, and the elements of $\mathbf{W}^{(l)}$ and $\bm{b}^{(l)}$ are mutually independent. Given $\bm{x}\sim\mathcal{N}(\bm{0},\mathbf{I}_{n_{0}})$, it is reasonable to assume, by the principle of mathematical induction, that $\bm{s}^{(l-1)}\sim\mathcal{N}(\bm{0},\mathbf{I}_{n_{l-1}}/C_{\phi})$, where

Cϕ=1𝔼z𝒩(0,1)(ϕ(z))2.subscript𝐶italic-ϕ1subscript𝔼similar-to𝑧𝒩01superscriptitalic-ϕ𝑧2C_{\phi}=\frac{1}{\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left(\phi(z)\right)^{2}}\ .italic_C start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG blackboard_E start_POSTSUBSCRIPT italic_z ∼ caligraphic_N ( 0 , 1 ) end_POSTSUBSCRIPT ( italic_ϕ ( italic_z ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG .

Hence, one has

\[
\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}}\left(\frac{1}{C_{\phi}}+1\right)\mathbf{I}_{n_{l}}\right)\ .
\]

Moreover, the NNGP kernel is defined by

\[
K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\mathbb{E}\left\langle\bm{h}^{(l)}\mid\bm{s}^{(l-1)},\bm{h}^{(l)}\mid\bm{s}^{\prime(l-1)}\right\rangle=\sigma^{2}\,\mathbb{E}\left\langle\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right\rangle+\sigma^{2}
\]

with

\[
\lim_{n_{l-1}\to\infty}\mathbb{E}\left\langle\bm{h}^{(l)}\mid\bm{s}^{(l-1)},\bm{h}^{(l)}\mid\bm{s}^{(l-1)}\right\rangle=\sigma^{2}\left(\frac{1}{C_{\phi}}+1\right)\ .
\]

In summary, the NNGP kernel takes the following recursive form

\[
K_{\textrm{NNGP}}^{(l)}\left(\bm{s},\bm{s}^{\prime}\right)=\sigma^{2}\,\mathbb{E}_{\bm{s}\sim\mathcal{N}(\bm{0},K_{\textrm{NNGP}}^{(l-1)})}\left\langle\bm{s},\bm{s}^{\prime}\right\rangle+\sigma^{2}\ .
\]
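As an illustration of the constants above, the following is a minimal numerical sketch, not part of the original derivation. It estimates $C_{\phi}$ by Gauss-Hermite quadrature and evaluates the limiting diagonal value $\sigma^{2}(1/C_{\phi}+1)$. The function names, the choice $\sigma^{2}=1$, and the ReLU and tanh activations are assumptions made only for this example.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def gaussian_expectation(fn, degree=80):
    """Approximate E[fn(z)] for z ~ N(0, 1) by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(degree)  # weight function exp(-z^2 / 2)
    return float(np.sum(weights * fn(nodes)) / np.sqrt(2.0 * np.pi))


def c_phi(phi):
    """C_phi = 1 / E[phi(z)^2], following the definition in the text."""
    return 1.0 / gaussian_expectation(lambda z: phi(z) ** 2)


def limiting_variance(phi, sigma2=1.0):
    """Limiting diagonal value sigma^2 * (1 / C_phi + 1)."""
    return sigma2 * (1.0 / c_phi(phi) + 1.0)


def relu(z):
    return np.maximum(z, 0.0)


# Illustrative activations (assumed here, not prescribed by the text).
c_relu = c_phi(relu)
c_tanh = c_phi(np.tanh)
var_relu = limiting_variance(relu, sigma2=1.0)
print(c_relu, c_tanh, var_relu)
```

For ReLU, $\mathbb{E}\,\phi(z)^{2}=1/2$, so $C_{\phi}=2$, which gives a quick sanity check of the quadrature.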

Neural Tangent Kernel (NTK). Training the concerned ANNs consists in optimizing $\bm{y}=f(\bm{x};\Theta)$ in the function space, supervised by a functional loss $\hbar(\Theta)$, such as the square or cross-entropy loss. Here, $\Theta$ denotes the variable of any parameter, which evolves according to

\[
\frac{\mathrm{d}\Theta}{\mathrm{d}t}=-\frac{\mathrm{d}\hbar(\Theta)}{\mathrm{d}\Theta}=-\frac{\mathrm{d}\hbar(\Theta)}{\mathrm{d}f(\bm{x};\Theta)}\frac{\mathrm{d}f(\bm{x};\Theta)}{\mathrm{d}\Theta}\ .
\]

The loss $\hbar(\Theta)$ is monotonically non-increasing in the training epoch $t$ since

\[
\frac{\partial\hbar(\Theta)}{\partial t}=\frac{\partial\hbar(\Theta)}{\partial\Theta}\frac{\partial\Theta}{\partial t}=-\nabla_{\Theta}\hbar(\Theta)\cdot\nabla_{\Theta}\hbar(\Theta)=-\|\nabla_{\Theta}\hbar(\Theta)\|^{2}\leq 0\ .
\]
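The monotonicity argument can be checked numerically with a small sketch. It assumes a least-squares loss on synthetic data and replaces the continuous-time flow by explicit Euler steps of size `dt`; the data, the loss, and the step size are illustrative choices only, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # hypothetical inputs
y = rng.normal(size=32)        # hypothetical targets
theta = rng.normal(size=4)     # parameter vector Theta


def loss(theta):
    """Mean squared-error loss, standing in for hbar(Theta)."""
    residual = X @ theta - y
    return 0.5 * float(residual @ residual) / len(y)


def grad(theta):
    """Gradient of the loss with respect to Theta."""
    return X.T @ (X @ theta - y) / len(y)


dt = 1e-2  # small step approximating dTheta/dt = -grad
losses = [loss(theta)]
for _ in range(200):
    theta = theta - dt * grad(theta)
    losses.append(loss(theta))

# The recorded losses should never increase along the trajectory.
is_non_increasing = all(b <= a + 1e-12 for a, b in zip(losses, losses[1:]))
print(is_non_increasing, losses[0], losses[-1])
```

With a finite step the guarantee holds only when `dt` is small relative to the curvature of the loss, which is why a conservative value is used here.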

For any $l\geq 2$, there is a claim that the gradient variable vector $\bm{h}^{(l)}\mid\bm{s}^{(l-1)}$ obeys the Gaussian distribution. In detail, for $i,j\in\mathbb{N}^{+}$, one has

\[
\begin{aligned}
\textrm{Var}\left(\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)
&=\textrm{Var}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)^{2}-\left[\mathbb{E}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)\right]^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\frac{\partial\bm{s}^{(l-1)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right)^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\right)^{2}\textrm{Var}\left(\bm{s}^{(l-2)}\right)
\end{aligned}
\]

and

\[
\begin{aligned}
\textrm{Var}\left(\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}}\right)
&=\textrm{Var}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{b}_{i}^{(l-1)}}\right)\\
&=\mathbb{E}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{b}_{i}^{(l-1)}}\right)^{2}-\left[\mathbb{E}\left(\mathbf{W}^{(l)}\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{b}_{i}^{(l-1)}}\right)\right]^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{b}_{i}^{(l-1)}}\right)^{2}\\
&=\textrm{Var}\left(\mathbf{W}^{(l)}\right)\mathbb{E}\left(\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\right)^{2}\ ,
\end{aligned}
\]

where ${\partial\bm{s}^{(l-1)}}/{\partial\bm{h}^{(l-1)}}$ denotes the dot operation. Hence, one has

\[
\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}C^{\prime}_{\phi}C_{\phi}}\mathbf{I}_{n_{l-1}}\right)\quad\text{and}\quad\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}}\sim\mathcal{N}\left(\bm{0},\frac{\sigma^{2}}{n_{l-1}C^{\prime}_{\phi}}\mathbf{I}_{n_{l-1}}\right)\ ,
\]

where

\[
C^{\prime}_{\phi}=\frac{1}{\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left[\phi^{\prime}(z)\right]^{2}}\ .
\]

Moreover, the NTK kernel is defined by

\[
K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=K_{\textrm{NTK}}^{(l-1)}\left(\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\right\rangle+K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,
\]

for $l\geq 2$ and

\[
K_{\textrm{NTK}}^{(1)}\left(\bm{x},\bm{x}^{\prime}\right)=K_{\textrm{NNGP}}^{(1)}\left(\bm{x},\bm{x}^{\prime}\right)\ ,
\]

provided

\[
\lim_{n_{l-1}\to\infty}\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}},\frac{\partial\bm{h}^{(l)}}{\partial\mathbf{W}_{ij}^{(l-1)}}\right\rangle=\frac{\sigma^{2}}{C^{\prime}_{\phi}C_{\phi}}\quad\text{and}\quad\lim_{n_{l-1}\to\infty}\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}},\frac{\partial\bm{h}^{(l)}}{\partial\bm{b}_{i}^{(l-1)}}\right\rangle=\frac{\sigma^{2}}{C^{\prime}_{\phi}}\ .
\]
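To make the layer-wise structure of the NTK recursion concrete, the sketch below iterates it on scalar kernel values. It assumes that the per-layer NNGP values and the derivative expectations $\mathbb{E}\langle\partial\bm{s}/\partial\bm{h},\partial\bm{s}^{\prime}/\partial\bm{h}^{\prime}\rangle$ are already available as plain numbers; the function name `ntk_recursion` and the numeric inputs are hypothetical placeholders rather than quantities computed in the paper.

```python
def ntk_recursion(nngp_values, derivative_expectations):
    """Iterate K_NTK^(l) = K_NTK^(l-1) * E<ds/dh, ds'/dh'> + K_NNGP^(l).

    nngp_values[k] holds K_NNGP^(k+1) for layers 1..L.
    derivative_expectations[k] holds the expectation paired with layer k+2,
    i.e. the one taken over layer k+1 activations.
    """
    if len(derivative_expectations) != len(nngp_values) - 1:
        raise ValueError("need one derivative expectation per layer l >= 2")
    ntk_values = [nngp_values[0]]  # base case: K_NTK^(1) = K_NNGP^(1)
    for k_nngp, d_exp in zip(nngp_values[1:], derivative_expectations):
        ntk_values.append(ntk_values[-1] * d_exp + k_nngp)
    return ntk_values


# Placeholder numbers for a three-layer example (assumed, for illustration).
ntk = ntk_recursion(nngp_values=[1.0, 1.5, 1.5],
                    derivative_expectations=[0.5, 0.5])
print(ntk)
```

The first entry equals the first-layer NNGP value, and each later entry adds the current NNGP value to the previous NTK value scaled by the derivative expectation.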

Appendix B Full Proof of Theorem 1 and Theorem 2

All statistics of post-synaptic variables $\bm{s}$ can be calculated via the moment generating function

\[
\mathcal{M}_{\bm{s}}(t)=\int\mathrm{e}^{t\bm{s}}f(\bm{s})\,\mathrm{d}\bm{s}\ .
\]

Here, we focus on the second moment of $s=\bm{s}^{(l)}_{i}$ for $l\in[L]$ and $i\in[n_{l}]$, that is,

\[
m_{2}(s,t)=\int\frac{t^{2}s^{2}}{2!}\,f(s)\,\mathrm{d}s=\int\frac{t^{2}s^{2}(\Theta)}{2!}\,f_{\Theta}(\Theta)\,\frac{\mathrm{d}s(\Theta)}{\mathrm{d}\Theta}\,\mathrm{d}\Theta\ .
\]

In the above equations, $s$ and $\Theta$ denote the variables of hidden states and parameters, respectively. Let $f_{\Theta_{t}}(\cdot)$ denote the probability density function of $\Theta_{t}$. According to the formulation of $m_{2}(s,t)$, we need to compute the probability density function $f_{\Theta}(\Theta)$. For convenience, we abbreviate $\Theta(t)$ as $\Theta_{t}$ throughout this proof.
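As a small sanity check of the second-order term, the sketch below evaluates $m_{2}(s,t)$ for a zero-mean Gaussian hidden state, in which case it should equal $t^{2}\sigma^{2}/2$. The Gaussian density for $s$, the function name, and the numeric values of $t$ and $\sigma$ are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def second_moment_term(t, sigma, degree=40):
    """Evaluate the integral of t^2 s^2 / 2! against an N(0, sigma^2) density."""
    nodes, weights = hermegauss(degree)            # weight exp(-z^2 / 2)
    density_weights = weights / np.sqrt(2.0 * np.pi)
    s = sigma * nodes                              # change of variables s = sigma * z
    return float(np.sum(density_weights * (t ** 2) * s ** 2 / 2.0))


t, sigma = 0.3, 2.0                                # illustrative values
m2 = second_moment_term(t, sigma)
closed_form = 0.5 * t ** 2 * sigma ** 2            # t^2 * E[s^2] / 2
print(m2, closed_form)
```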

According to the introduction in Section 3, Eq. (7) has a general updating formulation, taking Eq. (4) as a special case of $t^{\prime}=0$. Hence, we here take a general formula as follows

\[
\Theta_{t+\mathrm{d}t}=\Theta_{t}-\frac{\mathrm{d}\hbar(\Theta_{t})}{\mathrm{d}\Theta_{t}}-\lambda\Theta_{t^{\prime}}\ ,
\]

where $\mathrm{d}t$ denotes the infinitesimal epoch increment. Here, we omit the learning rate for simplicity. Thus, we have

\[
f_{\Theta_{t+\mathrm{d}t}}(u)=\iiint\delta(v)\,f_{\Theta_{t}}(x)\,f_{\nabla_{t}}(y)\,f_{\Theta_{0}}(z)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}z
\]

with

\[
\left\{\ \begin{aligned}
f_{\Theta_{t}}(x)&=\frac{1}{\sigma_{t}\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2\sigma_{t}^{2}}\right)\\
f_{\nabla_{t}}(y)&=\frac{1}{\sigma_{y}\sqrt{2\pi}}\exp\left(-\frac{y^{2}}{2\sigma_{y}^{2}}\right)\\
f_{\Theta_{0}}(z)&=\frac{1}{\sigma_{z}\sqrt{2\pi}}\exp\left(-\frac{z^{2}}{2\sigma_{z}^{2}}\right)
\end{aligned}\right.
\]

where $v=u-x+y+\lambda z$, $\nabla_{t}={\mathrm{d}\hbar(\Theta_{t})}/{\mathrm{d}\Theta_{t}}$, and $\delta(\cdot)$ indicates the Dirac delta function. Besides, one has

\[
\begin{aligned}
\mathrm{Var}\left(\Theta_{t+\mathrm{d}t}\right)
&=\mathrm{Var}\left(\Theta_{t}-\nabla_{t}-\lambda\Theta_{t^{\prime}}\right)\\
&=\mathbb{E}\left(\Theta_{t}-\nabla_{t}-\lambda\Theta_{t^{\prime}}\right)^{2}-\left[\mathbb{E}\left(\Theta_{t}-\nabla_{t}-\lambda\Theta_{t^{\prime}}\right)\right]^{2}\\
&=\mathrm{Var}\left(\Theta_{t}-\nabla_{t}\right)+\lambda^{2}\mathrm{Var}\left(\Theta_{t^{\prime}}\right)+2\left[\mathbb{E}\left(\Theta_{t}-\nabla_{t}\right)\mathbb{E}\left(\lambda\Theta_{t^{\prime}}\right)-\mathbb{E}\left((\Theta_{t}-\nabla_{t})\lambda\Theta_{t^{\prime}}\right)\right]\ .
\end{aligned}
\]
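The variance decomposition can be illustrated with a short Monte Carlo sketch. It assumes independent zero-mean Gaussian samples for $\Theta_{t}$, $\nabla_{t}$, and $\Theta_{t^{\prime}}$ (with $t^{\prime}=0$), matching the product form of the densities; the standard deviations and $\lambda$ are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 200_000, 0.1
sigma_t, sigma_y, sigma_z = 1.0, 0.5, 2.0   # hypothetical standard deviations

x = rng.normal(0.0, sigma_t, n)   # samples of Theta_t
y = rng.normal(0.0, sigma_y, n)   # samples of the gradient term nabla_t
z = rng.normal(0.0, sigma_z, n)   # samples of Theta_{t'} (here Theta_0)

# The Dirac delta with v = u - x + y + lam * z enforces u = x - y - lam * z.
u = x - y - lam * z

a = x - y                          # Theta_t - nabla_t
b = lam * z                        # lam * Theta_{t'}
lhs = np.var(u)
rhs = np.var(a) + lam ** 2 * np.var(z) + 2.0 * (a.mean() * b.mean() - (a * b).mean())

# Under independence the cross term vanishes in expectation.
independent_case = sigma_t ** 2 + sigma_y ** 2 + lam ** 2 * sigma_z ** 2
print(lhs, rhs, independent_case)
```

The identity between `lhs` and `rhs` holds exactly for sample moments, whereas agreement with `independent_case` is only approximate and relies on the independence assumption.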

Notice that $\Theta_{t}-\nabla_{t}$ is almost independent of $\Theta_{t'}$ as $t\to\infty$. It is observed that $\mathrm{Var}(\Theta_{t+\mathrm{d}t})$ converges as $n\to\infty$ and $t\to\infty$. Thus, the sequence $\{\mathrm{Var}(\Theta_{t})\}_{t}$ is bounded. Here, we define $\sigma_{t}^{2}$ such that

$$\mathrm{Var}(\Theta_{t})\leq\sigma_{t}^{2} .$$

Throughout this proof, we make the mild assumption that $\sigma^{2}=\max_{t}\sigma_{t}^{2}=\min_{t}\sigma_{t}^{2}$ for simplicity; otherwise, we employ $\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}$ in place of $\sigma^{2}$, where $\rho_{t,t'}$ denotes the correlation coefficient between the hidden-state variables $\Theta_{t}$ and $\Theta_{t'}$.

Moreover, we have

$$f_{\Theta_{t+\mathrm{d}t}}(u)=\iiint\delta(v)\,f_{\Theta_{t}}(x)\,f_{\nabla_{t}}(y)\,f_{\Theta_{0}}(z)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}z=\iint_{x,y}f_{\Theta_{t}}(x)\,f_{\nabla_{t}}(y)\,\mathrm{d}x\,\mathrm{d}y\int_{\Omega_{z}}f_{\Theta_{0}}(z)\,\mathrm{d}z ,$$

where $\Omega_{z}=\{(x,y)\mid(-u+x-y)/\lambda=0\}$. Thus, we can conjecture that $\Theta_{t+\mathrm{d}t}$ obeys a Gaussian distribution with zero mean. Suppose that $\Theta_{t+\mathrm{d}t}\sim\mathcal{N}(0,\sigma_{t+\mathrm{d}t}^{2})$ and

$$f_{\Theta_{t+\mathrm{d}t}}(x)=\frac{1}{\sigma_{t+\mathrm{d}t}\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2\sigma_{t+\mathrm{d}t}^{2}}\right) .$$
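As a plausibility check on the Gaussian form (a sketch under extra assumptions, not a step of the original argument), write $A=\Theta_{t}-\nabla_{t}$ and let $B$ denote the term multiplied by $\lambda$, so that $\Theta_{t+\mathrm{d}t}=A-\lambda B$. The symbols $A$, $B$, $a$, $b$ and $\omega$ are introduced here for illustration only. If one assumes that $A\sim\mathcal{N}(0,a^{2})$ and $B\sim\mathcal{N}(0,b^{2})$ are exactly independent, the characteristic function factorizes:

$$\mathbb{E}\,e^{\mathrm{i}\omega(A-\lambda B)}=\mathbb{E}\,e^{\mathrm{i}\omega A}\;\mathbb{E}\,e^{-\mathrm{i}\omega\lambda B}=\exp\left(-\frac{a^{2}\omega^{2}}{2}\right)\exp\left(-\frac{\lambda^{2}b^{2}\omega^{2}}{2}\right)=\exp\left(-\frac{(a^{2}+\lambda^{2}b^{2})\,\omega^{2}}{2}\right) ,$$

which is the characteristic function of $\mathcal{N}(0,a^{2}+\lambda^{2}b^{2})$. Under these assumptions, the zero-mean Gaussian density holds with $\sigma_{t+\mathrm{d}t}^{2}=a^{2}+\lambda^{2}b^{2}$.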

Thus, we have

$$\begin{aligned}
m_{2}(\Theta,t) &= \int\frac{t^{2}s^{2}(\Theta)}{2!}\,f_{\Theta}(\Theta)\,\frac{\mathrm{d}s(\Theta)}{\mathrm{d}\Theta}\,\mathrm{d}\Theta \\
&= \int\frac{t^{2}s^{2}(\Theta)}{2!}\,\frac{1}{\sigma_{t+\mathrm{d}t}\sqrt{2\pi}}\exp\left(-\frac{\Theta^{2}}{2\sigma_{t+\mathrm{d}t}^{2}}\right)\frac{\mathrm{d}s(\Theta)}{\mathrm{d}\Theta}\,\mathrm{d}\Theta \\
&= \int\frac{t^{2}}{2!}\,\phi^{2}(h(\Theta))\,\frac{1}{\sigma_{t+\mathrm{d}t}\sqrt{2\pi}}\exp\left(-\frac{\Theta^{2}}{2\sigma_{t+\mathrm{d}t}^{2}}\right)\frac{\mathrm{d}\phi(h(\Theta))}{\mathrm{d}\Theta}\,\mathrm{d}\Theta ,
\end{aligned}$$

where $h(\cdot)$ corresponds to $\bm{h}_{i}^{(l)}(\cdot)$. The above equation can be written out in vectorized form: given $s=\bm{s}^{(l)}$ and $h=\bm{h}^{(l)}$, one has

$$m_{2}\left(\mathbf{W}^{(l)},t\right)=\int\frac{t^{2}}{2!}\,\phi^{2}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)\frac{1}{\sqrt{2\pi|\mathbf{\Sigma}_{t}|}}\exp\left(-\frac{\mathbf{W}^{(l)}.^{2}\;\mathbf{\Sigma}_{t}^{-1}}{2}\right)\frac{\mathrm{d}\phi(\bm{h}^{(l)})}{\mathrm{d}\bm{h}^{(l)}}\,\bm{s}^{(l-1)}\,\mathrm{d}\mathbf{W}^{(l)} ,$$
$$m_{2}\left(\bm{b}^{(l)},t\right)=\int\frac{t^{2}}{2!}\,\phi^{2}\left(\mathbf{W}^{(l)}\bm{s}^{(l-1)}+\bm{b}^{(l)}\right)\frac{1}{\sqrt{2\pi|\mathbf{\Sigma}_{t}|}}\exp\left(-\frac{\bm{b}^{(l)}.^{2}\;\mathbf{\Sigma}_{t}^{-1}}{2}\right)\frac{\mathrm{d}\phi(\bm{h}^{(l)})}{\mathrm{d}\bm{h}^{(l)}}\,\bm{1}_{n_{l}\times 1}\,\mathrm{d}\bm{b}^{(l)} ,$$

and

$$m_{2}\left(\Theta,t\right)=\int\frac{t^{2}}{2!}\,\phi^{2}\left(\bm{h}^{(l)}(\Theta)\right)\frac{1}{\sigma_{t}\sqrt{2\pi}}\exp\left(-\frac{\Theta^{2}}{2\sigma_{t}^{2}}\right)\frac{\mathrm{d}\phi(\bm{h}^{(l)}(\Theta))}{\mathrm{d}\bm{h}^{(l)}(\Theta)}\,\mathbf{W}^{(l)}\,\frac{\mathrm{d}\bm{s}^{(l-1)}(\Theta)}{\mathrm{d}\Theta}\,\mathrm{d}\Theta ,\quad\textrm{otherwise} ,$$

where $\mathbf{\Sigma}_{t}$ indicates the corresponding variance matrix. Furthermore, given two time stamps $t$ and $t+\mathrm{d}t$, we have

$$\begin{aligned}
\mathbb{E}\left\langle\Theta_{t+\mathrm{d}t},\Theta_{t}\right\rangle &= m_{2}(\Theta_{t+\mathrm{d}t},\Theta_{t},t+\mathrm{d}t,t) \\
&= \iint\frac{t(t+\mathrm{d}t)}{2!}\,\Delta\left(\Theta_{t+\mathrm{d}t},\Theta_{t},t+\mathrm{d}t,t\right)f_{\Theta_{t+\mathrm{d}t},\Theta_{t}}\left(\Theta_{t+\mathrm{d}t},\Theta_{t}\right)\mathrm{d}\Theta_{t+\mathrm{d}t}\,\mathrm{d}\Theta_{t} ,
\end{aligned}$$

where

$$\Delta\left(\Theta_{t+\mathrm{d}t},\Theta_{t},t+\mathrm{d}t,t\right)=\phi\left(\bm{h}^{(l)}(\Theta_{t+\mathrm{d}t})\right)\cdot\phi\left(\bm{h}^{\prime(l)}(\Theta_{t})\right)\cdot\frac{\mathrm{d}\phi\left(\bm{h}^{(l)}(\Theta_{t+\mathrm{d}t})\right)}{\mathrm{d}\Theta_{t+\mathrm{d}t}}\cdot\frac{\mathrm{d}\phi\left(\bm{h}^{\prime(l)}(\Theta_{t})\right)}{\mathrm{d}\Theta_{t}}$$

and

$$f_{\Theta_{t+\mathrm{d}t},\Theta_{t}}\left(\Theta_{t+\mathrm{d}t},\Theta_{t}\right)=\frac{1}{2\pi\sqrt{1-\rho_{t+\mathrm{d}t,t}^{2}}}\exp\left[\frac{-1}{2(1-\rho_{t+\mathrm{d}t,t}^{2})}\left(\frac{\Theta_{t+\mathrm{d}t}}{\sigma_{t+\mathrm{d}t}}-\rho_{t+\mathrm{d}t,t}\frac{\Theta_{t}}{\sigma_{t}}\right)^{2}\right] ,$$

in which $\rho_{t+\mathrm{d}t,t}$ denotes the correlation coefficient between $\Theta_{t+\mathrm{d}t}$ and $\Theta_{t}$. The second moment has thus been written as a general formula, which can be evaluated by established statistical methods, such as the replica calculation [16].
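For a single scalar parameter, the one-time second moment $m_{2}(\Theta,t)$ can also be approximated by plain numerical quadrature. The sketch below is an illustration only and is not the replica calculation: it uses the trapezoidal rule and assumes a hypothetical affine pre-activation $h(\Theta)=s\,\Theta+b$. The function names, the truncation width, and the grid size are choices made here, not taken from the paper.

```python
import math

def gaussian_pdf(theta, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-theta * theta / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def second_moment(t, sigma, phi, dphi, s=1.0, b=0.0, num_points=20001, width=10.0):
    """Trapezoidal estimate of
        int (t^2 / 2!) phi(h)^2 * pdf(theta) * d phi(h) / d theta  d theta,
    with the assumed pre-activation h(theta) = s * theta + b."""
    lo, hi = -width * sigma, width * sigma      # truncate the Gaussian tails
    step = (hi - lo) / (num_points - 1)
    total = 0.0
    for k in range(num_points):
        theta = lo + k * step
        h = s * theta + b
        # Chain rule: d phi(h(theta)) / d theta = phi'(h) * s.
        integrand = 0.5 * t * t * phi(h) ** 2 * gaussian_pdf(theta, sigma) * dphi(h) * s
        weight = 0.5 if k in (0, num_points - 1) else 1.0
        total += weight * integrand
    return total * step

# Sanity case: identity activation, where the integral reduces to t^2 * sigma^2 / 2.
identity_value = second_moment(t=2.0, sigma=1.5, phi=lambda h: h, dphi=lambda h: 1.0)

# Illustrative nonlinear case with tanh.
tanh_value = second_moment(
    t=2.0, sigma=1.5, phi=math.tanh, dphi=lambda h: 1.0 - math.tanh(h) ** 2
)
print(identity_value, tanh_value)
```

The identity-activation case serves as a check because the integrand then reduces to $\tfrac{t^{2}}{2}\Theta^{2}f_{\Theta}(\Theta)$, whose integral is $t^{2}\sigma^{2}/2$.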

By direct calculations, we can obtain the concerned kernel

$$K_{\mathrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\exp\left(\frac{(t'-t)\,|\lambda|}{\sigma^{2}}\right)\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}(\Theta_{t})}{\partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}(\Theta_{t'})}{\partial\Theta_{t'}}\right\rangle ,$$

or

$$K_{\mathrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=\exp\left(\frac{(t'-t)\,|\lambda|}{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}\right)\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}(\Theta_{t})}{\partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}(\Theta_{t'})}{\partial\Theta_{t'}}\right\rangle ,$$

for $\sigma^{2}\neq\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}$. Here, $\bm{s}^{(l-1)}$ and $\bm{s}^{\prime(l-1)}$ are the variables induced by $\Theta_{t}$ and $\Theta_{t'}$, respectively. Similar to the NNGP and NTK kernels, the unified kernel also admits a recursive form:

$$\begin{aligned}
K_{\mathrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)={}& K_{\mathrm{UNK}}^{(l-1)}\left(t,t',\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\Big|_{\Theta_{t}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\Big|_{\Theta_{t'}}\right\rangle \\
&+\exp\left(\frac{(t'-t)\,|\lambda|}{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}\right)K_{\mathrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t'})\right) .
\end{aligned}\tag{15}$$
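The structure of the recursion in Eq. (15) can be made concrete with a small scalar sketch. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the previous-layer kernel value, the expected inner product of activation derivatives, and the NNGP kernel value are treated as numbers supplied by the caller, since their evaluation depends on the chosen activation.

```python
import math

def unk_prefactor(t, t_prime, lam, sigma_t, sigma_t_prime, rho):
    """Exponential factor exp((t' - t) |lambda| / (sqrt(1 - rho^2) sigma_t sigma_t'))."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    scale = math.sqrt(1.0 - rho * rho) * sigma_t * sigma_t_prime
    return math.exp((t_prime - t) * abs(lam) / scale)

def unk_recursion_step(k_prev, deriv_expectation, k_nngp, prefactor):
    """One layer of the recursion:
    K^(l) = K^(l-1) * E<ds/dh, ds'/dh'> + prefactor * K_NNGP^(l).
    All inputs are scalars provided by the caller (assumed known)."""
    return k_prev * deriv_expectation + prefactor * k_nngp

# The factor equals 1 when lambda = 0 or when t = t'.
factor_lambda_zero = unk_prefactor(t=3.0, t_prime=1.0, lam=0.0,
                                   sigma_t=1.0, sigma_t_prime=1.0, rho=0.2)
factor_equal_times = unk_prefactor(t=2.0, t_prime=2.0, lam=0.5,
                                   sigma_t=1.0, sigma_t_prime=1.0, rho=0.2)

# With lambda != 0 and t > t', the factor decays exponentially in t - t'.
factor_decay = unk_prefactor(t=5.0, t_prime=1.0, lam=0.5,
                             sigma_t=1.0, sigma_t_prime=1.0, rho=0.0)

# One recursion step with made-up scalar inputs.
k_next = unk_recursion_step(k_prev=0.4, deriv_expectation=0.5,
                            k_nngp=0.3, prefactor=factor_lambda_zero)
print(factor_lambda_zero, factor_equal_times, factor_decay, k_next)
```

Keeping the exponential factor in its own function isolates the only place where $t$, $t'$ and $\lambda$ enter the recursion.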

Next, we will analyze the limiting properties of $K_{\mathrm{UNK}}^{(l)}$.

  • In the case of $\lambda=0$, it is obvious that

    $$\exp\left(\frac{(t'-t)\,|\lambda=0|}{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}\right)=1 ,$$

    and thus, Eq. (5) degenerates to the NTK kernel

    $$K_{\mathrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)};\lambda=0\right)=K_{\mathrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right) .$$

    We provide another proof that originates from Eq. (4) with $\lambda=0$ in Appendix C.

  • In the case of $\lambda\neq 0$ and $t=t'$, one has

    $$\exp\left(\frac{(t-t)\,|\lambda|}{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}\right)=1\ ,$$

    and thus, Eq. (5) equals the NTK kernel

    $$K_{\textrm{UNK}}^{(l)}\left(t,t,\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=K_{\textrm{NTK}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .$$

  • In the case of $\lambda\neq 0$ and $t-t'\to\infty$, we conjecture that

    $$K_{\textrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\to K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\quad\textrm{as}\quad t-t'\to\infty\ .$$

    According to Eq. (15), one has

    $$\begin{aligned}
    \int_{t'}^{t}K_{\textrm{UNK}}^{(l)}&\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\mathrm{d}t\\
    =&\ \int_{t'}^{t}K_{\textrm{UNK}}^{(l-1)}\left(t,t',\bm{s}^{(l-2)},\bm{s}^{\prime(l-2)}\right)\mathbb{E}\left\langle\frac{\partial\bm{s}^{(l-1)}}{\partial\bm{h}^{(l-1)}}\Big|_{\Theta_{t}},\frac{\partial\bm{s}^{\prime(l-1)}}{\partial\bm{h}^{\prime(l-1)}}\Big|_{\Theta_{t'}}\right\rangle\mathrm{d}t\\
    &+\int_{t'}^{t}\exp\left(\frac{(t'-t)\,|\lambda|}{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}\right)K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t'})\right)\mathrm{d}t\\
    =&\ \int_{t'}^{t}\left[\frac{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}{|\lambda|}\exp\left(\frac{(t'-t)\,|\lambda|}{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}\right)\right]_{\partial t}K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t'})\right)\mathrm{d}t\\
    &+\int_{t'}^{t}\exp\left(\frac{(t'-t)\,|\lambda|}{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}\right)\left[\frac{\sigma^{2}}{|\lambda|}K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t'})\right)\right]_{\partial t}\mathrm{d}t\\
    =&\ \frac{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}{|\lambda|}\int_{t'}^{t}\left[\exp\left(\frac{(t'-t)\,|\lambda|}{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}}\right)K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)}(\Theta_{t}),\bm{s}^{\prime(l-1)}(\Theta_{t'})\right)\right]_{\partial t}\mathrm{d}t\ ,
    \end{aligned}$$

    where $[\cdot]_{\partial t}$ denotes differentiation with respect to $t$. Thus, for any $t'$, it is easy to prove that

    $$\lim_{t\to\infty}\int_{t'}^{t}K_{\textrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\mathrm{d}t=\frac{\sqrt{1-\rho_{t,t'}^{2}}\,\sigma^{2}}{|\lambda|}K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ .$$

    Here, we consider the correlation coefficient $\rho_{t,t'}$ to be inversely proportional to $t-t'$, since the correlation between the variables becomes smaller as the time gap increases. Generally, we employ

    $$\rho_{t,t'}=\mathbf{\Theta}\left(\frac{1}{t-t'}\right)\quad\textrm{and}\quad\lim_{t-t'\to\infty}\frac{\rho_{t,t'}}{t-t'}=C\in\mathbb{R}\ .$$

    Thus, we can obtain

    $$\lim_{t-t'\to\infty}K_{\textrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)=K_{\textrm{NNGP}}^{(l)}\left(\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)}\right)\ ,$$

    in which we omit the constant multiplier.
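
As a numerical illustration of the three cases above, the following Python sketch evaluates the exponential weight that multiplies $K_{\textrm{NNGP}}^{(l)}$ and its integral over $[t', t]$. It is only a sketch under simplifying assumptions that are not part of the proof: $\rho_{t,t'}$ is held constant, $\sigma_{t}=\sigma_{t'}=\sigma$, and the NNGP factor is treated as constant in $t$. The function names `unk_weight` and `integrated_weight` are hypothetical.

```python
import math


def unk_weight(t, t_prime, lam, rho, sigma_t, sigma_tp):
    """Exponential weight exp((t' - t)|lambda| / (sqrt(1 - rho^2) sigma_t sigma_t'))."""
    scale = math.sqrt(1.0 - rho ** 2) * sigma_t * sigma_tp
    return math.exp((t_prime - t) * abs(lam) / scale)


def integrated_weight(t, t_prime, lam, rho, sigma, steps=100_000):
    """Midpoint-rule approximation of the weight integrated over [t', t].

    Assumption: rho and sigma do not depend on the integration variable.
    """
    h = (t - t_prime) / steps
    total = 0.0
    for k in range(steps):
        s = t_prime + (k + 0.5) * h
        total += unk_weight(s, t_prime, lam, rho, sigma, sigma)
    return total * h


# Cases lambda = 0 and t = t': the weight equals one.
w_lambda_zero = unk_weight(3.0, 1.0, 0.0, 0.2, 1.0, 1.0)
w_equal_times = unk_weight(2.0, 2.0, 0.7, 0.2, 1.0, 1.0)

# Case of a large gap t - t': the integral approaches sqrt(1 - rho^2) sigma^2 / |lambda|.
lam, rho, sigma = 0.5, 0.2, 1.3
closed_form = math.sqrt(1.0 - rho ** 2) * sigma ** 2 / abs(lam)
approx = integrated_weight(200.0, 0.0, lam, rho, sigma)
```

Under these assumptions, `approx` agrees with `closed_form`, which is the constant multiplier in front of $K_{\textrm{NNGP}}^{(l)}$ in the integrated limit.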

Considering the mild assumption of $\sigma^{2}=\max_{t}\sigma_{t}^{2}=\min_{t}\sigma_{t}^{2}$, as mentioned above, we can further simplify these conclusions from

$$\sqrt{1-\rho_{t,t'}^{2}}\,\sigma_{t}\sigma_{t'}\to\sigma^{2}\quad\textrm{as}\quad t-t'\in\mathbb{R}^{+}\ .$$

This completes the proof. $\square$

Appendix C For the case of $\lambda=0$

For the case of $\lambda=0$, we can update $\Theta$ from

$$\Theta_{t+\mathrm{d}t}=\Theta_{t}-\frac{\mathrm{d}\hbar(\Theta)}{\mathrm{d}\Theta}\Big|_{t}\ .$$

Here, we omit the learning rate for simplicity. For convenience, we abbreviate $\Theta(t)$ as $\Theta_{t}$ and write $\nabla_{t}=\mathrm{d}\hbar(\Theta_{t})/\mathrm{d}\Theta_{t}$. It is observed that

$$\mathrm{Var}\left(\Theta_{t+\mathrm{d}t}\right)=\mathrm{Var}\left(\Theta_{t}-\nabla_{t}\right)=\mathbb{E}\left(\Theta_{t}-\nabla_{t}\right)^{2}-\left[\mathbb{E}\left(\Theta_{t}-\nabla_{t}\right)\right]^{2}\ .$$

Moreover, $\mathrm{Var}(\Theta_{t+\mathrm{d}t})$ converges as $n\to\infty$ and $t\to\infty$. Thus, the sequence $\{\mathrm{Var}(\Theta_{t})\}_{t}$ is bounded. Here, we define

$$\mathrm{Var}(\Theta_{t})\leq\sigma_{t}^{2}\quad\text{and}\quad\sigma^{2}=\max_{t}\ \sigma_{t}^{2}\ .$$

Let $f_{\Theta_{t}}(\cdot)$ denote the probability density function of $\Theta(t)$. Thus, we have

$$f_{\Theta_{t+\mathrm{d}t}}(u)=\iint\delta(v)\,f_{\Theta_{t}}(x)\,f_{\nabla_{t}}(y)\,\mathrm{d}x\,\mathrm{d}y$$

with

$$\left\{\ \begin{aligned}
f_{\Theta_{t}}(x)&=\frac{1}{\sigma_{x}\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2\sigma_{x}^{2}}\right)\\
f_{\nabla_{t}}(y)&=\frac{1}{\sigma_{y}\sqrt{2\pi}}\exp\left(-\frac{y^{2}}{2\sigma_{y}^{2}}\right)
\end{aligned}\right.$$

where $v=u-x+y$, $\nabla_{t}=\mathrm{d}\hbar(\Theta_{t})/\mathrm{d}\Theta_{t}$, and $\delta(\cdot)$ indicates the Dirac-delta function. Thus, it is feasible to conjecture that $\Theta_{t+\mathrm{d}t}$ obeys the Gaussian distribution with zero mean. We define $\Theta_{t+\mathrm{d}t}\sim\mathcal{N}(0,\sigma_{u}^{2})$.
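
The double integral above can be checked numerically. The sketch below integrates out the Dirac delta, which leaves $f_{\Theta_{t+\mathrm{d}t}}(u)=\int f_{\Theta_{t}}(x)\,f_{\nabla_{t}}(x-u)\,\mathrm{d}x$, and compares the result with a zero-mean Gaussian density. It assumes, beyond what the text states, that $\Theta_{t}$ and $\nabla_{t}$ are independent, in which case $\sigma_{u}^{2}=\sigma_{x}^{2}+\sigma_{y}^{2}$. The helper names and the values of $\sigma_{x}$, $\sigma_{y}$ are illustrative only.

```python
import math


def gaussian_pdf(z, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-z * z / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def density_of_update(u, sigma_x, sigma_y, half_width=12.0, steps=20_000):
    """Density of u = x - y obtained from the delta constraint v = u - x + y = 0.

    Midpoint rule over x in [-half_width * sigma_x, half_width * sigma_x];
    independence of x and y is an assumption of this sketch.
    """
    lo = -half_width * sigma_x
    h = 2.0 * half_width * sigma_x / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += gaussian_pdf(x, sigma_x) * gaussian_pdf(x - u, sigma_y)
    return total * h


sigma_x, sigma_y = 1.0, 0.5                       # illustrative values
sigma_u = math.sqrt(sigma_x ** 2 + sigma_y ** 2)  # variance adds under independence
points = [0.0, 0.7, -1.5]
max_error = max(
    abs(density_of_update(u, sigma_x, sigma_y) - gaussian_pdf(u, sigma_u))
    for u in points
)
```

Here `max_error` stays at the level of floating-point noise, consistent with the zero-mean Gaussian form $\mathcal{N}(0,\sigma_{u}^{2})$ assumed for $\Theta_{t+\mathrm{d}t}$.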

Thus, the second moment in $\mathcal{M}_{\bm{s}}(\cdot)$ becomes

$$m_{2}(s)=\int s^{2}\,f(s)\,\mathrm{d}s=\int s^{2}(\Theta)\,\frac{1}{\sigma_{u}\sqrt{2\pi}}\exp\left(-\frac{s^{2}(\Theta)}{2\sigma_{u}^{2}}\right)\frac{\mathrm{d}s(\Theta)}{\mathrm{d}\Theta}\,\mathrm{d}\Theta\ ,$$

where $s=\bm{s}^{(l)}_{i}$ for $l\in[L]$ and $i\in[n_{l}]$. Based on the above equations, we can obtain the concerned kernel

$$K_{\textrm{UNK}}^{(l)}\left(t,t',\bm{s}^{(l-1)},\bm{s}^{\prime(l-1)};\lambda=0\right)=\mathbb{E}\left\langle\frac{\partial\bm{h}^{(l)}(\Theta_{t})}{\partial\Theta_{t}},\frac{\partial\bm{h}^{\prime(l)}(\Theta_{t'})}{\partial\Theta_{t'}}\right\rangle\ ,$$

which coincides with the theory of NTK and our proposed unified kernel. $\square$

Appendix D Uniform Tightness of $K_{\textrm{UNK}}^{(l)}$

Lemma 4.4 can be straightforwardly derived from the Kolmogorov Continuity Theorem [25], provided the Polish space $(\mathbb{R},|\cdot|)$.

D.1 Full Proof of Lemma 4.5

It suffices to prove that

  • 1) $\bm{x}=\bm{0}$ is a tight point of $\bm{s}_{t}(\bm{x})$ ($t\in[T]$) in $\mathcal{C}(\mathbb{R}^{n_{0}},\mathbb{R})$. This claim is self-evident since every probability measure in $(\mathbb{R},|\cdot|)$ is tight [29].

  • 2) The statistic $(\bm{s}_{1}(\bm{0})+\dots+\bm{s}_{t}(\bm{0}))/t$ converges in distribution as $t\to\infty$. This claim has been proved by Theorem 2.

Therefore, we finish the proof of this lemma. $\square$

D.2 Full Proof of Lemma 4.6

This proof follows mathematical induction. Before that, we show the following preliminary result. Let θ𝜃\thetaitalic_θ be one element of the augmented matrix (𝐖(l),𝒃(l))superscript𝐖𝑙superscript𝒃𝑙(\mathbf{W}^{(l)},\bm{b}^{(l)})( bold_W start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT , bold_italic_b start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ) at the l𝑙litalic_l-th layer, then we can formulate its characteristic function as

φ(t)=𝔼[eiθt]=eη2t2/2withθ𝒩(0,η2),formulae-sequence𝜑𝑡𝔼delimited-[]superscriptei𝜃𝑡superscriptesuperscript𝜂2superscript𝑡22similar-towith𝜃𝒩0superscript𝜂2\varphi(t)=\mathbb{E}\left[\mathop{}\!\mathrm{e}^{\mathrm{i}\theta t}\right]=% \mathop{}\!\mathrm{e}^{-\eta^{2}t^{2}/2}\quad\text{with}\quad\theta\sim% \mathcal{N}(0,\eta^{2})\ ,italic_φ ( italic_t ) = blackboard_E [ roman_e start_POSTSUPERSCRIPT roman_i italic_θ italic_t end_POSTSUPERSCRIPT ] = roman_e start_POSTSUPERSCRIPT - italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_POSTSUPERSCRIPT with italic_θ ∼ caligraphic_N ( 0 , italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ,

where $\mathrm{i}$ denotes the imaginary unit with $\mathrm{i}=\sqrt{-1}$. Thus, the variance of hidden random variables at the $l$-th layer becomes

$$\sigma^{2}_{l}=\eta^{2}\left[1+\frac{1}{n_{l}}\big\|\varphi\circ\bm{s}^{(l-1)}\big\|\right]. \tag{16}$$
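As a quick numerical sanity check of the characteristic function above, the following sketch estimates $\mathbb{E}[\mathrm{e}^{\mathrm{i}\theta t}]$ by Monte Carlo and compares it with the closed form $\mathrm{e}^{-\eta^{2}t^{2}/2}$. The function names, the sample size, and the values of $\eta$ and $t$ are illustrative assumptions and are not part of the original proof.

```python
import cmath
import math
import random


def empirical_cf(t, eta, num_samples=200_000, seed=0):
    """Monte Carlo estimate of E[exp(i * theta * t)] with theta ~ N(0, eta^2)."""
    rng = random.Random(seed)
    total = 0j
    for _ in range(num_samples):
        theta = rng.gauss(0.0, eta)
        total += cmath.exp(1j * theta * t)
    return total / num_samples


def gaussian_cf(t, eta):
    """Closed-form characteristic function of N(0, eta^2)."""
    return math.exp(-(eta ** 2) * (t ** 2) / 2.0)


eta, t = 0.5, 1.5  # illustrative values (assumptions)
estimate = empirical_cf(t, eta)
exact = gaussian_cf(t, eta)
print(f"Monte Carlo: {estimate.real:.4f}{estimate.imag:+.4f}i, closed form: {exact:.4f}")
```

The imaginary part of the estimate should be close to zero because the Gaussian is symmetric around the origin, so the characteristic function is real-valued.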

Next, we provide two useful definitions from [28].

Definition D.8

A function $\phi:\mathbb{R}\to\mathbb{R}$ is said to be well-posed if $\phi$ is first-order differentiable and its derivative is bounded by a certain constant $C_{\phi}$. In particular, the commonly used activation functions like ReLU, tanh, and sigmoid are well-posed (see Table 2).

Table 2: Well-posedness of the commonly used activation functions.

| Activation $\phi$ | Well-posedness |
|---|---|
| ReLU | $\lVert\phi'(\bm{x})\rVert\leq 1$ |
| tanh | $\lVert\phi'(\bm{x})\rVert=\lVert 1-\phi^{2}(\bm{x})\rVert\leq 1$ |
| sigmoid | $\lVert\phi'(\bm{x})\rVert=\lVert\phi(\bm{x})(1-\phi(\bm{x}))\rVert\leq 0.25$ |

Definition D.9

A matrix $\mathbf{W}$ is said to be stable-pertinent for a well-posed activation function $\phi$, in short $\mathbf{W}\in SP(\phi)$, if the inequality $C_{\phi}\|\mathbf{W}\|<1$ holds.
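The two definitions above can be illustrated numerically. The sketch below approximates the derivative bound $C_{\phi}$ of ReLU, tanh, and sigmoid on a grid, and then checks the stable-pertinent condition $C_{\phi}\|\mathbf{W}\|<1$ for a hypothetical weight matrix. The definition does not fix a matrix norm, so the Frobenius norm used here is an assumption; the helper names and the matrix `W` are likewise illustrative.

```python
import math


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def derivative_bound(grad, lo=-10.0, hi=10.0, steps=2001):
    """Largest |phi'(x)| on a uniform grid; a numerical stand-in for C_phi."""
    return max(abs(grad(lo + (hi - lo) * k / (steps - 1))) for k in range(steps))


def frobenius_norm(matrix):
    return math.sqrt(sum(v * v for row in matrix for v in row))


def is_stable_pertinent(matrix, c_phi):
    """Check C_phi * ||W|| < 1, with the Frobenius norm as an assumed choice of norm."""
    return c_phi * frobenius_norm(matrix) < 1.0


bounds = {
    "relu": derivative_bound(relu_grad),
    "tanh": derivative_bound(tanh_grad),
    "sigmoid": derivative_bound(sigmoid_grad),
}
W = [[1.0, 0.5], [0.5, 1.0]]  # hypothetical weight matrix
for name, c_phi in bounds.items():
    print(name, c_phi, is_stable_pertinent(W, c_phi))
```

Because the sigmoid has the smallest derivative bound, the same matrix can be stable-pertinent for the sigmoid while failing the condition for tanh or ReLU.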

Since the activation $\phi$ is a well-posed function and $(\mathbf{W}^{(l)},\bm{b}^{(l)})\in SP(\phi)$, we affirm that $\phi$ is Lipschitz continuous (with Lipschitz constant $L_{\phi}$). Now, we start the mathematical induction. When $t=1$, for any $\bm{x},\bm{x}'\in\mathbb{R}^{n_{0}}$, we have

$$\mathbb{E}\left[\,\|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}')\|_{\alpha}^{\textrm{sup}}\,\right]\leq C_{\eta,\theta,\alpha}\|\bm{x}-\bm{x}'\|_{\alpha},$$

where $C_{\eta,\theta,\alpha}=\eta^{\alpha}\,\mathbb{E}[|\mathcal{N}(0,1)|^{\alpha}]$. Per mathematical induction, for $t\geq 1$, we have

$$\mathbb{E}\left[\,\|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}')\|_{\alpha}^{\textrm{sup}}\,\right]\leq C_{\eta,\theta,\alpha}\|\bm{x}-\bm{x}'\|_{\alpha}.$$

Thus, one has

$$\mathbb{E}\left[\,\|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}')\|_{\alpha}^{\textrm{sup}}\,\right]\leq\frac{(C_{\phi})^{\alpha}}{n_{l}}\,\mathbb{E}[\,|\mathcal{N}(0,1)|^{\alpha}\,]\,\Big\|\bm{s}_{t-1}(\bm{x})-\bm{s}_{t-1}(\bm{x}')\Big\|_{\alpha}, \tag{17}$$

where

$$\begin{aligned}
C_{\phi}&=\sigma^{2}_{0}(\bm{x})-2\Sigma_{\bm{x},\bm{x}'}+\sigma^{2}_{0}(\bm{x}')\\
&=\frac{\eta^{2}}{n_{l}}\,\Big\|\phi\circ\bm{s}_{t-1}(\bm{x})-\phi\circ\bm{s}_{t-1}(\bm{x}')\Big\|_{2}\qquad\text{(from Eq. (16))}\\
&\leq\frac{\eta^{2}L_{\phi}^{2}}{n_{l}}\,\big\|\bm{s}_{t-1}(\bm{x})-\bm{s}_{t-1}(\bm{x}')\big\|_{2}.
\end{aligned}$$

Thus, Eq. (17) becomes

$$\mathbb{E}\left[\,\|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}')\|_{\alpha}^{\textrm{sup}}\,\right]\leq C'_{\eta,\theta,\alpha}\|\bm{x}-\bm{x}'\|_{\alpha},$$

where

$$C_{\eta,\theta,\alpha}'=\frac{(\eta L_{\phi})^{\alpha}}{n_{l}}\big\|\bm{s}_{t-1}(\bm{x})-\bm{s}_{t-1}(\bm{x}')\big\|_{\alpha}\,\mathbb{E}[\,|\mathcal{N}(0,1)|^{\alpha}\,].$$

Iterating this argument, we obtain

$$\mathbb{E}\left[\,\|\bm{s}_{t}(\bm{x})-\bm{s}_{t}(\bm{x}')\|_{\alpha}^{\textrm{sup}}\,\right]\leq C_{\eta,\theta,\alpha}\|\bm{x}-\bm{x}'\|_{\alpha},$$

where

$$C_{\eta,\theta,\alpha}=\eta^{\alpha(t+1)}L_{\phi}^{\alpha t}\,\mathbb{E}[\,|\mathcal{N}(0,1)|^{\alpha}\,]^{t+1}.$$

The above induction holds for any positive even $\alpha$. Let $\beta=\alpha-n_{0}>0$, then this lemma is proved as desired. $\square$

Appendix E Tight Bound for Convergence

We begin this proof with the following lemmas.

Lemma E.10

Let $f:\mathbb{R}^{n_{0}}\to\mathbb{R}$ be a Lipschitz continuous function with constant $C_{n_{0}}$ and let $P_{X}$ denote the Gaussian distribution $\mathcal{N}(0,\eta^{2})$. Then for all $\delta>0$, there exists $c>0$ such that

$$\mathbb{P}\left(\left|f(\bm{x})-\int f(\bm{x}')\,\mathrm{d}P_{X}(\bm{x}')\right|>\delta\right)\leq 2\,\mathrm{e}^{-\frac{c\delta^{2}}{C_{n_{0}}^{2}}}. \tag{18}$$

Lemma E.10 shows that the Gaussian distribution corresponding to our samples satisfies the log-Sobolev inequality, i.e., Eq. (18), with some constants unrelated to dimension $n_{0}$. This result also holds for the uniform distributions on the sphere or unit hypercube [18].
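The dimension-free nature of Eq. (18) can be probed empirically. The sketch below uses the Euclidean norm as a test function (it is Lipschitz with constant 1), draws Gaussian samples in several dimensions, and compares the empirical tail probability with the classical Gaussian concentration bound $2\mathrm{e}^{-\delta^{2}/(2\eta^{2})}$, which corresponds to taking $c=1/(2\eta^{2})$ and $C_{n_{0}}=1$. The choice of test function, the sample sizes, and the use of the sample mean in place of the integral are assumptions made for illustration.

```python
import math
import random


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def tail_estimate(f, n0, eta, delta, num_samples=2000, seed=0):
    """Empirical P(|f(x) - mean f| > delta) for x ~ N(0, eta^2 I_{n0})."""
    rng = random.Random(seed)
    values = [f([rng.gauss(0.0, eta) for _ in range(n0)]) for _ in range(num_samples)]
    mean = sum(values) / num_samples  # sample mean stands in for the integral
    return sum(abs(v - mean) > delta for v in values) / num_samples


eta, delta = 1.0, 2.0  # illustrative values (assumptions)
# Classical Gaussian concentration bound for a 1-Lipschitz function.
bound = 2.0 * math.exp(-delta ** 2 / (2.0 * eta ** 2))
for n0 in (10, 50, 200):
    print(n0, tail_estimate(l2_norm, n0, eta, delta), bound)
```

The bound does not change with `n0`, which is the property the lemma relies on.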

Lemma E.11

Suppose that $\bm{x}_{1},\dots,\bm{x}_{N}$ are i.i.d. sampled from $\mathcal{N}(0,\eta^{2})$, then with probability $1-\delta>0$, we have

$$\|\bm{x}_{i}\|_{2}=\mathbf{\Theta}(\sqrt{n_{0}})\quad\text{and}\quad|\langle\bm{x}_{i},\bm{x}_{j}\rangle|^{r}\leq n_{0}N^{-1/(r-0.5)},$$

for $i\neq j$, where

$$\delta\leq N\mathrm{e}^{-\Omega(n_{0})}+N^{2}\mathrm{e}^{-\Omega\left(n_{0}N^{-2/(r-0.5)}\right)}.$$

From Definition 1 of the manuscript, we have

$$\int\|\bm{x}\|_{2}^{2}\,\mathrm{d}P_{X}(\bm{x})=\mathbf{\Theta}(n_{0}).$$

Since $\bm{x}_{1},\dots,\bm{x}_{N}$ are i.i.d. sampled from $P_{X}=\mathcal{N}(0,\eta^{2})$, for all $i\in[N]$, we have $\|\bm{x}_{i}\|_{2}^{2}=\mathbf{\Theta}(n_{0})$ with probability at least $1-N\mathrm{e}^{-\Omega(n_{0})}$. Provided $\bm{x}_{i}$, the single-sided inner product $\langle\bm{x}_{i},\cdot\rangle$ is Lipschitz continuous with the constant $C_{n_{0}}=\mathcal{O}(\sqrt{n_{0}})$. As such, from Lemma E.10, for all $j\neq i$, we have

$$\mathbb{P}\left(|\langle\bm{x}_{i},\bm{x}_{j}\rangle|>\delta^{*}\right)\leq 2\,\mathrm{e}^{-{\delta^{*}}^{2}/C_{n_{0}}^{2}}.$$

Then, for $r\geq 2$, we have

$$\mathbb{P}\left(\max_{j\neq i}|\langle\bm{x}_{i},\bm{x}_{j}\rangle|^{r}>\delta^{*}\right)\leq N^{2}\mathrm{e}^{-\Omega\left({\delta^{*}}^{2}\right)}.$$

We complete the proof by setting $\delta^{*}\leq n_{0}N^{-1/(r-0.5)}$. $\square$
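The qualitative content of Lemma E.11 (norms concentrate at the scale $\sqrt{n_{0}}$, while inner products between distinct samples are of smaller order) can be observed in a small simulation. The sketch below is only an illustration under assumed values of $N$, $n_{0}$, and $\eta=1$; it does not reproduce the constants hidden in the $\mathbf{\Theta}$ and $\Omega$ notation.

```python
import math
import random


def sample_gaussian(num_points, n0, eta, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, eta) for _ in range(n0)] for _ in range(num_points)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm_and_overlap_stats(points):
    """Return (min, max) of ||x_i||_2 / sqrt(n0) and max_{i != j} |<x_i, x_j>| / n0."""
    n0 = len(points[0])
    ratios = [math.sqrt(dot(x, x) / n0) for x in points]
    max_overlap = max(
        abs(dot(points[i], points[j])) / n0
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    return min(ratios), max(ratios), max_overlap


for n0 in (100, 1000):
    lo, hi, overlap = norm_and_overlap_stats(sample_gaussian(20, n0, eta=1.0))
    print(f"n0={n0}: norm ratio in [{lo:.3f}, {hi:.3f}], max overlap {overlap:.3f}")
```

As `n0` grows, the norm ratios tighten around 1 and the normalized overlap shrinks, which is the regime in which the diagonal of the Gram matrix dominates its off-diagonal entries.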

E.1 Full proof of Theorem 4.7

We start this proof with some notations. For convenience, we set $n=|\bm{s}^{(1)}|_{\#}=|\bm{s}^{(2)}|_{\#}=\dots=|\bm{s}^{(L)}|_{\#}$, or equivalently, $n=n_{1}=\dots=n_{L}$. We also abbreviate the covariance $\mathrm{Cov}(\bm{s}^{(l)},\bm{s}^{(l)})$ as $\mathbf{C}_{l}$ throughout this proof.

Unfolding the $K_{\textrm{UNK}}^{(l)}$ kernel equation that omits the epoch stamp

$$K_{\textrm{UNK}}^{(l)}(\bm{x}_{i},\bm{x}_{j})=\mathbb{E}[\langle f(\bm{x}_{i};\bm{\theta}),f(\bm{x}_{j};\bm{\theta})\rangle],\quad\text{for}\quad\bm{x}_{i},\bm{x}_{j}\in\mathcal{D}, \tag{19}$$

we have

$$K_{\textrm{UNK}}^{(l)}(\bm{x}_{i},\bm{x}_{j})=\frac{1}{M_{\bm{z}}}\left[\sum_{\kappa}\varphi_{\kappa}+\sum_{\kappa_{1}\neq\kappa_{2}}\psi_{\kappa_{1},\kappa_{2}}\right], \tag{20}$$

where

$$\left\{\begin{aligned}
&\varphi_{l}=\mathbb{E}\left[\langle\bm{s}^{(l)},\bm{s}^{(l)}\rangle\right],\\
&\psi_{l_{1}l_{2}}=\sum\nolimits_{p,q}\mathbb{E}\left[\bm{s}_{p}^{(l_{1})}\bm{s}_{q}^{(l_{2})}\right],\quad\text{for}\quad l_{1}\neq l_{2},
\end{aligned}\right.$$

in which the subscript $p$ indicates the $p$-th element of vector $\bm{s}^{(l)}$. From Theorem 1 of the manuscript, the sequence of random variables $\bm{s}^{(l)}$ is weakly dependent with $\beta(t)\to\infty$ as $t\to\infty$. Thus, $\psi_{l_{1}l_{2}}$ is an infinitesimal with respect to $|l_{2}-l_{1}|$ when $l_{1}\neq l_{2}$.

Invoking the following inequalities

$$\left\{\begin{aligned}
&\chi_{\min}(\mathbf{P}\mathbf{Q})\geq\chi_{\min}(\mathbf{P})\min_{i\in[m]}\mathbf{Q}(i,i)\\
&\chi_{\min}(\mathbf{P}+\mathbf{Q})\geq\chi_{\min}(\mathbf{P})+\chi_{\min}(\mathbf{Q})
\end{aligned}\right.$$

into Eq. (20), we have

$$\chi_{\min}(K_{\textrm{UNK}}^{(l)})\geq\sum\nolimits_{l}\chi_{\min}\left(\mathbf{C}_{l}\right), \tag{21}$$

and

$$\chi_{\min}\left(\mathbf{C}_{l}\right)\geq\chi_{\min}\left(\mathbf{C}_{l}\right),\quad\text{for}\quad l\in[L]. \tag{22}$$

Iterating Eq. (22) and then invoking it into Eq. (21), we have

$$\chi_{\min}\left(K_{\textrm{UNK}}^{(l)}\right)\geq\sum_{l}\chi_{\min}\left(\mathbf{C}_{1}\right). \tag{23}$$
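The additive eigenvalue inequality used in this step, $\chi_{\min}(\mathbf{P}+\mathbf{Q})\geq\chi_{\min}(\mathbf{P})+\chi_{\min}(\mathbf{Q})$, is an instance of Weyl's inequality for symmetric matrices. The sketch below checks it on random symmetric $2\times 2$ matrices, for which the smallest eigenvalue has a closed form. Restricting to the $2\times 2$ symmetric case and all helper names are assumptions made to keep the example self-contained.

```python
import math
import random


def min_eig_2x2(m):
    """Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    return (a + d) / 2.0 - math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)


def random_symmetric(rng):
    a, b, d = (rng.uniform(-1.0, 1.0) for _ in range(3))
    return [[a, b], [b, d]]


def add(p, q):
    return [[p[i][j] + q[i][j] for j in range(2)] for i in range(2)]


rng = random.Random(0)
worst_gap = float("inf")
for _ in range(1000):
    P, Q = random_symmetric(rng), random_symmetric(rng)
    gap = min_eig_2x2(add(P, Q)) - (min_eig_2x2(P) + min_eig_2x2(Q))
    worst_gap = min(worst_gap, gap)
print("smallest observed gap:", worst_gap)  # nonnegative up to rounding error
```

Applied layer by layer to a sum of covariance matrices, this superadditivity is what allows the smallest eigenvalue of the sum to be bounded by the sum of the individual smallest eigenvalues.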

From the Hermite expansion [29] of the ReLU function, we have

$$\mu_{r}(\psi)=(-1)^{\frac{r-2}{2}}(r-3)!!/\sqrt{2\pi r!}, \tag{24}$$

where $r\geq 2$ indicates the expansion order. Thus, we have

$$\begin{aligned}
\chi_{\min}\left(\mathbf{C}_{1}\right)&=\chi_{\min}\left(\psi(\mathbf{W}^{(1)}\mathbf{X})\psi(\mathbf{W}^{(1)}\mathbf{X})^{\top}\right)\\
&\geq\mu_{r}(\psi)^{2}\chi_{\min}\left(\mathbf{X}^{(r)}\left(\mathbf{X}^{(r)}\right)^{\top}\right)\\
&\geq\mu_{r}(\psi)^{2}\left(\min_{i\in[N]}\|\bm{x}_{i}\|_{2}^{2r}-(N-1)\max_{j\neq i}|\langle\bm{x}_{i},\bm{x}_{j}\rangle|^{r}\right)\\
&\geq\mu_{r}(\psi)^{2}\,\Omega(n_{0}),
\end{aligned} \tag{25}$$

where the superscript $(r)$ denotes the $r$-th Khatri-Rao power of the matrix $\mathbf{X}=[\bm{x}_{1},\dots,\bm{x}_{N}]$, the first inequality follows from Eq. (24), the second one holds from the Gershgorin Circle Theorem [24], and the third one follows from Lemma E.11. Therefore, we obtain the lower bound of the smallest eigenvalue by plugging Eq. (25) into Eq. (23).
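The Gershgorin step can be checked numerically. The sketch below is only an illustration under stated assumptions, not the authors' code: it assumes NumPy is available, treats the columns of `X` as the samples $\bm{x}_i$, and forms the $N\times N$ Gram matrix of the Khatri-Rao power, whose entries are $\langle\bm{x}_i,\bm{x}_j\rangle^r$. The helper names `khatri_rao_power` and `gershgorin_lower_bound` are hypothetical.

```python
import numpy as np

def khatri_rao_power(X, r):
    """Column-wise r-th Khatri-Rao power: column i becomes x_i (x) ... (x) x_i, r times."""
    cols = []
    for x in X.T:
        v = x
        for _ in range(r - 1):
            v = np.kron(v, x)
        cols.append(v)
    return np.stack(cols, axis=1)  # shape (n0**r, N)

def gershgorin_lower_bound(X, r):
    """min_i ||x_i||^(2r) - (N - 1) * max_{j != i} |<x_i, x_j>|^r."""
    G = X.T @ X                                   # plain Gram matrix, shape (N, N)
    N = G.shape[0]
    diag = np.diag(G) ** r                        # ||x_i||^(2r)
    off = np.abs(G - np.diag(np.diag(G))) ** r    # |<x_i, x_j>|^r, diagonal zeroed
    return diag.min() - (N - 1) * off.max()

# Toy data (assumption): Gaussian inputs with n0 = 50 features and N = 5 samples.
rng = np.random.default_rng(0)
n0, N, r = 50, 5, 2
X = rng.standard_normal((n0, N))

Xr = khatri_rao_power(X, r)
gram = Xr.T @ Xr                                  # entries equal <x_i, x_j>^r
lam_min = np.linalg.eigvalsh(gram)[0]             # smallest eigenvalue
bound = gershgorin_lower_bound(X, r)
print(f"smallest eigenvalue: {lam_min:.3f}, Gershgorin-type bound: {bound:.3f}")
```

The bound is informative only when the samples are close to orthogonal, so that the off-diagonal term stays below the diagonal one; otherwise it can be negative while the inequality still holds.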

On the other hand, it is observed from Lemma 4.4 that for $l\in[L]$,

$$
\left\{\ \begin{aligned}
&\|\bm{s}_{p}^{(l)}\|^{2}_{2}=\mathbb{E}_{\mathbf{W}^{(l)}_{p}}\left[\psi(\mathbf{W}^{(l)}_{p}\bm{s}^{(l-1)})^{2}\right]=\|\bm{s}_{q}^{(l)}\|^{2},\quad\text{for}\quad\forall q\neq p,\\
&\|\bm{s}^{(l)}\|_{2}^{2}=\mathbb{E}_{\mathbf{W}^{(l)}}\left[\psi(\mathbf{W}^{(l)}\bm{s}^{(l-1)})^{2}\right]\leq\|\bm{s}^{(l)}\|_{2}^{2}\ .
\end{aligned}\right.
\tag{26}
$$

Thus, we have

$$
\begin{aligned}
\chi_{\min}(K_{\textrm{UNK}}^{(l)}) &\leq \frac{\mathrm{tr}(K_{\textrm{UNK}}^{(l)})}{N}=\frac{1}{N}\sum_{i}^{N}K_{\textrm{UNK}}^{(l)}(\bm{x}_{i},\bm{x}_{i}) \\
&\leq \frac{1}{N}\sum_{i}^{N}\frac{1}{M_{\bm{z}}}\left[\sum_{l}\varphi_{l}+\sum_{l_{1}\neq l_{2}}\psi_{l_{1}l_{2}}\right] \\
&\leq \frac{1}{N}\sum_{i}^{N}\left(\frac{1}{l}\sum_{l}\max_{j\in[N]}\|\bm{x}_{j}\|_{2}^{2}+\Omega(n_{0})\right) \\
&\leq \mathbf{\Theta}(n_{0})\ ,
\end{aligned}
$$

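where the first inequality holds because the trace of a symmetric matrix equals the sum of its $N$ eigenvalues, each of which is at least $\chi_{\min}$; the second inequality follows from Eq. (20), the third one follows from Eq. (26), and the fourth one holds from Lemma E.11. This completes the proof. $\square$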

Appendix F Supplementary Experimental Results

This section provides the detailed experimental results. Table 3 lists the optimal trajectory and the corresponding testing accuracy of Grid 0.001 and Grid 0.01 over the epoch. Figure 3 draws the training correlation histograms and the averages for our proposed UNK kernel with the grid search granularity of $\{0.001, 0.01, 0.1, 0, 1, 10\}$. Figure 4 draws the testing correlation histograms and the averages for our proposed UNK kernel with the grid search granularity of $\{0.001, 0.01, 0.1, 0, 1, 10\}$.

Epoch $t$ | Baseline: Testing ACC., Training ACC. | Grid 0.001: $\lambda^{*}_{t}$, Testing ACC., Training ACC. | Grid 0.01: $\lambda^{*}_{t}$, Testing ACC., Training ACC.
1 0.1325 0.1289 0.0100 0.9287 0.9257 0.0800 0.9291 0.9266
2 0.9284 0.9256 0.0020 0.9515 0.9506 0.0800 0.9527 0.9521
3 0.9514 0.9504 0.0040 0.9607 0.9631 0.0900 0.9631 0.9656
4 0.9603 0.9629 0.0080 0.9665 0.9708 0.0700 0.9693 0.9737
5 0.9658 0.9705 0.0070 0.9709 0.9766 0.0900 0.9729 0.9793
6 0.9705 0.9763 0.0050 0.9738 0.9802 0.1000 0.9757 0.9839
7 0.9733 0.9800 0.0060 0.9756 0.9834 0.1000 0.9785 0.9870
8 0.9753 0.9831 0.0000 0.9772 0.9858 0.0800 0.9795 0.9899
9 0.9769 0.9855 0.0080 0.9789 0.9879 0.0500 0.9805 0.9922
10 0.9788 0.9875 0.0000 0.9798 0.9898 0.0900 0.9818 0.9939
11 0.9800 0.9896 0.0000 0.9809 0.9913 0.0600 0.9826 0.9952
12 0.9809 0.9910 0.0000 0.9814 0.9923 0.0600 0.9833 0.9963
13 0.9813 0.9922 0.0040 0.9814 0.9933 0.0700 0.9833 0.9971
14 0.9814 0.9931 0.0020 0.9815 0.9943 0.0800 0.9837 0.9977
15 0.9815 0.9941 0.0020 0.9815 0.9952 0.0500 0.9841 0.9984
16 0.9814 0.9949 0.0080 0.9819 0.9959 0.0700 0.9848 0.9987
17 0.9816 0.9957 0.0060 0.9824 0.9966 0.0900 0.9847 0.9992
18 0.9818 0.9963 0.0070 0.9827 0.9972 0.0700 0.9851 0.9995
19 0.9825 0.9969 0.0070 0.9830 0.9977 0.0000 0.9850 0.9996
20 0.9824 0.9974 0.0100 0.9833 0.9981 0.0800 0.9857 0.9998
21 0.9831 0.9978 0.0070 0.9834 0.9984 0.0100 0.9847 0.9997
22 0.9830 0.9982 0.0100 0.9838 0.9986 0.0200 0.9850 0.9999
23 0.9831 0.9984 0.0050 0.9835 0.9987 0.0000 0.9847 0.9999
24 0.9834 0.9986 0.0000 0.9836 0.9989 0.0000 0.9843 0.9999
25 0.9835 0.9988 0.0050 0.9830 0.9990 0.0000 0.9848 0.9999
26 0.9837 0.9989 0.0030 0.9838 0.9992 0.0000 0.9845 1.0000
27 0.9834 0.9990 0.0000 0.9834 0.9992 0.0000 0.9852 1.0000
28 0.9833 0.9991 0.0000 0.9839 0.9994 0.0000 0.9848 1.0000
29 0.9834 0.9993 0.0000 0.9834 0.9994 0.0000 0.9848 1.0000
30 0.9836 0.9993 0.0020 0.9838 0.9995 0.0000 0.9850 1.0000
Table 3: Illustration of $\lambda^{*}_{t}$ and the corresponding (both training and testing) accuracy (ACC.) of Grid 0.001 and Grid 0.01 over epoch $t$.
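As a rough picture of how a per-epoch value such as $\lambda^{*}_{t}$ could be picked from a grid, the following minimal sketch runs a plain grid search. It is not the authors' procedure: the 11-point grid $\{0, \text{step}, \dots, 10\cdot\text{step}\}$ is an assumption inferred from the range of values in Table 3, and `toy_score` is a hypothetical stand-in for whatever criterion (for example, accuracy after one epoch) is actually maximized.

```python
def make_grid(step, n_points=11):
    """Candidates 0, step, ..., (n_points - 1) * step; n_points is an assumption."""
    return [round(k * step, 10) for k in range(n_points)]

def select_lambda(score_fn, grid):
    """Return the candidate with the highest score; ties go to the smallest value."""
    best_lam, best_score = grid[0], score_fn(grid[0])
    for lam in grid[1:]:
        score = score_fn(lam)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam

def toy_score(lam):
    """Hypothetical objective peaked at lambda = 0.0042 (illustration only)."""
    return -(lam - 0.0042) ** 2

lam_fine = select_lambda(toy_score, make_grid(0.001))   # granularity 0.001
lam_coarse = select_lambda(toy_score, make_grid(0.01))  # granularity 0.01
print(lam_fine, lam_coarse)
```

In this toy setting the finer grid can land close to the peak of the objective, while the coarser grid can only choose between much more widely spaced candidates.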
Figure 3: Histograms of training correlation of (a) Grid 0.001, (b) Grid 0.01, (c) Grid 0.1, (d) Grid 0, (e) Grid 1, and (f) Grid 10, where x- and y-axes denote the number of training instances and the corresponding correlation, respectively.
Figure 4: Histograms of testing correlation of (a) Grid 0.001, (b) Grid 0.01, (c) Grid 0.1, (d) Grid 0, (e) Grid 1, and (f) Grid 10, where x- and y-axes denote the number of testing instances and the corresponding correlation, respectively.