1 Introduction

The human brain contains a vast number of interconnected neurons, which establish a neural network within the brain. Intelligence is believed to originate from this neural network. Electrical activity, interpreted as information, flows through this neural network and forms a dynamical system. Memories are regarded as attractors of the dynamical system of neural activity (Poucet and Save 2005; Wills et al. 2005). Mathematically, ordinary differential equations (ODEs) can be used to describe dynamical systems; thus, brain neural networks can, to some extent, be modeled by ODEs. The use of ODEs to model neural networks has a long history. For example, the famous Hodgkin-Huxley model (Hodgkin and Huxley 1952), a model of the squid giant axon, appeared in 1952. Many research works on using ODEs to study neural networks have been reported; see, for example, Cohen and Grossberg (1983); Grossberg (2013); Hopfield (1984); Zhang et al. (2018); Yi et al. (2003); Yi and Tan (2004); Yi et al. (2009); Yi (2010); Liu et al. (2017).

In recent years, deep learning has achieved great success in many practical applications, and various deep learning models have been reported. One of the well-known deep learning models is ResNet (He et al. 2016). Recent studies have shown that ResNet can be interpreted as a discrete version of certain ODEs (Haber and Ruthotto 2017). Since then, increasing interest has been devoted to using ODEs to study problems arising from deep learning. In Chen et al. (2018), the concept of the neuralODE was proposed to describe a continuous version of ResNet by an ODE, and it was shown that the neuralODE has several benefits such as memory efficiency, adaptive computation, and parameter efficiency. In our view, any ODE used to model a neural network can be referred to as a neuralODE. In Chang et al. (2018), a neuralODE model with stable architectures was proposed to solve the well-posed learning problem for arbitrarily deep networks by ensuring the stability of the neuralODE. In Chang et al. (2019), a neuralODE model with an antisymmetric structure was proposed and shown to exhibit much more predictable dynamics. In Dupont et al. (2019), the authors proved that neuralODEs using the initial value as the data input can learn only features homeomorphic to the input data space, and some simple functions were provided to show this limitation of neuralODEs. To address this problem, augmented neuralODEs were proposed by adding new augmenting dimensions to the networks (Dupont et al. 2019). However, the augmentation requires additional computation. More recent research on neuralODEs can be found in Zhang et al. (2019); Kidger (2021); Chen et al. (2021).

In general, neuralODEs in their generic form are not particularly useful for applications. Carefully designed neuralODEs with simple structures and clear dynamics are always desirable from an application point of view. In the field of artificial neural networks, an important task is to use neural networks to represent complex nonlinear mappings. In the past decades, various artificial neural networks with different representation abilities have been proposed. Studies show that the representation ability of one-layer neural networks is limited. For example, a one-layer neural network using a nondecreasing activation function, such as the sigmoid or ReLU function, cannot represent the well-known XOR mapping, which means it cannot solve the XOR nonlinear classification problem. This is because the nonlinearity of one-layer networks is insufficient. To increase the representation ability of neural networks, their nonlinearity must be enhanced. One way to achieve this is by adding layers, i.e., using multiple layers to build networks. However, a drawback of this method is that the networks may contain a large number of parameters.

In this paper, we propose two methods to enhance the nonlinearity of artificial neural networks. The first method involves using the \(\sin ^{2}(\cdot )\) function as an alternative activation function, rather than nondecreasing functions. In a previous study (Sitzmann et al. 2020), the \(\sin (\cdot )\) function was suggested as an activation function for neural networks, and the authors showed several advantages of using this function. The \(\sin ^{2}(\cdot )\) function inherits some of the advantages of the \(\sin (\cdot )\) function and, in addition, has even higher nonlinearity. Therefore, neural networks using this activation function can improve their nonlinearity more efficiently.

The second method involves the development of an implicit mapping equation by modifying the model of one-layer networks in an interesting way, resulting in a nonlinear mapping with higher nonlinearity. We will demonstrate that for each input, the implicit mapping equation has a unique solution, thereby establishing a mapping from input to its associated unique solution.

Directly solving the implicit mapping equation is generally impossible, so this paper proposes an ODE to solve it indirectly. The proposed ODE has an important property: it has a global attractor for each external input. This attractor is exactly the unique solution of the above-mentioned implicit mapping equation. Regarding the attractor as a memory, the proposed ODE can be viewed as a memory-based artificial neural network, which we simply call the nmODE. The nmODE is designed with a particularly simple structure, which separates learning neurons from memory neurons, and is thus endowed with quite clear dynamics. Moreover, unlike most existing neuralODEs that use the ODE’s initial value as the data input, we use an external input variable in the nmODE to input data. Thus, the nmODE does not suffer from the restriction of learning only features homeomorphic to the input data space, which occurs frequently in existing neuralODEs (Dupont et al. 2019). Given an external input, the nmODE maps the input to its global attractor; thus, the nmODE establishes a mapping from the input data space to the memory space.

The nmODE can be viewed as a mathematical model of columns in the neocortex. Recent research in theoretical neuroscience suggests that a column may function as a unit of intelligence, and that there may be a common algorithm for all columns (Hawkins and Dawkins 2021). A column contains many neurons belonging to different classes. Since memory plays a crucial role in intelligence, we propose a hypothesis that there is a class of neurons for memory within a column, which we call memory neurons. Furthermore, we propose that memory neurons share a common function, which we can implement by defining a mathematical model for each neuron using a one-dimensional ODE, called neuronODE. The dynamical properties of the neuronODE can be considered the common function of memory neurons. The nmODE can be seen as built upon the neuronODE block.

A learning algorithm called nmLA is developed for the nmODE, which involves an interesting three-dimensional inverse ordinary differential equation called invODE, used for computing gradients of the cost function. The invODE has a low dimension of only three, making it easy to solve. This is unlike most existing neuralODEs that use high-dimensional inverse ODEs, which are inefficient for computing. Moreover, the nmLA does not suffer from the gradient vanishing problem, which occurs frequently in traditional neural networks.

Both the nmODE and the nmLA are described by continuous equations, which are not directly suitable for digital computation. In this paper, we propose a discrete version of the nmODE called the \(\epsilon\)-net, along with a corresponding learning algorithm called \(\epsilon\)-LA. Both the \(\epsilon\)-net and \(\epsilon\)-LA are designed for easy digital computation. We propose the neuronDTE, a discrete-time mathematical model of the neuron used to build the \(\epsilon\)-net, and the neuronADE, a two-dimensional equation that facilitates computing gradients in the \(\epsilon\)-LA algorithm. These two constructs make the computation much more efficient.

The main contributions of this paper can be summarized as follows:

  1. We propose an interesting neural network, called the nmODE, which has clear dynamics and the global attractor property. It incorporates a memory mechanism using attractors in the network, establishing a relationship between external input and memory. The structure of the nmODE is unique in that it separates learning from memory: each learning connection exists only from an input neuron to a memory neuron, and no learning connections exist between memory neurons. We also propose a mathematical model of the neuron, called the neuronODE, upon which the nmODE is built.

  2. An efficient learning algorithm called nmLA is developed, which uses an interesting three-dimensional ODE called invODE to compute gradients. This approach avoids the gradient vanishing problem and enables efficient computation.

  3. We propose a discrete version of the nmODE, called the \(\epsilon\)-net, which has several desirable dynamical properties, including the global attractor property. We define a mathematical model of the neuron described by a discrete-time equation, called the neuronDTE, which can be used as a building block for the \(\epsilon\)-net.

  4. A learning algorithm called \(\epsilon\)-LA is proposed for the \(\epsilon\)-net. Unlike most feedforward neural networks, the \(\epsilon\)-LA does not require the number of network layers to be predetermined. By proposing a two-dimensional equation, called the neuronADE, to facilitate gradient computation, \(\epsilon\)-LA achieves high efficiency. Moreover, being a forward computing algorithm, \(\epsilon\)-LA does not suffer from the gradient vanishing problem.

The organization of the rest of this paper is as follows: Sect. 2 presents the terminologies and problem formulation. Section 3 proposes the nmODE and nmLA, starting with an analysis of the representation ability of a one-layer network to construct an implicit mapping equation, which leads to the development of the nmODE. The learning algorithm, nmLA, is also presented in this section, along with a brief review of nmODE from a biological point of view. Section 4 presents the \(\epsilon\)-net and \(\epsilon\)-LA. Section 5 details the experiments conducted. Finally, Sect. 6 concludes with discussions and conclusions.

2 Terminologies and problem formulation

Biological neural networks are organized through the interconnection of neurons, allowing information to flow and form dynamical systems. ODEs are commonly used to describe dynamical systems and can therefore be used to model neural networks. An ODE is referred to as a neural ordinary differential equation (neuralODE) when it is used to describe the dynamics of a neural network. All the ODEs discussed in this paper are considered neuralODEs.

A general neuralODE can be described as

$$\begin{aligned} {\dot{y}} = f(y,x) \end{aligned}$$
(1)

for \(t \ge 0\), where \(y \in R^{n}\) denotes the state of the network, \(x \in R^{m}\) represents the external input variable, which we use for data input, and the function f is continuous and satisfies a Lipschitz condition so that the existence and uniqueness of solutions are guaranteed. Given an initial value y(0), the trajectory of equation (1) starting from y(0) is denoted by y(t, y(0)), or simply y(t), for all \(t \ge 0\).

A set \(S \subset R^{n}\) is called an invariant set of equation (1), if for any \(y(0) \in S\), the trajectory y(t) starting from y(0) remains in S forever.

A vector \(y^{*} \in R^{n}\) is called an equilibrium point of the neuralODE if it satisfies

$$\begin{aligned} f(y^{*}, x) = 0. \end{aligned}$$

It is known that in biological neural networks, memories are encoded as attractors in the dynamics of the networks (Poucet and Save 2005; Wills et al. 2005). A suitably designed neuralODE can likewise store memories as attractors. Let us first define the concept of an attractor for the general neuralODE. There are several notions of attractor; among them, the global attractor has particularly nice properties. A global attractor is, first, an equilibrium point and, in addition, it attracts all trajectories. Formally, an equilibrium point \(y^{*}\) is called a global attractor if for any y(0), the associated trajectory y(t, y(0)) converges to \(y^{*}\) as \(t \rightarrow \infty\). Suppose that the neuralODE has the global attractor property; then the neuralODE establishes a well-defined nonlinear mapping from the input x to the associated attractor \(y^{*}\) as

$$\begin{aligned} \text{ neuralODE }: x \longrightarrow y^{*}. \end{aligned}$$

Figure 1 shows this mapping.

Fig. 1 A nonlinear mapping established by a neuralODE. It maps the input x to the attractor \(y^{*}\). The neuralODE is required to possess the global attractor property so that the mapping can be well established

Not every neuralODE has the global attractor property; designing a neuralODE that possesses the global attractor property together with a simple structure and clear dynamics is desirable.

There are two approaches for inputting data into neuralODEs. One approach is to keep the external input x fixed and use the initial value y(0) as the network input, while the other is to keep the initial value y(0) fixed and use the external input x as the network input. Most existing neuralODEs adopt the first approach. In this paper, we use the second approach for data input. There are fundamental differences between these two approaches. When the initial value y(0) is used as the data input, neuralODEs have a significant drawback: they can only learn representations that preserve the topology of the input data space. In fact, there are simple functions that such neuralODEs cannot represent (Dupont et al. 2019). To address this issue, augmented neuralODEs were proposed in Dupont et al. (2019), but they require additional computation. On the other hand, when x is used as the data input, this problem does not arise. To illustrate this further, consider the simple nonlinear function defined by:

$$\begin{aligned} \left\{ \begin{array}{ll} g(-1) = 1\\ g(1) = -1 \end{array} \right. \end{aligned}$$
(2)

In Dupont et al. (2019), it was proven that there is no ODE that can represent the function g if the initial y(0) is used as the data input. However, if we use x as the data input, this problem can be easily solved. The simple one-dimensional ODE

$$\begin{aligned} {\dot{y}} = -y - x \end{aligned}$$
(3)

can perfectly represent the function g. This can be verified by solving the ODE and obtaining the solution:

$$\begin{aligned} y(t) = \left[ y(0) + x\right] e^{-t} - x \end{aligned}$$

for \(t \ge 0\).
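As a quick numerical illustration (a minimal sketch in plain Python; the initial value, step size, and time horizon are arbitrary choices), integrating equation (3) with a simple Euler scheme shows the state converging to \(-x\), so the inputs \(x=-1\) and \(x=1\) are mapped to 1 and \(-1\), respectively:

```python
def solve_eq3(x, y0=0.0, dt=0.01, T=10.0):
    """Euler-integrate dy/dt = -y - x and return the state at time T."""
    y = y0
    for _ in range(int(T / dt)):
        y += dt * (-y - x)
    return y

for x in (-1.0, 1.0):
    print(x, "->", round(solve_eq3(x), 4))   # -1.0 -> 1.0, 1.0 -> -1.0
```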

3 Neural memory ordinary differential equations

Before introducing the nmODE model, let us first consider the following two motivations: rethinking the one-layer network and utilizing implicit mapping functions.

3.1 On the one-layer network

The well-known model of a one-layer neural network can be expressed as

$$\begin{aligned} y = f\left[ W^{(1)} x + b \right] , \end{aligned}$$
(4)

where \(W^{(1)}\) is an \(n\times m\) matrix, x is the input vector with m dimensions, b is the bias vector with n dimensions, and y is the output vector with n dimensions. The activation function f is traditionally a nondecreasing function, such as the sigmoid function, the hyperbolic tangent function, or the ReLU function, among others. The network (4) defines a mapping from the input x to the output y, as shown in Fig. 2.

Fig. 2 The one-layer neural network defines a mapping. It maps the input x to the output y

It is well-known that the one-layer neural network can be applied successfully to solve linear classification problems. However, it cannot solve nonlinear classification problems with a nondecreasing activation function. A classic example is the XOR problem, where the goal is to classify four points:

$$\begin{aligned} \left\{ \left[ \begin{array}{ll} 0\\ 1 \end{array} \right] , \left[ \begin{array}{ll} 1\\ 0 \end{array} \right] , \left[ \begin{array}{ll} 0\\ 0 \end{array} \right] , \left[ \begin{array}{ll} 1\\ 1 \end{array} \right] \right\} \end{aligned}$$

on a plane into two classes: the first two points in one class and the remaining two in another class, as illustrated in Fig. 3.

Fig. 3 The XOR classification problem. The task is to classify the circle points into one class and the cross points into another one

The one-layer network cannot solve this classification problem because its nonlinearity is not sufficiently high. In fact, the classification ability of the one-layer network is determined by the linear decision surface

$$\begin{aligned} W^{(1)} x + b = 0. \end{aligned}$$

If f is a nondecreasing function, the mapping of the one-layer neural network is topologically linear; thus, it cannot solve nonlinear classification problems such as the XOR problem.

Increasing the nonlinearity of neural networks is crucial for their computational ability. A simple method to increase the nonlinearity of the one-layer network is to remove the nondecreasing constraint on the activation function. An example is to use the \(\sin ^{2}(\cdot )\) function as the activation function, i.e.,

$$\begin{aligned} y = \sin ^{2}\left[ W^{(1)} x + b \right] . \end{aligned}$$
(5)

The nonlinearity of this network is much higher than that of a network using a nondecreasing activation function. A one-layer network with this activation function can easily solve the XOR problem; for example, choosing the parameters of network (5) gives

$$\begin{aligned} y = \sin ^{2}\left[ -\frac{\pi }{2} \left( x_{1}+x_{2}\right) + \frac{\pi }{2} \right] . \end{aligned}$$
(6)

It can be easily checked that the network (6) maps the points \([0,0]^T\) and \([1,1]^T\) to 1, and the points \([0,1]^T\) and \([1,0]^T\) to 0; thus, the XOR classification problem is perfectly solved, see Fig. 4.
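A minimal numerical check of network (6) on the four XOR points (plain Python; nothing beyond equation (6) is assumed):

```python
import math

def xor_net(x1, x2):
    """One-layer network (6): y = sin^2(-pi/2 * (x1 + x2) + pi/2)."""
    return math.sin(-math.pi / 2 * (x1 + x2) + math.pi / 2) ** 2

for point in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(point, "->", round(xor_net(*point), 6))
# (0, 1) -> 0.0, (1, 0) -> 0.0, (0, 0) -> 1.0, (1, 1) -> 1.0
```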

Fig. 4 The one-layer network uses the activation function \(\sin ^{2}(\cdot )\) to solve the XOR classification problem

3.2 An implicit mapping equation

Another approach to increase the nonlinearity of the one-layer network is to use implicit functions. By making a simple modification to the representation of the network, we can obtain an implicit mapping equation of the form:

$$\begin{aligned} y = f\left[ y + W^{(1)} x + b \right] . \end{aligned}$$
(7)

This implicit mapping equation defines a nonlinear mapping F from input x to output y as:

$$\begin{aligned} F: x \rightarrow y, \end{aligned}$$

where F is an implicit function, and it is clear that the nonlinearity of F is higher than that of f.

As discussed in the previous subsection, using the \(\sin ^2(\cdot )\) function as the activation function in the one-layer network can increase its nonlinearity and solve nonlinear classification problems such as the XOR problem. Combining the two methods of increasing nonlinearity described above, we propose a specific implicit mapping equation:

$$\begin{aligned} y = \sin ^{2} \left[ y + W^{(1)}x + b \right] , \end{aligned}$$
(8)

or equivalently in component form as

$$\begin{aligned} y_{i} = \sin ^{2}\left[ y_{i} + \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right] , (i=1, \cdots , n). \end{aligned}$$

We will show that this implicit mapping equation (8) has a unique solution. First, we will prove a lemma.

Lemma 1

The one-dimensional equation

$$\begin{aligned} p = \sin ^{2}(p + \gamma ) \end{aligned}$$
(9)

has one and only one solution \(p^{*}\), where \(\gamma\) is a constant.

Proof

Define a function

$$\begin{aligned} \mu (p) = p - \sin ^{2}(p + \gamma ). \end{aligned}$$

Clearly, \({\dot{\mu }}(p) = 1 - \sin \left[ 2(p + \gamma )\right] \ge 0\), i.e., \(\mu\) is a nondecreasing function, and \({\dot{\mu }}(p)\) vanishes only at isolated points, so \(\mu\) is strictly increasing. Moreover, since \(\sin ^{2}(\cdot ) \in [0, 1]\), any solution of (9) must lie in [0, 1].

Note that \(\mu (0) = -\sin ^{2}(\gamma ) \le 0\) and \(\mu (1) = 1 - \sin ^{2}(1 + \gamma ) \ge 0\). In particular, \(p=1\) is a solution of (9) if and only if \(\gamma = k\pi + \pi /2 - 1\) for some integer k, and \(p=0\) is a solution of (9) if and only if \(\gamma = k\pi\) for some integer k.

Since \(\mu\) is continuous and strictly increasing on [0, 1] with \(\mu (0) \le 0 \le \mu (1)\), there must exist one and only one \(p^{*}\) such that \(\mu (p^{*}) = 0\).

The above statements prove that the equation (9) has one and only one solution \(p^{*}\). The proof is complete. \(\square\)
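Because \(\mu\) is continuous and strictly increasing on [0, 1] with a sign change, the unique root can also be located numerically by bisection. The following is a minimal sketch (the tolerance and the sample values of \(\gamma\) are arbitrary choices):

```python
import math

def fixed_point(gamma, tol=1e-12):
    """Bisection for the unique root of mu(p) = p - sin^2(p + gamma) on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - math.sin(mid + gamma) ** 2 > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for gamma in (0.0, 0.7, math.pi / 2 - 1.0):
    p = fixed_point(gamma)
    print(gamma, p, math.sin(p + gamma) ** 2)   # p and sin^2(p + gamma) agree up to the tolerance
```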

Property 1

Given any x, the implicit mapping equation (8) has one and only one solution.

Proof

By Lemma 1, for each i, there exists one and only one \(y^{*}_{i}\) such that

$$\begin{aligned} y^{*}_{i} = \sin ^{2}\left[ y^{*}_{i} + \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right] . \end{aligned}$$

Denote \(y^{*} = (y_{1}^{*}, \cdots , y_{n}^{*})^{T}\); clearly, \(y^{*}\) is the unique solution of the implicit equation (8). The proof is complete. \(\square\)

The implicit mapping equation clearly defines an implicit function from x to y. Directly solving this equation to establish a mapping from the input x to the output y is generally impossible. Next, we propose an ODE to solve this problem.

3.3 The proposed nmODE

The implicit mapping equation can be viewed as an equilibrium equation of an ordinary differential equation. To solve this equilibrium equation, an ODE, named nmODE (neural memory Ordinary Differential Equation), is proposed in this paper as follows:

$$\begin{aligned} {\dot{y}} = - y + \sin ^{2} \left[ y + W^{(1)} x + b \right] \end{aligned}$$
(10)

or in component form

$$\begin{aligned} {\dot{y}}_{i} = - y_{i} + \sin ^{2}\left[ y_{i} + \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right] , (i=1, \cdots , n) \end{aligned}$$
(11)

for \(t \ge 0\), where \(y \in R^{n}\), \(x \in R^{m}\), \(b \in R^{n}\), \(W^{(1)}=\left( w^{(1)}_{ij}\right) _{n \times m} \in R^{n \times m}\). From the neural networks point of view, x represents external input to the network, y(t) denotes the state of the network at time t, and y(0) is the initial value. Each neuron used to represent the external input variable x is called an input neuron, while any neuron used to represent the state y is called a memory neuron.

Is there anything special about the nmODE? First, the nmODE is a decoupled system with respect to the memory neurons, which makes the mathematical analysis of its dynamics particularly easy. In addition, the nmODE can be solved simply by solving the one-dimensional ODEs for each i in equation (11), since these one-dimensional ODEs are independent of each other. Second, learning connections exist only from input neurons to memory neurons, and there are no learning connections between memory neurons; thus, learning is separated from memory. Such a structure makes the model particularly easy to train. Third, we will show that the nmODE has the global attractor property for each external input. By regarding the attractor as a memory (Poucet and Save 2005; Wills et al. 2005), a memory mechanism is embedded into the network. Fourth, the nmODE can be viewed as a mathematical model of a column in the neocortex. Recent work in theoretical neuroscience suggests that each column in the neocortex may act as a unit of intelligence (Hawkins and Dawkins 2021); if so, a column must have a complete memory mechanism so that it can model the world.
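As an illustration of the global attractor property, the following minimal sketch (NumPy, forward-Euler integration with an arbitrarily chosen step size; the connection weights, bias, and input are arbitrary placeholder values) integrates the nmODE (10) from two different initial states and checks that both trajectories reach the same fixed point, which satisfies the implicit mapping equation (8):

```python
import numpy as np

# placeholder parameters: five memory neurons, one input neuron
W1 = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])   # learning connections (n x m)
b = np.zeros(5)                                      # bias of the memory neurons
x = np.array([1.0])                                  # external input

def nmode_attractor(y0, dt=0.01, T=30.0):
    """Forward-Euler integration of dy/dt = -y + sin^2(y + W1 @ x + b)."""
    y = y0.copy()
    for _ in range(int(T / dt)):
        y += dt * (-y + np.sin(y + W1 @ x + b) ** 2)
    return y

y_a = nmode_attractor(np.zeros(5))
y_b = nmode_attractor(np.full(5, 0.9))
print(np.abs(y_a - y_b).max())                             # ~0: both initial states reach the same attractor
print(np.abs(y_a - np.sin(y_a + W1 @ x + b) ** 2).max())   # ~0: y* satisfies equation (8)
```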

As the nmODE is a decoupled system, we can define a mathematical model for each neuron as

$$\begin{aligned} {\dot{p}} = -p + \sin ^{2} \left( p + \gamma \right) , \end{aligned}$$
(12)

where \(p \in R^{1}\) and \(\gamma\) represents the perception input to the neuron. We call this one-dimensional ODE the neuronODE. Clearly, the nmODE can be viewed as being built from neuronODE blocks. Unlike many traditional mathematical models of neurons, which are described by simple static functions, the neuronODE is described by a dynamical system.

Next, we will show that the proposed nmODE has several interesting dynamical properties.

Property 2

The set

$$\begin{aligned} S = \left\{ \left. y \in R^{n} \right| \Vert y\Vert _{\infty } \le 1 \right\} , \quad \Vert y\Vert _{\infty } = \max _{1 \le i \le n} \vert y_{i}\vert , \end{aligned}$$

is an invariant set of the nmODE.

Proof

Given any initial \(y(0) \in S\), integrating the differential equation gives

$$\begin{aligned} y_{i}(t) = y_{i}(0) e^{-t} + \int ^{t}_{0} e^{-(t-s)} \cdot \sin ^{2} \left[ y_{i}(s) + \sum _{j=1}^{m} w^{(1)}_{ij} x_{j} + b_{i}\right] ds \end{aligned}$$
(13)

then, since \(0 \le \sin ^{2}(\cdot ) \le 1\),

$$\begin{aligned} y_{i}(t)\le & {} y_{i}(0)e^{-t} + \int ^{t}_{0} e^{-(t-s)} ds \end{aligned}$$
(14)
$$\begin{aligned}= & {} 1 - \left( 1 - y_{i}(0) \right) \cdot e^{-t} \end{aligned}$$
(15)
$$\begin{aligned}\le & {} 1 \end{aligned}$$
(16)

for all \(t \ge 0\). Similarly, \(y_{i}(t) \ge y_{i}(0) e^{-t} \ge -1\) for all \(t \ge 0\). Thus, \(y(t) \in S\) for all \(t \ge 0\), i.e., the set S is an invariant set. The proof is complete. \(\square\)

Property 3

The nmODE (10) has one and only one equilibrium point.

Property 3 follows directly from Property 1.

Lemma 2

The neuronODE (12) has one and only one global attractor.

Proof

By Lemma 1, there exists a unique \(p^{*}\) such that

$$\begin{aligned} p^{*} = \sin ^{2}(p^{*} + \gamma ). \end{aligned}$$

Define a differentiable function

$$\begin{aligned} V(t)=\frac{1}{2}\left( p(t)-p^{*}\right) ^{2} \end{aligned}$$

for \(t \ge 0\). Given any initial value p(0), along any trajectory p(t) starting from p(0), it gives that

$$\begin{aligned} {\dot{V}}(t)= & {} -\left( p(t) - p^{*}\right) ^{2} \\{} & {} + \left[ p(t)-p^{*}\right] \cdot \left[ \sin ^{2}(p(t) + \gamma ) - \sin ^{2}(p^{*} + \gamma ) \right] \\= & {} -\left( 1- \sin \left( 2\theta (t)\right) \right) \cdot \left( p(t) - p^{*}\right) ^{2} \\\le & {} -\left( 1-\left| \sin \left( 2\theta (t)\right) \right| \right) \cdot \left( p(t) - p^{*}\right) ^{2} \\\le & {} 0 \end{aligned}$$

where \(\theta (t)\), given by the mean value theorem, lies between \(p(t) + \gamma\) and \(p^{*} + \gamma\), so that \(\left| \sin \left( 2\theta (t)\right) \right| \le 1\). Let \(t_{k} (k=1, 2, \cdots )\) be the times at which \(\vert \sin \left( 2\theta (t_{k})\right) \vert = 1\). Clearly, \(V(t_{k+1}) \le V(t_{k})\), and on each interval \((t_{k}, t_{k+1})\), \(\left| \sin \left( 2\theta (t)\right) \right| < 1\); thus, \({\dot{V}}(t) < 0\) for \(t \in (t_{k}, t_{k+1})\) whenever \(p(t) \ne p^{*}\). Together with \(V(t_{k+1}) \le V(t_{k})\), it yields that \(V(t) \rightarrow 0\) as \(t \rightarrow +\infty\), i.e., \(p(t) \rightarrow p^{*}\). This proves that \(p^{*}\) is the global attractor. The proof is complete. \(\square\)

Property 4

The nmODE (10) has one and only one global attractor.

Proof

By Lemma 2, for each \(i\) \((1 \le i \le n)\), the equation

$$\begin{aligned} {\dot{y}}_{i} = - y_{i} + \sin ^{2}\left[ y_{i} + \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right] \end{aligned}$$

has one and only one \(y_{i}^{*}\) such that \(y_{i}(t) \rightarrow y_{i}^{*}\) as \(t \rightarrow +\infty\). Denoting \(y^{*} = (y_{1}^{*}, \cdots , y_{n}^{*})^{T}\), clearly, \(y^{*}\) is the global attractor of the nmODE. This completes the proof. \(\square\)

With the property that nmODE has one and only one global attractor \(y^{*}\) which satisfies

$$\begin{aligned} y^{*} = \sin ^{2}\left[ y^{*} + W^{(1)} x + b \right] , \end{aligned}$$

the nmODE defines a nonlinear mapping from external input x to the attractor \(y^{*}\). Figure 5 shows an intuitive illustration for this nmODE nonlinear mapping.

Fig. 5 Given any input x, the proposed nmODE has one and only one global attractor \(y^{*}\). Thus, the nmODE defines a well-defined nonlinear mapping from the external input x to the attractor \(y^{*}\)

Next, we will develop a learning algorithm for the proposed nmODE.

3.4 The proposed nmLA

In this subsection, we will develop a learning algorithm so that the nmODE can be used to learn some targets.

To clearly describe the learning algorithm, define

$$\begin{aligned} \left\{ \begin{array}{ll} z_{i}(t) = \displaystyle \sum _{j=1}^{n} w_{ij}^{(2)} y_{j}(t)\\ a_{i}(t) = s\left( z_{i}(t) \right) \end{array} \right. \end{aligned}$$

where \(W^{(2)} = \left( w^{(2)}_{ij}\right) _{r \times n}\) is another set of learning parameters, s is an activation function, say, softmax. We refer to neurons that represent the output a(t) as decision-making neurons. In nmODE, each memory neuron i produces output \(y_i(t)\) independently at each time t, and all memory neurons form a pattern y(t). To give meaning to the pattern, a learning mechanism is necessary to fuse all \(y_i(t)\). This is where the decision-making neurons come in, as they play a crucial role in this process. Figure 6 shows an intuitive illustration.

Fig. 6 Given an input x, the nmODE outputs y(t) at each time t. Neurons representing a(t) play the role of decision making. Learning means adjusting \(W^{(1)}\), \(W^{(2)}\), and b

Suppose that we have a target d associated with the input x, and would like to let the nmODE learn the target. This can be described by developing some learning rules for updating \(w^{(2)}_{ij}\), \(w^{(1)}_{ij}\) and \(b_{i}\) such that

$$\begin{aligned} a({\bar{t}}) \approx d \end{aligned}$$

for some time \({\bar{t}}\). We define a cost function \(J=J(a({\bar{t}}), d)\) to describe the distance between \(a({\bar{t}})\) and d. Next, we derive learning rules for updating \(w^{(2)}_{ij}, w^{(1)}_{ij}\) and \(b_{i}\).

Denote

$$\begin{aligned} \delta _{i}^{z} = \frac{\partial J}{\partial z_{i}({\bar{t}})}, (i=1, \cdots , r). \end{aligned}$$

Each \(\delta _{i}^{z}\) can be computed directly. It holds that

$$\begin{aligned} \frac{\partial J}{\partial w^{(2)}_{ij}} = \frac{\partial J}{\partial z_{i}({\bar{t}})} \cdot \frac{\partial z_{i}({\bar{t}})}{\partial w^{(2)}_{ij}} = \delta _{i}^{z} \cdot y_{j}({\bar{t}}). \end{aligned}$$

By using the gradient method, it gives a learning rule for updating \(w^{(2)}_{ij}\) as

$$\begin{aligned} w^{(2)}_{ij} \longleftarrow w^{(2)}_{ij} - \alpha \cdot \delta _{i}^{z} \cdot y_{j}({\bar{t}}), \end{aligned}$$
(17)

where \(\alpha\) is a learning rate. Next, we derive a learning rule for updating \(w^{(1)}_{ij}\) and \(b_{i}\).

For convenience, denote

$$\begin{aligned} v_{i}(t) = y_{i}(t) + \sum _{j=1}^{m}w^{(1)}_{ij} x_{j} + b_{i}, (i=1, \cdots , n). \end{aligned}$$

Define a Lagrange function:

$$\begin{aligned} L = J + \sum ^{n}_{k=1} \int ^{{\bar{t}}}_{0} \lambda _{k}(t) \cdot \left[ -y_{k}(t) + \sin ^{2}(v_{k}(t)) - {\dot{y}}_{k}(t)\right] dt, \end{aligned}$$
(18)

where \(\lambda _{k}(t)(k=1, \cdots , n)\) are the Lagrange multipliers. It gives that

$$\begin{aligned} \frac{\partial L}{\partial w_{ij}^{(1)}}= & {} \frac{\partial J}{\partial w_{ij}^{(1)}} + \sum ^{n}_{k=1} \int ^{{\bar{t}}}_{0} \frac{\partial \left[ -y_{k}(t) + \sin ^{2}(v_{k}(t)) - {\dot{y}}_{k}(t)\right] }{\partial w_{ij}^{(1)}} \cdot \lambda _{k}(t) dt \\= & {} \frac{\partial y_{i}({\bar{t}})}{\partial w_{ij}^{(1)}} \cdot \frac{\partial J}{\partial y_{i}({\bar{t}})} + \int ^{{\bar{t}}}_{0} \left[ \frac{\partial \left[ -y_{i}(t) + \sin ^{2}(v_{i}(t)) \right] }{\partial y_{i}(t)} \cdot \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \right. \\{} & {} \left. + \frac{\partial \left[ -y_{i}(t) + \sin ^{2}(v_{i}(t)) \right] }{\partial w_{ij}^{(1)}} - \frac{\partial {\dot{y}}_{i}(t)}{\partial w_{ij}^{(1)}} \right] \cdot \lambda _{i}(t) dt \\= & {} \frac{\partial y_{i}({\bar{t}})}{\partial w_{ij}^{(1)}} \cdot \frac{\partial J}{\partial y_{i}({\bar{t}})} + \int ^{{\bar{t}}}_{0} \left[ -1 + \sin 2v_{i}(t)\right] \cdot \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) dt \\{} & {} + \int ^{{\bar{t}}}_{0} \sin 2v_{i}(t) \cdot x_{j} \cdot \lambda _{i}(t) dt - \int ^{{\bar{t}}}_{0} \frac{\partial {\dot{y}}_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) dt. \\ \end{aligned}$$

Since

$$\begin{aligned} \int ^{{\bar{t}}}_{0} \frac{\partial {\dot{y}}_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) dt= & {} \int ^{{\bar{t}}}_{0} \frac{\partial dy(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) \\= & {} \int ^{{\bar{t}}}_{0} d\left[ \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \right] \cdot \lambda _{i}(t) \\= & {} \left. \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) \right| ^{{\bar{t}}}_{0} - \int ^{{\bar{t}}}_{0} \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot d \lambda _{i}(t) \\= & {} \frac{\partial y_{i}({\bar{t}})}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}({\bar{t}}) - \int ^{{\bar{t}}}_{0} \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot {\dot{\lambda }}_{i}(t) dt \\ \end{aligned}$$

and

$$\begin{aligned} \frac{\partial J}{\partial y_{i}({\bar{t}})} = \sum ^{r}_{j=1} \frac{\partial J}{\partial z_{j}({\bar{t}})} \cdot \frac{\partial z_{j}({\bar{t}})}{\partial y_{i}({\bar{t}})} = \sum ^{r}_{j=1} \delta _{j}^{z} \cdot w_{ji}^{(2)} \end{aligned}$$

then,

$$\begin{aligned} \frac{\partial L}{\partial w_{ij}^{(1)}}= & {} \frac{\partial y_{i}({\bar{t}})}{\partial w_{ij}^{(1)}} \cdot \left[ \sum ^{r}_{j=1} \delta _{j}^{z} \cdot w_{ji}^{(2)} - \lambda _{i}({\bar{t}}) \right] \\{} & {} + \int ^{{\bar{t}}}_{0} \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \left[ \left( -1 + \sin 2v_{i}(t)\right) \cdot \lambda _{i}(t) + {\dot{\lambda }}_{i}(t) \right] dt \\{} & {} + x_{j} \int ^{{\bar{t}}}_{0} \lambda _{i}(t) \cdot \sin 2v_{i}(t) dt. \end{aligned}$$

Since \(\lambda _{k}(t)(k=1, \cdots , n)\) are the Lagrange multipliers, we can take

$$\begin{aligned} \lambda _{i}({\bar{t}}) = \sum ^{r}_{j=1} \delta _{j}^{z} \cdot w_{ji}^{(2)} \end{aligned}$$

and

$$\begin{aligned} {\dot{\lambda }}_{i}(t) = \left[ 1 - \sin 2v_{i}(t)\right] \cdot \lambda _{i}(t) \end{aligned}$$

then, we have

$$\begin{aligned} \frac{\partial L}{\partial w_{ij}^{(1)}} = x_{j} \int ^{{\bar{t}}}_{0} \sin 2v_{i}(t) \cdot \lambda _{i}(t) dt. \end{aligned}$$

Noting that the constraint \(-y_{k}(t) + \sin ^{2}(v_{k}(t)) - {\dot{y}}_{k}(t) = 0\) holds identically along the trajectory, we have \(L = J\), and hence

$$\begin{aligned} \frac{\partial J}{\partial w_{ij}^{(1)}} = \frac{\partial L}{\partial w_{ij}^{(1)}} = x_{j} \int ^{{\bar{t}}}_{0} \sin 2v_{i}(t) \cdot \lambda _{i}(t) dt. \end{aligned}$$

Define gradient functions by

$$\begin{aligned} \xi _{ij}(t) = x_{j} \int ^{{\bar{t}}}_{t} \sin 2v_{i}(\theta ) \cdot \lambda _{i}(\theta ) d\theta \end{aligned}$$

for \(0 \le t \le {\bar{t}}\) and \(1 \le i \le n, 1 \le j \le m\), clearly,

$$\begin{aligned} \xi _{ij}(0) = \frac{\partial L}{\partial w_{ij}^{(1)}}, \quad \xi _{ij}({\bar{t}})=0, \end{aligned}$$

and

$$\begin{aligned} {\dot{\xi }}_{ij}(t) = -x_{j} \cdot \sin 2v_{i}(t) \cdot \lambda _{i}(t). \end{aligned}$$

Similarly, define

$$\begin{aligned} \eta _{i}(t) = \int ^{{\bar{t}}}_{t} \sin 2v_{i}(\theta ) \cdot \lambda _{i}(\theta ) d\theta \end{aligned}$$

for \(0 \le t \le {\bar{t}}\) and \(1 \le i \le n\), it holds that

$$\begin{aligned} \eta _{i}(0) = \frac{\partial L}{\partial b_{i}}, \quad \eta _{i}({\bar{t}})=0, \end{aligned}$$

and

$$\begin{aligned} {\dot{\eta }}_{i}(t) = - \sin 2v_{i}(t) \cdot \lambda _{i}(t). \end{aligned}$$

The functions \(\xi _{ij}(t)\) and \(\eta _{i}(t)\) satisfy the relationship \(\xi _{ij}(t) = x_{j} \cdot \eta _{i}(t)\); in particular, it holds that

$$\begin{aligned} \xi _{ij}(0) = x_{j} \cdot \eta _{i}(0). \end{aligned}$$
(19)

This means that the gradient \(\xi _{ij}(0)\) can be computed by using the gradient \(\eta _{i}(0)\).

From the above analysis, for each i, we can get a three-dimensional differential equation

$$\begin{aligned} \left\{ \begin{array}{lll} {\dot{y}}_{i} = - y_{i} + \sin ^{2}\left( y_{i} + \displaystyle \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right) \\ {\dot{\lambda }}_{i} = \left[ 1 - \sin 2\left( y_{i} + \displaystyle \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right) \right] \cdot \lambda _{i}\\ \dot{\eta _{i}} = - \sin 2\left( y_{i} + \displaystyle \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right) \cdot \lambda _{i} \end{array} \right. \end{aligned}$$
(20)

with the values at time \({\bar{t}}\) given by

$$\begin{aligned} \left\{ \begin{array}{llll} y_{i}({\bar{t}}) = y_{i}({\bar{t}})\\ \lambda _{i}({\bar{t}}) = \displaystyle \sum ^{r}_{j=1} \delta _{j}^{z} \cdot w_{ji}^{(2)}\\ \eta _{i}({\bar{t}}) = 0 \end{array} \right. \end{aligned}$$
(21)

Solving the above three-dimensional ODE backward from time \({\bar{t}}\) to time 0 for each i yields \(\eta _{i}(0)\). Using the gradient relation (19), \(\xi _{ij}(0)\) can then be computed. Then, by using the gradient method, a learning rule is given by

$$\begin{aligned} \left\{ \begin{array}{ccccc} w_{ij}^{(1)} &{} \longleftarrow &{} w_{ij}^{(1)} &{} - &{} \beta \cdot \xi _{ij}(0)\\ \\ b_{i} &{} \longleftarrow &{} b_{i} &{} - &{} \beta \cdot \eta _{i}(0) \end{array} \right. \end{aligned}$$
(22)

where \(\beta\) is a learning rate.

To facilitate the description of the algorithm that follows, we define a three-dimensional ODE, called the inverse ordinary differential equation (invODE):

$$\begin{aligned} \left\{ \begin{array}{lll} {\dot{p}} = - p + \sin ^{2}\left( p + \gamma \right) \\ {\dot{\lambda }} = \lambda \cdot \left[ 1 - \sin 2\left( p + \gamma \right) \right] \\ {\dot{\eta }} = - \lambda \cdot \sin 2\left( p + \gamma \right) \end{array} \right. \end{aligned}$$
(23)

for \(t \ge 0\). In the invODE (23), the first equation is the neuronODE defined in (12), the second equation is the Lagrange multiplier equation, and the third is the gradient equation. It should be noted that most existing inverse ODEs used for constructing learning algorithms are of very high dimension and are thus computationally inefficient.
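The following minimal sketch (plain Python, scalar case, fixed-step Euler integration; the perception input \(\gamma\), time horizon, and step size are placeholder choices) solves the invODE (23) backward from \({\bar{t}}\) to 0 as in (20)-(21). For illustration, the cost is taken to be \(J = y({\bar{t}})\), so the terminal multiplier is 1, and the resulting \(\eta (0)\) is compared against a finite-difference estimate of \(\partial y({\bar{t}})/\partial \gamma\):

```python
import math

gamma, y0, t_bar, dt = 0.3, 0.0, 5.0, 1e-3
steps = int(t_bar / dt)

def forward(g):
    """Euler-integrate the neuronODE dp/dt = -p + sin^2(p + g) up to t_bar."""
    p = y0
    for _ in range(steps):
        p += dt * (-p + math.sin(p + g) ** 2)
    return p

# Backward pass: integrate the invODE (23) from t_bar down to 0.
# Terminal values follow (21) with J = y(t_bar), hence lambda(t_bar) = 1, eta(t_bar) = 0.
p, lam, eta = forward(gamma), 1.0, 0.0
for _ in range(steps):
    v = p + gamma
    dp = -p + math.sin(v) ** 2
    dlam = (1.0 - math.sin(2 * v)) * lam
    deta = -math.sin(2 * v) * lam
    p, lam, eta = p - dt * dp, lam - dt * dlam, eta - dt * deta

h = 1e-5
fd = (forward(gamma + h) - forward(gamma - h)) / (2 * h)   # finite-difference check
print(eta, fd)   # the two gradient estimates agree up to discretization error
```

Because the forward dynamics are contracting, re-integrating the neuronODE backward can amplify numerical error; a smaller step size (or storing the forward trajectory) tightens the agreement.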

As a summary of the above description, a learning algorithm, called neural memory learning algorithm (nmLA), can be given as follows.

Step 1: Input training data set X.

Step 2: Initialize parameters \(W^{(1)}, W^{(2)}, b, \alpha , \beta , y(0)\) and \({\bar{t}}\).

Step 3: Select a mini-batch \(X_{m} \subset X\).

Step 4: For each \(x \in X_{m}\), for each i, solve the corresponding one-dimensional ODE in (11) to get \(y_{i}({\bar{t}})\).

Step 5: Calculate \(\delta ^{z}_{i}\).

Step 6: For each i, solve the three-dimensional invODE to get \(\eta _{i}(0)\) and using (19) to compute \(\xi _{ij}(0)\).

Step 7: Sum up the gradients \(\xi _{ij}(0)\) and \(\eta _{i}(0)\) over the mini-batch, respectively.

Step 8: Use (17) and (22) for updating \(w_{ij}^{(1)}\), \(w_{ij}^{(2)}\) and \(b_{i}\).

Step 9: Back to step 3 until \(w_{ij}^{(1)}\), \(w_{ij}^{(2)}\) and \(b_{i}\) converge.

Step 10: Return \(W^{(1)}, W^{(2)}\) and b.

In the nmLA, the learning rates \(\alpha\) and \(\beta\) control the convergence speed of the connection weights. Choosing these parameters too large or too small can affect the convergence; in practice, they are often chosen by trial and error. The initial value y(0) of the nmODE is suggested to be chosen from the invariant set S, such as \(y(0)=0\), to ensure fast convergence to the attractor \(y^{*}\). While the terminal time \({\bar{t}}\) could also be optimized by learning, the nmODE converges very quickly, so in practice it suffices to fix an appropriate value.

3.5 Biological point of view on nmODE

Mountcastle (Mountcastle 1978, 1997) proposed the innovative idea that cortical columns have uniform architectures and carry out similar functions, suggesting that brain columns share a common cortical algorithm. More recently, Hawkins (Hawkins and Dawkins 2021; Hawkins et al. 2019) proposed another surprising idea that the unit of intelligence in the brain is the cortical column, each of which can learn to model the world by utilizing the endowed reference frame property. Since columns run a common algorithm, it is necessary for multiple columns to work together to make a decision in order to complete a task. Hawkins believed that this process is achieved through voting. He referred to this theory as “A Thousand Brains”. If this hypothesis is accurate, it could contribute to a better understanding of brain intelligence.

It is widely recognized that memory plays a crucial role in intelligence. If we accept that the cortical column is the unit of intelligence, it follows that each column must possess a memory mechanism. On the other hand, a column in the neocortex may contain several classes of neurons, such as pyramidal neurons, stellate neurons, etc. Different classes of neurons may have different functions in a column. There may be a class of neurons that execute the function of memory; we call such neurons memory neurons. Inspired by the hypothesis of Mountcastle and Hawkins that there is a common algorithm for columns, we propose a hypothesis that there is a common algorithm for the memory neurons in a column; furthermore, all the memory neurons obey the same dynamical system regardless of perception input. Since the memory neurons in a column run the same dynamical system independently, a mechanism is needed to combine them so that the column can make a decision; we propose that this mechanism is realized by decision neurons. Figure 7 gives an illustration.

Fig. 7 Hypothesis on columns and neurons: a column can be considered a unit of intelligence that executes a shared algorithm. Decision-making occurs through voting among a group of columns. Within a column, there is a class of memory neurons that operate using a common algorithm. These memory neurons connect to decision neurons, which facilitate the column’s overall decision-making process

From an artificial intelligence perspective, establishing a mathematical model for columns is crucial for solving practical problems. When building the model, some considerations should be kept in mind, such as: (1) the memory neurons in a column should have the same dynamical system and an embedded memory mechanism, (2) each column should be able to receive perception input and deliver stable output, (3) in the absence of perception inputs, the dynamics of a column should remain in a resting state, (4) the math model should have a simple structure with clear dynamics for analysis, and (5) the math model should have sufficiently high nonlinearity to provide powerful representation capability.

The proposed nmODE appears to satisfy all the requirements listed above and can be viewed as a mathematical model of a column. When a perception input x is received by the column, it delivers a stable output \(y^{*}\), which is the global attractor and can be interpreted as the memory associated with the input x. In the absence of perception input, i.e., \(x=0\), the nmODE reduces to \({\dot{y}} = - y + \sin ^{2}\left[ y + b \right]\). It is easy to see that the state then stays in a resting state \(y^{*}\) which satisfies \(y^{*} = \sin ^{2}\left[ y^{*} + b \right]\). If, in addition, there is no bias, i.e., \(b=0\), then \(y^{*}=0\). This property can be interpreted as the column remaining silent in the absence of perception input.

The nmODE is a decoupled system, which implies that different neurons share the same functional form. In fact, the nmODE can be rewritten as

$$\begin{aligned} {\dot{y}}_{i} = -y_{i} + \sin ^{2} \left( y_{i} + \gamma _{i}\right) \end{aligned}$$

where \(\gamma _{i}\) can be considered the perception input to the \(i\)-th neuron. Removing the neuron index and replacing \(y_{i}\) by the one-dimensional variable p, each neuron obeys the dynamics of the neuronODE (12), which can be viewed as the common dynamical system of all memory neurons.

4 The proposed \(\epsilon\)-network

The proposed nmODE is described by a specific ordinary differential equation. In practical applications, this is sometimes inconvenient to compute, in particular on digital computers. In this section, we propose a discrete version of the nmODE, called the \(\epsilon\)-net, together with an efficient learning algorithm.

4.1 The \(\epsilon\)-net Structure

The proposed \(\epsilon\)-net is a discrete version of the nmODE described by an iterative equation as

$$\begin{aligned} y^{l+1} = (1 - \epsilon ) \cdot y^{l} + \epsilon \cdot \sin ^{2}\left[ y^{l} + W^{(1)} x + b \right] \end{aligned}$$
(24)

where \(0< \epsilon < 1\) is a parameter and l denotes the layer number. Given an initial point \(y^{0} \in R^{n}\), the set \(\{\left. y^{l} \right| l \ge 0\}\) forms a trajectory in \(R^{n}\). Digital computation of the \(\epsilon\)-net simply iterates from layer l to layer \(l+1\); Fig. 8 shows an intuitive illustration.

Fig. 8 The proposed \(\epsilon\)-net. The red circles represent the input neurons. The dashed circles denote the virtual neurons with the \(\sin ^{2}\) function as activation function. The remaining circles denote the network inner representation neurons with the linear function as activation function
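A minimal NumPy sketch of the \(\epsilon\)-net iteration (24) (the weights, bias, and input are arbitrary placeholder values; \(\epsilon\) and the number of iterations are arbitrary choices), checking that the iterates settle at the fixed point of equation (8):

```python
import numpy as np

eps = 0.3
W1 = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])   # placeholder learning connections (n x m)
b = np.zeros(5)                                      # placeholder bias
x = np.array([1.0])                                  # placeholder external input

y = np.zeros(5)                                      # initial point y^0
for _ in range(1000):                                # y^{l+1} = (1 - eps) y^l + eps sin^2(y^l + W1 x + b)
    y = (1 - eps) * y + eps * np.sin(y + W1 @ x + b) ** 2

print(np.abs(y - np.sin(y + W1 @ x + b) ** 2).max())   # ~0: the iterates settle at the attractor y*
```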

The \(\epsilon\)-net is in fact a discrete-time dynamical system. As in the nmODE, the equilibrium point \(y^{*}\) is defined by

$$\begin{aligned} y^{*} = \sin ^{2}\left[ y^{*} + W^{(1)} x + b \right] . \end{aligned}$$

Given an input x, clearly, the \(\epsilon\)-net has one and only one equilibrium point \(y^{*}\). The equilibrium point \(y^{*}\) is called a global attractor if, for any initial point \(y^{0} \in R^{n}\), the trajectory \(\{\left. y^{l} \right| l \ge 0\}\) starting from \(y^{0}\) converges to \(y^{*}\), i.e.,

$$\begin{aligned} y^{l} \longrightarrow y^{*} \end{aligned}$$

as \(l \rightarrow +\infty\). If the above convergence holds for all \(y^{0} \in R^{n}\) except for a set of measure zero, we call \(y^{*}\) an almost global attractor.

In Sect. 3, it was proved that the nmODE has one and only one global attractor. A natural question is whether the \(\epsilon\)-net also possesses this property. From the theory of ODEs, the fact that an ODE has the global attractor property does not, in general, guarantee that its discrete version also has this property. For example, the ODE

$$\begin{aligned} {\dot{y}} = -y - y^{3} \end{aligned}$$

has the global attractor property, in fact, \(y^{*}=0\) is the global attractor. However, its discrete version

$$\begin{aligned} y^{l+1} = (1-\epsilon ) y^{l} -\epsilon \left( y^{l} \right) ^3 \end{aligned}$$

does not have this property for any \(\epsilon \in (0, 1)\). In fact, it can be easily checked that, for a given \(\epsilon\), any trajectory starting from a sufficiently large initial value, for example \(\vert y^{0}\vert \ge 2/\sqrt{\epsilon }\), will diverge. Thus, it is necessary to check the global attractor property of the \(\epsilon\)-net. Fortunately, we can prove that the proposed \(\epsilon\)-net does possess an almost global attractor property.
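A quick numerical illustration of the divergence of the naive discretization above (plain Python; the particular values of \(\epsilon\), the initial points, and the step counts are arbitrary, with the diverging case chosen so that \(\vert y^{0}\vert \ge 2/\sqrt{\epsilon }\)):

```python
def iterate(eps, y, steps):
    """Iterate the discretization y <- (1 - eps) * y - eps * y**3."""
    for _ in range(steps):
        y = (1 - eps) * y - eps * y ** 3
    return y

print(iterate(0.01, 3.0, 500))   # small eps: mimics the ODE and decays toward 0
print(iterate(0.5, 3.0, 6))      # eps = 0.5, |y^0| = 3 >= 2 / sqrt(0.5): blows up rapidly
```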

Property 5

Given any input x, the \(\epsilon\)-net has one and only one almost global attractor.

Proof

Given any input x, suppose that \(y^{*}\) is the unique equilibrium of the \(\epsilon\)-net. For any initial \(y^{0}\), we have

$$\begin{aligned} y^{l+1} - y^{*}= & {} (1-\epsilon ) \left[ y^{l} - y^{*}\right] \\{} & {} + \epsilon \cdot \left[ \sin ^{2}\left( y^{l} + W^{(1)} x + b \right) - \sin ^{2}\left( y^{*} + W^{(1)} x + b \right) \right] . \end{aligned}$$

Then, it gives that

$$\begin{aligned} \left| y^{l+1}_{i} - y^{*}_{i}\right|\le & {} (1-\epsilon ) \left| y^{l}_{i} - y^{*}_{i} \right| \\{} & {} + \epsilon \cdot \left| \sin (2\theta _{i}^{l}) \right| \cdot \left| y^{l}_{i} - y^{*}_{i} \right| \\= & {} \delta _{i}^{l} \cdot \left| y^{l}_{i} - y^{*}_{i} \right| \\\le & {} \displaystyle \prod _{k=0}^{l} \delta _{i}^{k} \cdot \left| y^{0}_{i} - y^{*}_{i} \right| , \end{aligned}$$

where \(\delta _{i}^{l} = 1 - \epsilon \left( 1-\left| \sin (2\theta _{i}^{l})\right| \right)\), and \(\theta _{i}^{l}\) is located between \(y^{l}_{i} + \sum ^{m}_{j=1}w_{ij} x_{j} + b_{i}\) and \(y^{*}_{i} + \sum ^{m}_{j=1}w_{ij} x_{j} + b_{i}\). Since \(0 < \delta _{i}^{l} \le 1\), and the measure of the set \(S_{i} = \{l \vert \delta _{i}^{l} = 1, l \ge 1\}\) is zero, thus, almost

$$\begin{aligned} \prod _{k=0}^{l} \delta _{i}^{k} \longrightarrow 0 \end{aligned}$$

as \(l \rightarrow +\infty\), and almost

$$\begin{aligned} y^{l}_{i} \longrightarrow y^{*}_{i}, (i=1, \cdots , n) \end{aligned}$$

as \(l \rightarrow +\infty\). The proof is complete. \(\square\)

The above proof suggests that the \(\epsilon\)-net has an almost global attractor property. We hypothesize that it actually possesses the global attractor property, but we are currently unable to prove this.

Similarly to the nmODE, the \(\epsilon\)-net defines a nonlinear mapping from the input x to the global attractor \(y^{*}\), i.e.,

$$\begin{aligned} \epsilon -\text{ net-mapping }: x \longrightarrow y^{*}. \end{aligned}$$

Figure 9 gives an intuitive illustration for this mapping.

Fig. 9 The \(\epsilon\)-net mapping. It maps the input x to the associated global attractor \(y^{*}\)

Regarding attractors as memories, the \(\epsilon\)-net-mapping establishes a relationship between input x and the stored memory \(y^{*}\).

4.2 The proposed \(\epsilon\)-LA

The \(\epsilon\)-net establishes a nonlinear mapping from the input x to the global attractor \(y^{*}\) as the network’s output. To apply this nonlinear mapping in practice, a learning algorithm must be developed. In this paper, we propose an efficient learning algorithm for the \(\epsilon\)-net, called \(\epsilon\)-LA.

In order to establish a learning algorithm, we define the network output at layer \(l+1\) as

$$\begin{aligned} \left\{ \begin{array}{ll} z^{l+1}_{i} = \displaystyle \sum _{j=1}^{n} w_{ij}^{(2)} y_{j}^{l+1}\\ a^{l+1}_{i} = s\left( z^{l+1}_{i} \right) \\ v_{i}^{l} = y_{i}^{l} + \displaystyle \sum _{j=1}^{m}w^{(1)}_{ij} x_{j} + b_{i} \end{array} \right. \end{aligned}$$
(25)

where \(z^{l+1} \in R^{r}\), \(a^{l+1} \in R^{r}\), \(W^{(2)} = \left( w^{(2)}_{ij}\right) _{r \times n}\) is an \(r \times n\) matrix, and s is an activation function that can be selected according to the application; for example, in classification problems, the softmax activation function is frequently used. We call neurons representing \(a^{l+1}\) decision-making neurons. Figure 10 gives an illustration of the computation.

Fig. 10 The proposed \(\epsilon\)-net learning algorithm. The bold green circles represent decision-making neurons with the softmax function as activation function

Suppose that the network is required to learn a target \(d_{i}\) \((i=1, \cdots , r)\). Using the output \(a_{i}^{l+1}\) together with the learning target \(d_{i}\), a cost function \(J^{l+1}\) can be constructed, for example the well-known cross entropy. Clearly, \(J^{l+1}\) is a function of \(z_{i}^{l+1}\) and \(w^{(2)}_{ij}\). To use the gradient method, we first need to calculate the gradients:

$$\begin{aligned} \frac{\partial J^{l+1}}{\partial w_{ij}^{(2)}}, \quad \frac{\partial J^{l+1}}{\partial w_{ij}^{(1)}}, \quad \frac{\partial J^{l+1}}{\partial b_{i}}. \end{aligned}$$

Define

$$\begin{aligned} \left\{ \begin{array}{ll} \delta _{i}^{l+1}(z) = \displaystyle \frac{\partial J^{l+1}}{\partial z_{i}^{l+1}},\\ \\ \delta _{i}^{l}(v) = \displaystyle \frac{\partial J^{l+1}}{\partial v_{i}^{l}}. \end{array} \right. \end{aligned}$$

The \(\delta _{i}^{l+1}(z)\) can be calculated directly. For example, when \(d_{1}+\cdots + d_{r} = 1\), s is the softmax function, and \(J^{l+1}\) is the cross entropy:

$$\begin{aligned} J^{l+1} = -\sum ^{r}_{i=1} d_{i} \cdot \log \left( a_{i}^{l+1}\right) \end{aligned}$$

then, \(\delta _{i}^{l+1}(z) = a_{i}^{l+1} - d_{i}\).
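A small NumPy check of this well-known softmax/cross-entropy identity (the test values are arbitrary):

```python
import numpy as np

z = np.array([0.2, -1.3, 0.8])    # readout z^{l+1}
d = np.array([0.0, 1.0, 0.0])     # one-hot target, d_1 + ... + d_r = 1
a = np.exp(z) / np.exp(z).sum()   # softmax output a^{l+1}

def J(z):
    """Cross entropy J^{l+1} = -sum_i d_i log a_i as a function of z."""
    a = np.exp(z) / np.exp(z).sum()
    return -(d * np.log(a)).sum()

h = 1e-6
num_grad = np.array([(J(z + h * e) - J(z - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(num_grad, a - d, atol=1e-6))   # True: delta^{l+1}(z) = a - d
```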

Using \(\delta _{i}^{l+1}(z)\), we can calculate the gradient

$$\begin{aligned} \frac{\partial J^{l+1}}{\partial w_{ij}^{(2)}} = \frac{\partial J^{l+1}}{\partial z_{i}^{l+1}} \cdot \frac{\partial z_{i}^{l+1}}{\partial w_{ij}^{(2)}} = \delta _{i}^{l+1}(z) \cdot y_{j}^{l+1}. \end{aligned}$$
(26)

Next, we calculate \(\frac{\partial J^{l+1}}{\partial w_{ij}^{(1)}}\) and \(\frac{\partial J^{l+1}}{\partial b_{i}}\). We have

$$\begin{aligned} \delta _{i}^{l}(v)= & {} \frac{\partial J^{l+1}}{\partial y_{i}^{l+1}} \cdot \frac{\partial y_{i}^{l+1}}{\partial v_{i}^{l}} \nonumber \\= & {} \epsilon \cdot \sin (2v_{i}^{l}) \cdot \frac{\partial J^{l+1}}{\partial y_{i}^{l+1}} \nonumber \\= & {} \epsilon \cdot \sin (2v_{i}^{l}) \cdot \sum ^{r}_{j=1}\frac{\partial J^{l+1}}{\partial z_{j}^{l+1}} \cdot \frac{\partial z_{j}^{l+1}}{\partial y_{i}^{l+1}} \nonumber \\= & {} \epsilon \cdot \sin (2v_{i}^{l}) \cdot \sum ^{r}_{j=1}w^{(2)}_{ji} \cdot \delta _{j}^{l+1}(z). \end{aligned}$$
(27)

Using

$$\begin{aligned} y_{i}^{l+1} = (1-\epsilon ) y_{i}^{l} + \epsilon \cdot \sin ^{2}(v_{i}^{l}) \end{aligned}$$

and noting that \(y_{i}^{0}\) is independent of \(w_{ij}^{(1)}x_{j}\) and \(b_{i}\), it gives that

$$\begin{aligned} \left\{ \begin{array}{llll} \displaystyle \frac{\partial y_{i}^{l+1}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } = (1-\epsilon ) \frac{\partial y_{i}^{l}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } + \epsilon \cdot \sin (2 v_{i}^{l}) \cdot \left[ 1 + \frac{\partial y_{i}^{l}}{\partial \left( w_{ij}^{(1)}x_{j}\right) }\right] \\ \\ \displaystyle \frac{\partial y_{i}^{l+1}}{\partial b_{i}} = (1-\epsilon ) \frac{\partial y_{i}^{l}}{\partial b_{i}} + \epsilon \cdot \sin (2 v_{i}^{l}) \cdot \left[ 1 + \frac{\partial y_{i}^{l}}{\partial b_{i}}\right] \\ \\ \displaystyle \frac{\partial y_{i}^{0}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } = 0\\ \\ \displaystyle \frac{\partial y_{i}^{0}}{\partial b_{i}} = 0. \end{array} \right. \end{aligned}$$
(28)

Using \(\delta _{i}^{l+1}(z)\) and \(\delta _{i}^{l}(v)\), we can calculate the gradient

$$\begin{aligned} \frac{\partial J^{l+1}}{\partial w_{ij}^{(1)}} = \frac{\partial J^{l+1}}{\partial v_{i}^{l}} \cdot \frac{\partial v_{i}^{l}}{\partial w_{ij}^{(1)}} = \delta _{i}^{l}(v) \cdot \frac{\partial v_{i}^{l}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } \cdot x_{j} = \delta _{i}^{l}(v) \cdot \left[ 1 + \frac{\partial y_{i}^{l}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } \right] x_{j}. \end{aligned}$$
(29)

Similarly, we have

$$\begin{aligned} \frac{\partial J^{l+1}}{\partial b_{i}} = \delta _{i}^{l}(v) \cdot \left[ 1 + \frac{\partial y_{i}^{l}}{\partial b_{i}} \right] . \end{aligned}$$
(30)

Then, by using the well-known gradient method, the learning rule is given as

$$\begin{aligned} \left\{ \begin{array}{ccl} w_{ij}^{(2)} &{} \longleftarrow &{} w_{ij}^{(2)} - \alpha \cdot \delta _{i}^{l+1}(z) \cdot y_{j}^{l+1}\\ \\ w_{ij}^{(1)} &{} \longleftarrow &{} w_{ij}^{(1)} - \beta \cdot \delta _{i}^{l}(v) \cdot \left[ x_{j} + \displaystyle \frac{\partial y_{i}^{l}}{\partial w_{ij}^{(1)}}\right] \\ \\ b_{i} &{} \longleftarrow &{} b_{i} - \beta \cdot \delta _{i}^{l}(v) \cdot \left[ 1 + \displaystyle \frac{\partial y_{i}^{l}}{\partial b_{i}}\right] \end{array} \right. \end{aligned}$$
(31)

where \(\alpha\) and \(\beta\) are the learning rates.

Directly solving the iteration equations (24) and (28) is not easy. Fortunately, these equations are decoupled in their variables. Next, we define a mathematical model of a neuron in discrete time as well as an adjoint equation. These equations can be viewed as the basic building blocks for describing the learning algorithm.

The mathematical model of a neuron in discrete time, called the neuronDTE, is defined as

$$\begin{aligned} p^{l+1} = (1-\epsilon ) p^{l} + \epsilon \cdot \sin ^{2} \left( p^{l} + \gamma \right) , \end{aligned}$$
(32)

where \(\gamma\) can be viewed as the perception input to the neuron.

The adjoint equation of neuronDTE, called neuronADE, is defined as

$$\begin{aligned} \left\{ \begin{array}{ll} p^{l+1} = (1-\epsilon ) p^{l} + \epsilon \cdot \sin ^{2} \left( p^{l} + \gamma \right) \\ \\ q^{l+1} = (1-\epsilon ) q^{l} +\epsilon \cdot \sin 2\left( p^{l} + \gamma \right) \cdot \left( 1 + q^{l}\right) , \end{array} \right. \end{aligned}$$
(33)

Clearly, the neuronADE is a two-dimensional equation.
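A minimal sketch of the neuronDTE (32) and its adjoint neuronADE (33) in plain Python (the values of \(\epsilon\), \(\gamma\), and the number of layers are placeholders). Here the second component \(q^{l}\) tracks the sensitivity \(\partial y_{i}^{l} / \partial ( w_{ij}^{(1)} x_{j})\) appearing in (28)-(31), which can be checked against a finite difference of the neuronDTE with respect to \(\gamma\):

```python
import math

eps, gamma, L = 0.1, 0.4, 200

def neuron_dte(g, steps=L, p=0.0):
    """neuronDTE (32): p^{l+1} = (1 - eps) p^l + eps sin^2(p^l + g)."""
    for _ in range(steps):
        p = (1 - eps) * p + eps * math.sin(p + g) ** 2
    return p

# neuronADE (33): iterate p together with its sensitivity q = dp/dgamma (q^0 = 0)
p, q = 0.0, 0.0
for _ in range(L):
    s = math.sin(2 * (p + gamma))
    p, q = (1 - eps) * p + eps * math.sin(p + gamma) ** 2, (1 - eps) * q + eps * s * (1 + q)

h = 1e-6
fd = (neuron_dte(gamma + h) - neuron_dte(gamma - h)) / (2 * h)
print(q, fd)   # the adjoint recursion reproduces the finite-difference sensitivity
```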

From the above analysis, a learning algorithm, called \(\epsilon\)-LA, can be described as:

Step 1: Input training data set X.

Step 2: Initialize parameters \(W^{(1)}\), \(W^{(2)}\), b, \(\alpha\), \(\beta\), \(\epsilon\).

Step 3: Select a mini-batch \(X_{m} \subset X\) and set \(l \leftarrow 0\).

Step 4: For each \(x \in X_{m}\), for each i, calculate \(y_{i}^{l+1}\) by neuronDTE (32) and the network output \(a_{i}^{l+1}\).

Step 5: Calculate \(\delta _{i}^{l+1}\left( z\right)\).

Step 6: Accumulate \(\frac{\partial J^{l+1}}{\partial w^{(2)}_{ij}}\) by (26).

Step 7: Calculate \(\delta ^{l}_{i}(v)\).

Step 8: Accumulate \(\frac{\partial J^{l+1}}{\partial w^{(1)}_{ij}}\) and \(\frac{\partial J^{l+1}}{\partial b_{i}}\) by (29) and (30) with \(q^{l}_{i}\).

Step 9: Calculate \(q^{l+1}\) by neuronADE (33).

Step 10: Update \(w_{ij}^{(2)}\), \(w_{ij}^{(1)}\), and \(b_{i}\) by using (31) with \(\frac{\partial J^{l+1}}{\partial w^{(2)}_{ij}}\), \(\frac{\partial J^{l+1}}{\partial w^{(1)}_{ij}}\) and \(\frac{\partial J^{l+1}}{\partial b_{i}}\).

Step 11: Back to Step 3 with \(l \leftarrow l + 1\) until \(W^{(1)}\), \(W^{(2)}\), and b converge.

The proposed \(\epsilon\)-LA has two obvious advantages: (1) unlike traditional feedforward networks, it does not require the number of network layers to be designed in advance; (2) there is no gradient vanishing problem, since gradient backpropagation happens over only one layer at each step. A detailed implementation of the \(\epsilon\)-LA is given in the Appendix.
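To make the flow of Steps 1-11 concrete, the following is a compact NumPy sketch of the \(\epsilon\)-LA loop on a single training pair (x, d) (random placeholder data and initialization; the layer count, learning rates, and \(\epsilon\) are arbitrary choices, and a batched PyTorch implementation would follow the same structure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, eps, alpha, beta, L = 16, 4, 3, 0.2, 0.05, 0.05, 100

W1 = 0.1 * rng.standard_normal((n, m))   # input -> memory neurons
W2 = 0.1 * rng.standard_normal((r, n))   # memory -> decision-making neurons
b = np.zeros(n)

x = rng.standard_normal(m)               # one training input
d = np.eye(r)[1]                         # its one-hot target

y = np.zeros(n)                          # y^0
q = np.zeros(n)                          # q^0 = dy^0/d(w x + b) = 0
for l in range(L):
    v = y + W1 @ x + b                                      # v^l in (25)
    y_next = (1 - eps) * y + eps * np.sin(v) ** 2           # epsilon-net step (24)
    z = W2 @ y_next
    a = np.exp(z) / np.exp(z).sum()                         # softmax output a^{l+1}
    delta_z = a - d                                         # delta^{l+1}(z) for cross entropy
    delta_v = eps * np.sin(2 * v) * (W2.T @ delta_z)        # delta^l(v) as in (27)
    W2 -= alpha * np.outer(delta_z, y_next)                 # update rules (26), (31)
    W1 -= beta * np.outer(delta_v * (1 + q), x)             # (29), (31)
    b -= beta * delta_v * (1 + q)                           # (30), (31)
    q = (1 - eps) * q + eps * np.sin(2 * v) * (1 + q)       # neuronADE (33)
    y = y_next

z = W2 @ y
print("cross entropy:", -np.log(np.exp(z)[1] / np.exp(z).sum()))   # compare with the initial value log(3)
```

Each pass uses only the current layer's quantities, which is why no gradient has to be propagated back through earlier layers.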

5 Experiments

In this section, we conduct experiments to showcase the effectiveness of the proposed theory. We first present results on the widely known MNIST dataset for classification. We also report results on real-world ultrasonic image data for breast cancer classification, where we train multiple models and combine them by voting, and we evaluate the learning ability of \(\epsilon\)-LA on unlimited-length sequence data generated by the Embedded Reber Grammar. The experiments are implemented in PyTorch using NVIDIA RTX 3090 GPUs on an Ubuntu server.

5.1 Experiment on MNIST data

5.1.1 Classification results of nmODE

We first carry out the experiment of the nmODE on the MNIST dataset, a well-known benchmark of handwritten digits from 0 to 9. The training set and the testing set contain 50,000 and 10,000 images, respectively, each of size \(28 \times 28\). Five nmODEs are independently trained by using the nmLA. The accuracy of each network and the voting result on the testing set are summarized in Table 1. As can be seen in the table, all nmODEs achieve competitive accuracy of over 99.52%. By voting among the five models, the accuracy reaches 99.71%.

Table 1 Accuracy of the nmODE on the testing set of MNIST dataset
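
The voting step itself is straightforward; a minimal sketch of majority voting over the per-model predictions is shown below (the tie-breaking rule is an assumption, since the paper does not specify one).

```python
import numpy as np

def majority_vote(prob_list):
    # prob_list: per-model class probabilities, each of shape [n_samples, n_classes].
    votes = np.stack([p.argmax(axis=1) for p in prob_list], axis=1)
    n_classes = prob_list[0].shape[1]
    counts = np.apply_along_axis(lambda v: np.bincount(v, minlength=n_classes), 1, votes)
    return counts.argmax(axis=1)   # ties go to the smallest class index

# Example with five hypothetical model outputs on three samples and ten classes.
rng = np.random.default_rng(0)
predicted_labels = majority_vote([rng.random((3, 10)) for _ in range(5)])
```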

For the classification task on the MNIST dataset, previous neural differential equation models such as SONODEs (Norcliffe et al. 2020), ANODEs (Dupont et al. 2019), and NODEs (Chen et al. 2018) use CNN blocks in their architectures, while the proposed nmODEs take the raw image directly as input. As shown in Table 2, nmODEs achieve the highest accuracy among all the compared neural differential equations without any CNN layers, confirming the effectiveness of nmODEs.

Table 2 nmODE comparison to other neuralODEs on the testing set of MNIST dataset

The number of function evaluations (NFE) is a measure of the computational cost of an optimization algorithm, which directly affects the time and resources required to find an optimal solution. As shown in Table 2, nmODEs achieve the highest accuracy with the lowest NFE among all the compared neural differential equations, which reflects the computational efficiency of nmODEs.

5.1.2 Classification results of \(\epsilon\)-net

We further conduct experiments on the MNIST dataset using the proposed \(\epsilon\)-net. Ten \(\epsilon\)-nets are independently trained by the \(\epsilon\)-LA algorithm. The accuracy of each network and the voting result on the testing set are summarized in Table 3. As can be seen in the table, the majority of the \(\epsilon\)-nets achieve competitive accuracy of over 98.4%, with the exception of the fourth model at 98.39%. By voting among the ten models, the accuracy is raised to 98.85%, confirming the effectiveness of the voting mechanism.

Table 3 Accuracy of the \(\epsilon\)-net on the testing set of MNIST dataset

5.1.3 Visualization of attractors

We have also created a visualization of the attractors for the MNIST dataset, which can be seen in Fig. 11. To produce this visualization, we used a network consisting of two nmODEs, with 2048 and 64 memory neurons, respectively. The attractors generated by the second nmODE were reshaped to \(8\times 8\) to make them suitable for visualization. As depicted in Fig. 11, the attractors belonging to the same class are very similar, while those belonging to different classes are markedly distinct. These results demonstrate that the proposed nmODE has effectively captured the essential characteristics of the ten classes in the MNIST dataset.

Fig. 11

Visualization of attractors for the MNIST dataset. The attractors belonging to the same class are very similar, while those belonging to different classes are markedly distinct
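
As a rough illustration of how such a picture can be produced, the sketch below approximates the attractor \(y^{*}\) of an nmODE for a given input by iterating the \(\epsilon\)-discretization to a steady state and reshaping it to \(8\times 8\); the random weights, fixed step size, and iteration count are placeholders rather than the trained network used for Fig. 11.

```python
import numpy as np
import matplotlib.pyplot as plt

def nmode_attractor(x, W1, b, eps=0.1, n_steps=2000):
    # Approximate the attractor of the nmODE for input x by iterating the
    # epsilon-discretization (a simple fixed-step scheme) until it settles.
    y = np.zeros(W1.shape[0])
    for _ in range(n_steps):
        y = (1 - eps) * y + eps * np.sin(y + W1 @ x + b) ** 2
    return y

# Placeholder second-stage nmODE with 64 memory neurons and a random 2048-d input.
rng = np.random.default_rng(0)
W1, b = 0.1 * rng.standard_normal((64, 2048)), np.zeros(64)
y_star = nmode_attractor(rng.random(2048), W1, b)
plt.imshow(y_star.reshape(8, 8), cmap="gray")
plt.title("attractor reshaped to 8 x 8")
plt.show()
```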

In addition to the attractor visualization, we have also included t-SNE results for the MNIST dataset in Fig. 12.

Fig. 12

t-SNE visualization of the MNIST dataset

It is evident that samples belonging to the same class are closely clustered together, while those belonging to different classes are clearly separated.

5.2 Experiment on CIFAR-10 data

We evaluated the proposed nmODE on the CIFAR-10 dataset, which is widely used for computer vision tasks. It comprises 60,000 \(32 \times 32\) color images in 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Each class consists of 6,000 images, of which 5,000 are used for training and 1,000 for testing. The dataset is designed to evaluate image classification algorithms, with the aim of accurately assigning each image to one of the ten classes.

For the classification task on the CIFAR-10 dataset, previous neural differential equation models such as ANODEs (Dupont et al. 2019) use CNN blocks in their architectures to extract features from the \(32 \times 32 \times 3\) RGB images. In contrast, the proposed nmODEs directly flatten each RGB image into a 3072-dimensional vector. Table 4 shows that nmODEs outperform ANODEs, achieving \(70.73\%\) accuracy without the use of any CNN blocks.

Table 4 nmODE comparison to ANODE on the testing set of CIFAR-10 dataset

To further evaluate the proposed nmODE, we conducted experiments on the CIFAR-10 dataset using both the nmODE and ViT-B/16 (Dosovitskiy et al. 2021). The network architecture comprised two parts: an ImageNet-pretrained ViT-B/16 that downscales each \(32 \times 32 \times 3\) image to a 512-dimensional feature vector, and the proposed nmODE. As shown in Table 5, nmODEs outperform ViT-B/16 by achieving \(98.73\%\) accuracy on the CIFAR-10 dataset.

Table 5 nmODE comparison to ViT-B/16 on the testing set of CIFAR-10 dataset

5.3 Experiment on breast cancer classification

We evaluated the proposed nmODE and \(\epsilon\)-net on a large-scale dataset for breast cancer classification. The task was to classify ultrasound images into the medical assessment categories of the Breast Imaging-Reporting and Data System (BI-RADS): B1, B2, B3, B4a, B4b, B4c, and B5. The dataset consisted of a large number of training and testing images, as summarized in Table 6. We trained ten nmODEs and \(\epsilon\)-nets independently, each with a different number of hidden neurons.

The network architecture consisted of two parts. The first part was an ImageNet-pretrained EfficientNet-B3 (Tan and Le 2019) that downscaled each \(256 \times 256 \times 3\) image to a 1536-dimensional feature vector. The second part was either the proposed nmODE or the \(\epsilon\)-net.
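
A minimal PyTorch sketch of this two-part architecture is given below, using torchvision's EfficientNet-B3 as the feature extractor and an \(\epsilon\)-net-style head; the hidden size, number of iterations, and linear readout are assumptions made for illustration, and the same pattern applies to the ViT-B/16 setup of Section 5.2.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b3, EfficientNet_B3_Weights

class EpsilonNetHead(nn.Module):
    # Iterates y^{l+1} = (1 - eps) y^l + eps sin^2(y^l + W^(1) x + b) for a fixed
    # number of steps, then applies an assumed linear readout to produce logits.
    def __init__(self, in_dim=1536, hidden=256, n_classes=7, eps=0.1, n_steps=50):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hidden)       # realizes W^(1) x + b
        self.readout = nn.Linear(hidden, n_classes)
        self.eps, self.n_steps = eps, n_steps

    def forward(self, x):
        gamma = self.W1(x)                        # perception input to the memory neurons
        y = torch.zeros_like(gamma)
        for _ in range(self.n_steps):
            y = (1 - self.eps) * y + self.eps * torch.sin(y + gamma) ** 2
        return self.readout(y)

backbone = efficientnet_b3(weights=EfficientNet_B3_Weights.IMAGENET1K_V1)
backbone.classifier = nn.Identity()               # expose the 1536-d feature vector
model = nn.Sequential(backbone, EpsilonNetHead())
logits = model(torch.rand(2, 3, 256, 256))        # BI-RADS logits for two images
```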

The numbers of ultrasound images in the training and testing sets are summarized in Table 6.

Table 6 Number of ultrasound images in the training and testing sets

5.3.1 Classification results

The experimental results for the nmODE and \(\epsilon\)-net are presented in Table 7. As shown in the table, the nmODE achieved an accuracy of \(83.42\%\), and the \(\epsilon\)-net achieved an accuracy of \(80.97\%\). All models had fewer than 1M parameters and performed well, approaching human diagnostic ability.

Table 7 Accuracies of the \(\epsilon\)-net and nmODE on the testing set of Ultrasound dataset

5.3.2 Visualization of attractors

We also visualized the attractors of the \(\epsilon\)-net for the Ultrasound dataset, as depicted in Fig. 13.

Fig. 13

Visualization of attractors for the Ultrasound dataset. Each row represents ten samples belonging to the same class

The network used for visualization consisted of a MobileNetV3 backbone and the proposed \(\epsilon\)-net with 256 hidden neurons. Each attractor \(y^{*}\) was reshaped to \(16\times 16\) for visualization. From the visualization results, it is clear that the attractors for the same class are quite similar, while those belonging to different classes are distinct.

In addition to the visualization of attractors, we also included t-SNE results for the Ultrasound dataset in Fig. 14.

Fig. 14

t-SNE visualization of the Ultrasound dataset

These visualization results demonstrate that the proposed \(\epsilon\)-net has learned the essential characteristics of BI-RADS categories.

5.4 Experiment on sequence learning

In this subsection, we evaluated the proposed \(\epsilon\)-net and \(\epsilon\)-LA on an unlimited-length sequence learning task. The \(\epsilon\)-LA has the advantage of not requiring any prior knowledge of the number of network layers, which allows it to learn sequences of unlimited length. It should be noted that, to enable the \(\epsilon\)-LA to learn sequences, the input x to the \(\epsilon\)-net must vary with l, i.e., \(x=x^{l}\), where \(x^{l}\) is the l-th element of the input sequence. The \(\epsilon\)-net can then be rewritten as:

$$\begin{aligned} y^{l+1} = (1 - \epsilon ) \cdot y^{l} + \epsilon \cdot \sin ^{2}\left[ y^{l} + W^{(1)} x^{l} + b \right] . \end{aligned}$$
(34)

It can be easily shown, following Subsection 4.2, that the \(\epsilon\)-LA still holds in this situation.
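
A minimal sketch of this sequence form is given below, using the 7-dimensional one-hot encoding described later in this subsection; the random weights and the example string are placeholders and are not meant to be a grammatical ERG sample.

```python
import numpy as np

SYMBOLS = ["B", "T", "P", "S", "V", "X", "E"]                 # ERG alphabet
ONE_HOT = {s: np.eye(len(SYMBOLS))[i] for i, s in enumerate(SYMBOLS)}

def epsilon_net_sequence(sequence, W1, b, eps=0.1):
    # Runs Eq. (34): at step l the input x^l is the one-hot code of the l-th symbol.
    y = np.zeros(W1.shape[0])
    states = []
    for sym in sequence:
        y = (1 - eps) * y + eps * np.sin(y + W1 @ ONE_HOT[sym] + b) ** 2
        states.append(y.copy())
    return np.stack(states)

# Toy usage with randomly initialized weights and a short (not necessarily valid) string.
rng = np.random.default_rng(0)
W1, b = 0.1 * rng.standard_normal((16, 7)), np.zeros(16)
hidden_states = epsilon_net_sequence("BTSSXXTVVE", W1, b)
```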

In this study, we used a toy sequence learning benchmark, the Embedded Reber Grammar (ERG) (Hochreiter and Schmidhuber 1997; Wang et al. 2018). The ERG is a sequence generator that follows the grammar rules illustrated in Fig. 15.

Fig. 15

The Embedded Reber Grammar is represented as a state machine, where each node has branching paths with a state transition probability of 0.5. The generated sequence is composed of characters from the set \(\{\)“B”, “T”, “P”, “S”, “V”, “X”, “E”\(\}\)

The \(\epsilon\)-net faced two challenges in learning the ERG-generated sequences. First, a sequence generated by the ERG may be of unlimited length due to the presence of inner loops in the ERG graph. Second, the prediction of the later "T" or "P" (shown in red in Fig. 15) depends on a symbol observed much earlier in the sequence, which requires the ability to learn long-term memory.

Dataset and implementation: In this experiment, 10,000 distinct sequences were randomly generated using the ERG and divided into two sets, with \(70\%\) used for training and the remaining \(30\%\) for testing. Each symbol in a generated sequence was represented by a 7-dimensional one-hot vector. The batch size was set to 64, and the learning rate was set to 0.001.

Experimental results: Various recurrent neural networks, such as LSTM (Hochreiter and Schmidhuber 1997), AMU-RNN (Wang et al. 2018), and SRNN (He et al. 2022), have been applied to this simple sequence learning task. Notably, the \(\epsilon\)-net achieved \(100\%\) accuracy with only 240 trainable parameters, whereas LSTM, AMU-RNN, and SRNN required 264, 883, and 675 parameters, respectively. These findings demonstrate that the nmODE possesses long-term memory ability.

6 Discussions and conclusions

A theory of memory-based neuralODEs for artificial neural networks has been proposed in this paper. It comprises two network models and two learning algorithms. A common problem of using neuralODEs in practical applications is that the computation is very time-consuming, and sometimes even infeasible when the neuralODEs are high-dimensional. Existing neuralODEs usually use CNN blocks to transform high-dimensional data into low-dimensional data before feeding it into the neuralODEs. The proposed nmODE, \(\epsilon\)-net, nmLA, and \(\epsilon\)-LA are built on the blocks of the neuronODE (one-dimensional), invODE (three-dimensional), neuronDTE (one-dimensional), and neuronADE (two-dimensional), respectively. Since all of these blocks are low-dimensional, they can be easily computed. This paper presents all computations in serial form for clarity; nevertheless, all of them can be reformulated for parallel computing. Experiments on classification tasks as well as a simple sequence learning task have been carried out to demonstrate the excellent performance of the proposed theory. The methods of this paper could be further developed to solve other problems in artificial intelligence, such as segmentation and clustering.

The theory presented in this paper could be further generalized in several aspects. Some of the possible generalizations are listed as follows:

  1.

    The neuronODE could be generalized to have higher nonlinearity; for example, one can define the neuronODE as

    $$\begin{aligned} {\dot{p}} = -p + f_{k}\left[ p + f_{k-1} \left[ p + \cdots + f_{1} \left[ p + \gamma \right] \cdots \right] \right] \end{aligned}$$
    (35)

    for \(t \ge 0\) and \(k \ge 1\). An example for \(k=2\) could be

    $$\begin{aligned} {\dot{p}} = -p + \sin ^{2}\left[ p + \cos \left[ p + \gamma \right] \right] . \end{aligned}$$
    (36)

    The newly defined neuronODE can be used as a building block to construct a more general nmODE:

    $$\begin{aligned} {\dot{y}} = - y + f_{k}\left[ y + f_{k-1} \left[ y + \cdots + f_{1} \left[ y + W^{(1)} x + b \right] \cdots \right] \right] . \end{aligned}$$
    (37)

    We denote this model as nmODE\(^{k}\). If \(k=1\) and \(f_{1}(\cdot ) = \sin ^{2}(\cdot )\), it reduces to the nmODE. (A minimal code sketch of nmODE\(^{k}\) is given after this list.)

  2.

    The neuronODE could be used to replace neurons in traditional feedforward neural networks or recurrent neural networks to build more powerful artificial neural networks. An example of a two-dimensional generalization is

    $$\begin{aligned} \left\{ \begin{array}{llll} {\dot{y}}_{1} = -y_{1} + f\left( {\tilde{w}}_{11} y_{1} + {\tilde{w}}_{12}y_{2} + b_{1} \right) \\ \\ {\dot{y}}_{2} = -y_{2} + f\left( {\tilde{w}}_{21} y_{1} + {\tilde{w}}_{22}y_{2} + b_{2} \right) \\ \\ y_{1}(0) = p_{1}(t)\\ \\ y_{2}(0) = p_{2}(t) \end{array} \right. \end{aligned}$$

    together with

    $$\begin{aligned} \left\{ \begin{array}{ll} {\dot{p}}_{1} = -p_{1} + g\left( p_{1} + W_{1}^{T} \cdot x \right) \\ \\ {\dot{p}}_{2} = -p_{2} + g\left( p_{2} + W_{2}^{T} \cdot x \right) \end{array} \right. \end{aligned}$$

    where x represents the external input, and f and g are activation functions. Associated learning algorithms could be developed accordingly.

  3.

    The nmODE takes data through the external input x, which is not suitable for heterogeneous data. If there are two heterogeneous data sets, both x and y(0) could be used as data inputs.

  4.

    The nmODE could be generalized by stacking nmODEs hierarchically to build networks with stronger representation capabilities. In this paper, we use the \(\sin ^{2}(\cdot )\) function as the activation function for the nmODE, but other activation functions can also be used. When building a network by stacking nmODEs, it is recommended to use different activation functions for different nmODEs.

  5.

    The \(\epsilon\)-LA learning algorithm is presented in discrete-time equations and does not require prior knowledge of the number of network layers. It could be generalized to an online learning algorithm in continuous form by using ODEs.

  6.

    The neuronODE proposed in this paper can be generalized to a spike neuron dynamical model, which can be used to build novel spiking neural network models.

  7.

    The neuronODE and the invODE are small ODEs with simple structures; they could be implemented with electronic circuits to speed up network training.

  8.

    The nmODE possesses the capability to preserve long-term memory and could be leveraged for pre-training neural network models.
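
As referenced in item 1 above, a minimal NumPy sketch of the nmODE\(^{k}\) right-hand side (37) is given below; the forward-Euler integration, step size, and random weights are illustrative assumptions.

```python
import numpy as np

def nmode_k_rhs(y, x, W1, b, activations):
    # Right-hand side of Eq. (37): f_1, ..., f_k are nested from the inside out,
    # each level receiving y plus the output of the previous level.
    inner = W1 @ x + b
    for f in activations:
        inner = f(y + inner)
    return -y + inner

# Example with k = 2 matching Eq. (36): f_1 = cos, f_2 = sin^2.
rng = np.random.default_rng(0)
W1, b = 0.1 * rng.standard_normal((8, 4)), np.zeros(8)
y, x = np.zeros(8), rng.random(4)
fs = [np.cos, lambda u: np.sin(u) ** 2]
for _ in range(1000):                       # crude forward-Euler integration of dy/dt
    y = y + 0.01 * nmode_k_rhs(y, x, W1, b, fs)
```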

Many more research topics that could further generalize the theory presented in this paper can be listed. In addition, the mathematical model of a neuron defined in this paper generalizes the traditional neuron model from a simple additive function to a dynamical system. The proposed neuronODE could be integrated into many well-known deep learning architectures, such as GANs, reinforcement learning networks, and transformers, with the expectation of further improving their performance. It seems that we have opened a door for studying the nmODE. We believe that the theory presented in this paper will find more successful practical applications in the future.