1 Introduction

The human brain contains a vast number of interconnected neurons, which establish a neural network within the brain. Intelligence is believed to originate from this neural network. Electrical activity, interpreted as information, flows through this neural network and forms a dynamical system. Memories are regarded as attractors of the dynamical system of neural activity (Poucet and Save 2005; Wills et al. 2005). Mathematically, ordinary differential equations (ODEs) can be used to describe dynamical systems; thus, brain neural networks can, to some extent, be modeled by ODEs. The use of ODEs to model neural networks has a long history. For example, the famous Hodgkin-Huxley model (Hodgkin and Huxley 1952), a model of the squid giant axon, appeared in 1952. Many research works on using ODEs to study neural networks have been reported; see, for example, Cohen and Grossberg (1983); Grossberg (2013); Hopfield (1984); Zhang et al. (2018); Yi et al. (2003); Yi and Tan (2004); Yi et al. (2009); Yi (2010); Liu et al. (2017).

In recent years, deep learning has achieved great success in many practical applications, and various deep learning models have been reported. One of the well-known deep learning models is ResNet (He et al. 2016). Recent studies have shown that ResNet can be interpreted as a discrete version of certain ODEs (Haber and Ruthotto 2017). Since then, increasing interest has been devoted to using ODEs to study problems arising from deep learning. In Chen et al. (2018), the concept of the neuralODE was proposed to describe a continuous version of ResNet by an ODE, and it was shown that the neuralODE has several benefits such as memory efficiency, adaptive computation, and parameter efficiency. In our view, any ODE used to model a neural network can be referred to as a neuralODE. In Chang et al. (2018), a neuralODE model with stable architectures was proposed to solve the well-posed learning problem for arbitrarily deep networks by ensuring the stability of the neuralODE. In Chang et al. (2019), a neuralODE model with an antisymmetric structure was proposed and shown to exhibit much more predictable dynamics. In Dupont et al. (2019), the authors proved that neuralODEs using the initial value as the data input can learn only features homeomorphic to the input data space, and some simple functions were provided to show this limitation of neuralODEs. To address this problem, augmented neuralODEs were proposed by adding new augmenting dimensions to the networks (Dupont et al. 2019). However, the augmentation requires additional computation. More recent research on neuralODEs can be found in Zhang et al. (2019); Kidger (2021); Chen et al. (2021).

In general, neuralODEs in their generic form are not particularly useful for applications. Carefully designed neuralODEs with simple structures and clear dynamics are always desirable from an application point of view. In the field of artificial neural networks, an important task is to use neural networks to represent complex nonlinear mappings. In the past decades, various artificial neural networks with different representation abilities have been proposed. Studies show that the representation ability of one-layer neural networks is limited. For example, a one-layer neural network using a nondecreasing activation function, such as the sigmoid or ReLU function, cannot represent the well-known XOR mapping, which means it cannot solve the XOR nonlinear classification problem. This is because the nonlinearity of one-layer networks is insufficient. To increase the representation ability of neural networks, their nonlinearity must be enhanced. One way to achieve this is by adding layers, i.e., using multiple layers to build networks. However, a drawback of this method is that the networks may contain a large number of parameters.

In this paper, we propose two methods to enhance the nonlinearity of artificial neural networks. The first method involves using the \(\sin ^{2}(\cdot )\) function as an alternative activation function, rather than nondecreasing functions. In a previous study (Sitzmann et al. 2020), the \(\sin (\cdot )\) function was suggested as an activation function for neural networks, and the authors showed several advantages of using this function. The \(\sin ^{2}(\cdot )\) function inherits some of the advantages of the \(\sin (\cdot )\) function and, in addition, has even higher nonlinearity. Therefore, neural networks using this activation function can improve their nonlinearity more efficiently.

The second method involves the development of an implicit mapping equation by modifying the model of one-layer networks in an interesting way, resulting in a nonlinear mapping with higher nonlinearity. We will demonstrate that for each input, the implicit mapping equation has a unique solution, thereby establishing a mapping from input to its associated unique solution.

Directly solving the implicit mapping equation is generally impossible, so this paper proposes an ODE to solve it indirectly. The proposed ODE has an important property: it has a global attractor for each external input. This attractor is exactly the unique solution of the above-mentioned implicit mapping equation. Regarding the attractor as a memory, the proposed ODE can be viewed as a memory-based artificial neural network, which we simply call the nmODE. The nmODE is designed with a particularly simple structure, which separates learning neurons from memory neurons, and is thus endowed with quite clear dynamics. Moreover, unlike most existing neuralODEs that use the ODE’s initial value as the data input, we use an external input variable in the nmODE to input data. Thus, the nmODE does not suffer from the restriction of learning only features homeomorphic to the input data space, which occurs frequently in existing neuralODEs (Dupont et al. 2019). Given an external input, the nmODE maps the input to its global attractor; thus, the nmODE establishes a mapping from the input data space to the memory space.

The nmODE can be viewed as a mathematical model of columns in the neocortex. Recent research in theoretical neuroscience suggests that a column may function as a unit of intelligence, and that there may be a common algorithm for all columns (Hawkins and Dawkins 2021). A column contains many neurons belonging to different classes. Since memory plays a crucial role in intelligence, we propose a hypothesis that there is a class of neurons for memory within a column, which we call memory neurons. Furthermore, we propose that memory neurons share a common function, which we can implement by defining a mathematical model for each neuron using a one-dimensional ODE, called neuronODE. The dynamical properties of the neuronODE can be considered the common function of memory neurons. The nmODE can be seen as built upon the neuronODE block.

A learning algorithm called nmLA is developed for the nmODE, which involves an interesting three-dimensional inverse ordinary differential equation called invODE, used for computing gradients of the cost function. The invODE has a low dimension of only three, making it easy to solve. This is unlike most existing neuralODEs that use high-dimensional inverse ODEs, which are inefficient for computing. Moreover, the nmLA does not suffer from the gradient vanishing problem, which occurs frequently in traditional neural networks.

Both the nmODE and the nmLA are described by continuous equations, which are not directly suitable for digital computation. In this paper, we propose a discrete version of the nmODE called the \(\epsilon\)-net, along with a corresponding learning algorithm called \(\epsilon\)-LA. Both the \(\epsilon\)-net and \(\epsilon\)-LA are designed for easy digital computation. We propose the neuronDTE, a discrete-time mathematical model of the neuron used to build the \(\epsilon\)-net, and the neuronADE, a two-dimensional equation that facilitates computing gradients in the \(\epsilon\)-LA algorithm. These two constructs make the computation much more efficient.

The main contributions of this paper can be summarized as follows:

  1. We propose an interesting neural network, called the nmODE, which has clear dynamics and the global attractor property. It incorporates a memory mechanism using attractors in the network, establishing a relationship between external input and memory. The structure of the nmODE is unique in that it separates learning from memory: each learning connection exists only from an input neuron to a memory neuron, and no learning connections exist between memory neurons. We also propose a mathematical model of the neuron, called the neuronODE, upon which the nmODE is built.

  2. An efficient learning algorithm called nmLA is developed, which uses an interesting three-dimensional ODE called invODE to compute gradients. This approach avoids the gradient vanishing problem and enables efficient computation.

  3. We propose a discrete version of the nmODE, called the \(\epsilon\)-net, which has several desirable dynamical properties, including the global attractor property. We define a mathematical model of the neuron described by a discrete-time equation, called the neuronDTE, which can be used as a building block for the \(\epsilon\)-net.

  4. A learning algorithm called \(\epsilon\)-LA is proposed for the \(\epsilon\)-net. Unlike most feedforward neural networks, the \(\epsilon\)-LA does not require the number of network layers to be predetermined. By proposing a two-dimensional equation, called the neuronADE, to facilitate gradient computation, \(\epsilon\)-LA achieves high efficiency. Moreover, being a forward computing algorithm, \(\epsilon\)-LA does not suffer from the gradient vanishing problem.

The organization of the rest of this paper is as follows: Sect. 2 presents the terminologies and problem formulation. Section 3 proposes the nmODE and nmLA, starting with an analysis of the representation ability of a one-layer network to construct an implicit mapping equation, which leads to the development of the nmODE. The learning algorithm, nmLA, is also presented in this section, along with a brief review of nmODE from a biological point of view. Section 4 presents the \(\epsilon\)-net and \(\epsilon\)-LA. Section 5 details the experiments conducted. Finally, Sect. 6 concludes with discussions and conclusions.

2 Terminologies and problem formulation

Biological neural networks are organized through the interconnection of neurons, allowing information to flow and form dynamical systems. ODEs are commonly used to describe dynamical systems and can therefore be used to model neural networks. An ODE is referred to as a neural ordinary differential equation (neuralODE) when it is used to describe the dynamics of a neural network. All the ODEs discussed in this paper are considered neuralODEs.

A general neuralODE can be described as

$$\begin{aligned} {\dot{y}} = f(y,x) \end{aligned}$$
(1)

for \(t \ge 0\), where \(y \in R^{n}\) denotes the state of the network, \(x \in R^{m}\) represents the external input variable, which we use for data input, and the function f is continuous and satisfies a Lipschitz condition so that the existence and uniqueness of solutions are guaranteed. Given an initial value y(0), the trajectory of equation (1) starting from y(0) is denoted by y(t, y(0)), or simply y(t), for all \(t \ge 0\).

A set \(S \subset R^{n}\) is called an invariant set of equation (1), if for any \(y(0) \in S\), the trajectory y(t) starting from y(0) remains in S forever.

A vector \(y^{*} \in R^{n}\) is called an equilibrium point of the neuralODE if it satisfies

$$\begin{aligned} f(y^{*}, x) = 0. \end{aligned}$$

It is known that in biological neural networks, memories are encoded as attractors in the dynamics of the networks (Poucet and Save 2005; Wills et al. 2005). A suitably designed neuralODE can likewise store memories as attractors. Let us first define the concept of an attractor for the general neuralODE. There are several notions of attractor; among them, the global attractor has particularly nice properties. A global attractor is, first, an equilibrium point and, in addition, it attracts all trajectories. Formally, an equilibrium point \(y^{*}\) is called a global attractor if for any y(0), the associated trajectory y(t, y(0)) converges to \(y^{*}\) as \(t \rightarrow \infty\). Suppose that the neuralODE has the global attractor property; then the neuralODE establishes a well-defined nonlinear mapping from the input x to the associated attractor \(y^{*}\) as

$$\begin{aligned} \text{ neuralODE }: x \longrightarrow y^{*}. \end{aligned}$$

Figure 1 shows this mapping.

Fig. 1 A nonlinear mapping established by a neuralODE. It maps the input x to the attractor \(y^{*}\). The neuralODE is required to possess the global attractor property so that the mapping can be well established

Not every neuralODE has the global attractor property; designing a neuralODE that possesses the global attractor property together with a simple structure and clear dynamics is desirable.

There are two approaches for inputting data into neuralODEs. One approach is to keep the external input x fixed and use the initial value y(0) as the network input, while the other is to keep the initial value y(0) fixed and use the external input x as the network input. Most existing neuralODEs adopt the first approach. In this paper, we use the second approach for data input. There are fundamental differences between these two approaches. When the initial value y(0) is used as the data input, neuralODEs have a significant drawback: they can only learn representations that preserve the topology of the input data space. In fact, there are simple functions that such neuralODEs cannot represent (Dupont et al. 2019). To address this issue, augmented neuralODEs were proposed in Dupont et al. (2019), but they require additional computation. On the other hand, when x is used as the data input, this problem does not arise. To illustrate this further, consider the simple nonlinear function defined by:

$$\begin{aligned} \left\{ \begin{array}{ll} g(-1) = 1\\ g(1) = -1 \end{array} \right. \end{aligned}$$
(2)

In Dupont et al. (2019), it was proven that there is no ODE that can represent the function g if the initial y(0) is used as the data input. However, if we use x as the data input, this problem can be easily solved. The simple one-dimensional ODE

$$\begin{aligned} {\dot{y}} = -y - x \end{aligned}$$
(3)

can perfectly represent the function g. This can be verified by solving the ODE and obtaining the solution:

$$\begin{aligned} y(t) = \left[ y(0) + x\right] e^{-t} - x \end{aligned}$$

for \(t \ge 0\).
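As a quick numerical illustration (a minimal sketch in plain Python; the initial value, step size, and time horizon are arbitrary choices), integrating equation (3) with a simple Euler scheme shows the state converging to \(-x\), so the inputs \(x=-1\) and \(x=1\) are mapped to 1 and \(-1\), respectively:

```python
def solve_eq3(x, y0=0.0, dt=0.01, T=10.0):
    """Euler-integrate dy/dt = -y - x and return the state at time T."""
    y = y0
    for _ in range(int(T / dt)):
        y += dt * (-y - x)
    return y

for x in (-1.0, 1.0):
    print(x, "->", round(solve_eq3(x), 4))   # -1.0 -> 1.0, 1.0 -> -1.0
```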

3 Neural memory ordinary differential equations

Before introducing the nmODE model, let us first consider the following two motivations: rethinking the one-layer network and utilizing implicit mapping functions.

3.1 On the one-layer network

The well-known model of a one-layer neural network can be expressed as

$$\begin{aligned} y = f\left[ W^{(1)} x + b \right] , \end{aligned}$$
(4)

where \(W^{(1)}\) is an \(n\times m\) matrix, x is the input vector with m dimensions, b is the bias vector with n dimensions, and y is the output vector with n dimensions. The activation function f is traditionally a nondecreasing function, such as the sigmoid function, the hyperbolic tangent function, or the ReLU function, among others. The network (4) defines a mapping from the input x to the output y, as shown in Fig. 2.

Fig. 2 The one-layer neural network defines a mapping. It maps the input x to the output y

It is well-known that the one-layer neural network can be applied successfully to solve linear classification problems. However, it cannot solve nonlinear classification problems with a nondecreasing activation function. A classic example is the XOR problem, where the goal is to classify four points:

$$\begin{aligned} \left\{ \left[ \begin{array}{ll} 0\\ 1 \end{array} \right] , \left[ \begin{array}{ll} 1\\ 0 \end{array} \right] , \left[ \begin{array}{ll} 0\\ 0 \end{array} \right] , \left[ \begin{array}{ll} 1\\ 1 \end{array} \right] \right\} \end{aligned}$$

on a plane into two classes: the first two points in one class and the remaining two in another class, as illustrated in Fig. 3.

Fig. 3 The XOR classification problem. The task is to classify the circle points into one class and the cross points into another one

The one-layer network cannot solve this classification problem because its nonlinearity is not sufficiently high. In fact, the classification ability of the one-layer network is determined by the linear decision surface

$$\begin{aligned} W^{(1)} x + b = 0. \end{aligned}$$

If f is a nondecreasing function, the mapping of the one-layer neural network is topologically linear; thus, it cannot solve nonlinear classification problems such as the XOR problem.

Increasing the nonlinearity of neural networks is crucial for their computational ability. A simple method to increase the nonlinearity of the one-layer network is to remove the nondecreasing constraint on the activation function. An example is to use the \(\sin ^{2}(\cdot )\) function as the activation function, i.e.,

$$\begin{aligned} y = \sin ^{2}\left[ W^{(1)} x + b \right] . \end{aligned}$$
(5)

The nonlinearity of this network is much higher than that of a network using a nondecreasing activation function. A one-layer network with this activation function can easily solve the XOR problem; for example, choosing the parameters of network (5) gives

$$\begin{aligned} y = \sin ^{2}\left[ -\frac{\pi }{2} \left( x_{1}+x_{2}\right) + \frac{\pi }{2} \right] . \end{aligned}$$
(6)

It can be easily checked that the network (6) maps the points \([0,0]^T\) and \([1,1]^T\) to 1, and the points \([0,1]^T\) and \([1,0]^T\) to 0; thus, the XOR classification problem is perfectly solved, see Fig. 4.
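A minimal numerical check of network (6) on the four XOR points (plain Python; nothing beyond equation (6) is assumed):

```python
import math

def xor_net(x1, x2):
    """One-layer network (6): y = sin^2(-pi/2 * (x1 + x2) + pi/2)."""
    return math.sin(-math.pi / 2 * (x1 + x2) + math.pi / 2) ** 2

for point in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(point, "->", round(xor_net(*point), 6))
# (0, 1) -> 0.0, (1, 0) -> 0.0, (0, 0) -> 1.0, (1, 1) -> 1.0
```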

Fig. 4 The one-layer network uses the activation function \(\sin ^{2}(\cdot )\) to solve the XOR classification problem

3.2 An implicit mapping equation

Another approach to increase the nonlinearity of the one-layer network is to use implicit functions. By making a simple modification to the representation of the network, we can obtain an implicit mapping equation of the form:

$$\begin{aligned} y = f\left[ y + W^{(1)} x + b \right] . \end{aligned}$$
(7)

This implicit mapping equation defines a nonlinear mapping F from input x to output y as:

$$\begin{aligned} F: x \rightarrow y, \end{aligned}$$

where F is an implicit function, and it is clear that the nonlinearity of F is higher than that of f.

As discussed in the previous subsection, using the \(\sin ^2(\cdot )\) function as the activation function in the one-layer network can increase its nonlinearity and solve nonlinear classification problems such as the XOR problem. Combining the two methods of increasing nonlinearity described above, we propose a specific implicit mapping equation:

$$\begin{aligned} y = \sin ^{2} \left[ y + W^{(1)}x + b \right] , \end{aligned}$$
(8)

or equivalently in component form as

$$\begin{aligned} y_{i} = \sin ^{2}\left[ y_{i} + \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right] , (i=1, \cdots , n). \end{aligned}$$

We will show that this implicit mapping equation (8) has a unique solution. First, we will prove a lemma.

Lemma 1

The one-dimensional equation

$$\begin{aligned} p = \sin ^{2}(p + \gamma ) \end{aligned}$$
(9)

has one and only one solution \(p^{*}\), where \(\gamma\) is a constant.

Proof

Define a function

$$\begin{aligned} \mu (p) = p - \sin ^{2}(p + \gamma ). \end{aligned}$$

Clearly, \({\dot{\mu }}(p) = 1 - \sin \left[ 2(p + \gamma )\right] \ge 0\), i.e., \(\mu\) is a nondecreasing function, and \({\dot{\mu }}(p)\) vanishes only at isolated points, so \(\mu\) is strictly increasing. Moreover, since \(\sin ^{2}(\cdot ) \in [0, 1]\), any solution of (9) must lie in [0, 1].

Note that \(\mu (0) = -\sin ^{2}(\gamma ) \le 0\) and \(\mu (1) = 1 - \sin ^{2}(1 + \gamma ) \ge 0\). In particular, \(p=1\) is a solution of (9) if and only if \(\gamma = k\pi + \pi /2 - 1\) for some integer k, and \(p=0\) is a solution of (9) if and only if \(\gamma = k\pi\) for some integer k.

Since \(\mu\) is continuous and strictly increasing on [0, 1] with \(\mu (0) \le 0 \le \mu (1)\), there must exist one and only one \(p^{*}\) such that \(\mu (p^{*}) = 0\).

The above statements prove that the equation (9) has one and only one solution \(p^{*}\). The proof is complete. \(\square\)
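Because \(\mu\) is continuous and strictly increasing on [0, 1] with a sign change, the unique root can also be located numerically by bisection. The following is a minimal sketch (the tolerance and the sample values of \(\gamma\) are arbitrary choices):

```python
import math

def fixed_point(gamma, tol=1e-12):
    """Bisection for the unique root of mu(p) = p - sin^2(p + gamma) on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - math.sin(mid + gamma) ** 2 > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for gamma in (0.0, 0.7, math.pi / 2 - 1.0):
    p = fixed_point(gamma)
    print(gamma, p, math.sin(p + gamma) ** 2)   # p and sin^2(p + gamma) agree up to the tolerance
```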

Property 1

Given any x, the implicit mapping equation (8) has one and only one solution.

Proof

By Lemma 1, for each i, there exists one and only one \(y^{*}_{i}\) such that

$$\begin{aligned} y^{*}_{i} = \sin ^{2}\left[ y^{*}_{i} + \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right] . \end{aligned}$$

Denote \(y^{*} = (y_{1}^{*}, \cdots , y_{n}^{*})^{T}\); clearly, \(y^{*}\) is the unique solution of the implicit equation (8). The proof is complete. \(\square\)

The implicit mapping equation clearly defines an implicit function from x to y. Directly solving this equation to establish a mapping from the input x to the output y is generally impossible. Next, we propose an ODE to solve this problem.

3.3 The proposed nmODE

The implicit mapping equation can be viewed as an equilibrium equation of an ordinary differential equation. To solve this equilibrium equation, an ODE, named nmODE (neural memory Ordinary Differential Equation), is proposed in this paper as follows:

$$\begin{aligned} {\dot{y}} = - y + \sin ^{2} \left[ y + W^{(1)} x + b \right] \end{aligned}$$
(10)

or in component form

$$\begin{aligned} {\dot{y}}_{i} = - y_{i} + \sin ^{2}\left[ y_{i} + \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right] , (i=1, \cdots , n) \end{aligned}$$
(11)

for \(t \ge 0\), where \(y \in R^{n}\), \(x \in R^{m}\), \(b \in R^{n}\), \(W^{(1)}=\left( w^{(1)}_{ij}\right) _{n \times m} \in R^{n \times m}\). From the neural networks point of view, x represents external input to the network, y(t) denotes the state of the network at time t, and y(0) is the initial value. Each neuron used to represent the external input variable x is called an input neuron, while any neuron used to represent the state y is called a memory neuron.

Is there anything special about the nmODE? First, the nmODE is a decoupled system with respect to the memory neurons, which makes the mathematical analysis of its dynamics particularly easy. In addition, the nmODE can be solved simply by solving the one-dimensional ODEs for each i in equation (11), since these one-dimensional ODEs are independent of each other. Second, learning connections exist only from input neurons to memory neurons, and there are no learning connections between memory neurons; thus, learning is separated from memory. Such a structure makes the model particularly easy to train. Third, we will show that the nmODE has the global attractor property for each external input. By regarding the attractor as a memory (Poucet and Save 2005; Wills et al. 2005), a memory mechanism is embedded into the network. Fourth, the nmODE can be viewed as a mathematical model of a column in the neocortex. Recent work in theoretical neuroscience suggests that each column in the neocortex may act as a unit of intelligence (Hawkins and Dawkins 2021); if so, a column must have a complete memory mechanism so that it can model the world.
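As an illustration of the global attractor property, the following minimal sketch (NumPy, forward-Euler integration with an arbitrarily chosen step size; the connection weights, bias, and input are arbitrary placeholder values) integrates the nmODE (10) from two different initial states and checks that both trajectories reach the same fixed point, which satisfies the implicit mapping equation (8):

```python
import numpy as np

# placeholder parameters: five memory neurons, one input neuron
W1 = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])   # learning connections (n x m)
b = np.zeros(5)                                      # bias of the memory neurons
x = np.array([1.0])                                  # external input

def nmode_attractor(y0, dt=0.01, T=30.0):
    """Forward-Euler integration of dy/dt = -y + sin^2(y + W1 @ x + b)."""
    y = y0.copy()
    for _ in range(int(T / dt)):
        y += dt * (-y + np.sin(y + W1 @ x + b) ** 2)
    return y

y_a = nmode_attractor(np.zeros(5))
y_b = nmode_attractor(np.full(5, 0.9))
print(np.abs(y_a - y_b).max())                             # ~0: both initial states reach the same attractor
print(np.abs(y_a - np.sin(y_a + W1 @ x + b) ** 2).max())   # ~0: y* satisfies equation (8)
```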

As the nmODE is a decoupled system, we can define a mathematical model for each neuron as

$$\begin{aligned} {\dot{p}} = -p + \sin ^{2} \left( p + \gamma \right) , \end{aligned}$$
(12)

where \(p \in R^{1}\) and \(\gamma\) represents the perception input to the neuron. We call this one-dimensional ODE the neuronODE. Clearly, the nmODE can be viewed as being built from neuronODE blocks. Unlike many traditional mathematical models of neurons, which are described by simple static functions, the neuronODE is described by a dynamical system.

Next, we will show that the proposed nmODE has several interesting dynamical properties.

Property 2

The set

$$\begin{aligned} S = \left\{ \left. y \in R^{n} \right| \Vert y\Vert _{\infty } \le 1 \right\} , \quad \Vert y\Vert _{\infty } = \max _{1 \le i \le n} \vert y_{i}\vert , \end{aligned}$$

is an invariant set of the nmODE.

Proof

Given any initial \(y(0) \in S\), integrating the differential equation gives

$$\begin{aligned} y_{i}(t) = y_{i}(0) e^{-t} + \int ^{t}_{0} e^{-(t-s)} \cdot \sin ^{2} \left[ y_{i}(s) + \sum _{j=1}^{m} w^{(1)}_{ij} x_{j} + b_{i}\right] ds \end{aligned}$$
(13)

then, since \(0 \le \sin ^{2}(\cdot ) \le 1\),

$$\begin{aligned} y_{i}(t)\le & {} y_{i}(0)e^{-t} + \int ^{t}_{0} e^{-(t-s)} ds \end{aligned}$$
(14)
$$\begin{aligned}= & {} 1 - \left( 1 - y_{i}(0) \right) \cdot e^{-t} \end{aligned}$$
(15)
$$\begin{aligned}\le & {} 1 \end{aligned}$$
(16)

for all \(t \ge 0\). Similarly, \(y_{i}(t) \ge y_{i}(0) e^{-t} \ge -1\) for all \(t \ge 0\). Thus, \(y(t) \in S\) for all \(t \ge 0\), i.e., the set S is an invariant set. The proof is complete. \(\square\)

Property 3

The nmODE (10) has one and only one equilibrium point.

Property 3 follows directly from Property 1.

Lemma 2

The neuronODE (12) has one and only one global attractor.

Proof

By Lemma 1, there exists a unique \(p^{*}\) such that

$$\begin{aligned} p^{*} = \sin ^{2}(p^{*} + \gamma ). \end{aligned}$$

Define a differentiable function

$$\begin{aligned} V(t)=\frac{1}{2}\left( p(t)-p^{*}\right) ^{2} \end{aligned}$$

for \(t \ge 0\). Given any initial value p(0), along any trajectory p(t) starting from p(0), it gives that

$$\begin{aligned} {\dot{V}}(t)= & {} -\left( p(t) - p^{*}\right) ^{2} \\{} & {} + \left[ p(t)-p^{*}\right] \cdot \left[ \sin ^{2}(p(t) + \gamma ) - \sin ^{2}(p^{*} + \gamma ) \right] \\= & {} -\left( 1- \sin \left( 2\theta (t)\right) \right) \cdot \left( p(t) - p^{*}\right) ^{2} \\\le & {} -\left( 1-\left| \sin \left( 2\theta (t)\right) \right| \right) \cdot \left( p(t) - p^{*}\right) ^{2} \\\le & {} 0 \end{aligned}$$

where \(\theta (t)\), given by the mean value theorem, lies between \(p(t) + \gamma\) and \(p^{*} + \gamma\), so that \(\left| \sin \left( 2\theta (t)\right) \right| \le 1\). Let \(t_{k} (k=1, 2, \cdots )\) be the times at which \(\vert \sin \left( 2\theta (t_{k})\right) \vert = 1\). Clearly, \(V(t_{k+1}) \le V(t_{k})\), and on each interval \((t_{k}, t_{k+1})\), \(\left| \sin \left( 2\theta (t)\right) \right| < 1\); thus, \({\dot{V}}(t) < 0\) for \(t \in (t_{k}, t_{k+1})\) whenever \(p(t) \ne p^{*}\). Together with \(V(t_{k+1}) \le V(t_{k})\), it yields that \(V(t) \rightarrow 0\) as \(t \rightarrow +\infty\), i.e., \(p(t) \rightarrow p^{*}\). This proves that \(p^{*}\) is the global attractor. The proof is complete. \(\square\)

Property 4

The nmODE (10) has one and only one global attractor.

Proof

By Lemma 2, for each \(i\) \((1 \le i \le n)\), the equation

$$\begin{aligned} {\dot{y}}_{i} = - y_{i} + \sin ^{2}\left[ y_{i} + \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right] \end{aligned}$$

has one and only one \(y_{i}^{*}\) such that \(y_{i}(t) \rightarrow y_{i}^{*}\) as \(t \rightarrow +\infty\). Denoting \(y^{*} = (y_{1}^{*}, \cdots , y_{n}^{*})^{T}\), clearly, \(y^{*}\) is the global attractor of the nmODE. This completes the proof. \(\square\)

With the property that nmODE has one and only one global attractor \(y^{*}\) which satisfies

$$\begin{aligned} y^{*} = \sin ^{2}\left[ y^{*} + W^{(1)} x + b \right] , \end{aligned}$$

the nmODE defines a nonlinear mapping from external input x to the attractor \(y^{*}\). Figure 5 shows an intuitive illustration for this nmODE nonlinear mapping.

Fig. 5 Given any input x, the proposed nmODE has one and only one global attractor \(y^{*}\). Thus, the nmODE defines a well-defined nonlinear mapping from the external input x to the attractor \(y^{*}\)

Next, we will develop a learning algorithm for the proposed nmODE.

3.4 The proposed nmLA

In this subsection, we will develop a learning algorithm so that the nmODE can be used to learn some targets.

To clearly describe the learning algorithm, define

$$\begin{aligned} \left\{ \begin{array}{ll} z_{i}(t) = \displaystyle \sum _{j=1}^{n} w_{ij}^{(2)} y_{j}(t)\\ a_{i}(t) = s\left( z_{i}(t) \right) \end{array} \right. \end{aligned}$$

where \(W^{(2)} = \left( w^{(2)}_{ij}\right) _{r \times n}\) is another set of learning parameters, s is an activation function, say, softmax. We refer to neurons that represent the output a(t) as decision-making neurons. In nmODE, each memory neuron i produces output \(y_i(t)\) independently at each time t, and all memory neurons form a pattern y(t). To give meaning to the pattern, a learning mechanism is necessary to fuse all \(y_i(t)\). This is where the decision-making neurons come in, as they play a crucial role in this process. Figure 6 shows an intuitive illustration.

Fig. 6 Given an input x, the nmODE outputs y(t) at each time t. Neurons representing a(t) play the role of decision making. Learning means adjusting \(W^{(1)}\), \(W^{(2)}\), and b

Suppose that we have a target d associated with the input x, and would like to let the nmODE learn the target. This can be described by developing some learning rules for updating \(w^{(2)}_{ij}\), \(w^{(1)}_{ij}\) and \(b_{i}\) such that

$$\begin{aligned} a({\bar{t}}) \approx d \end{aligned}$$

for some time \({\bar{t}}\). We define a cost function \(J=J(a({\bar{t}}), d)\) to describe the distance between \(a({\bar{t}})\) and d. Next, we derive learning rules for updating \(w^{(2)}_{ij}, w^{(1)}_{ij}\) and \(b_{i}\).

Denote

$$\begin{aligned} \delta _{i}^{z} = \frac{\partial J}{\partial z_{i}({\bar{t}})}, (i=1, \cdots , r). \end{aligned}$$

Each \(\delta _{i}^{z}\) can be computed directly. It holds that

$$\begin{aligned} \frac{\partial J}{\partial w^{(2)}_{ij}} = \frac{\partial J}{\partial z_{i}({\bar{t}})} \cdot \frac{\partial z_{i}({\bar{t}})}{\partial w^{(2)}_{ij}} = \delta _{i}^{z} \cdot y_{j}({\bar{t}}). \end{aligned}$$

By using the gradient method, it gives a learning rule for updating \(w^{(2)}_{ij}\) as

$$\begin{aligned} w^{(2)}_{ij} \longleftarrow w^{(2)}_{ij} - \alpha \cdot \delta _{i}^{z} \cdot y_{j}({\bar{t}}), \end{aligned}$$
(17)

where \(\alpha\) is a learning rate. Next, we derive a learning rule for updating \(w^{(1)}_{ij}\) and \(b_{i}\).

For convenience, denote

$$\begin{aligned} v_{i}(t) = y_{i}(t) + \sum _{j=1}^{m}w^{(1)}_{ij} x_{j} + b_{i}, (i=1, \cdots , n). \end{aligned}$$

Define a Lagrange function:

$$\begin{aligned} L = J + \sum ^{n}_{k=1} \int ^{{\bar{t}}}_{0} \lambda _{k}(t) \cdot \left[ -y_{k}(t) + \sin ^{2}(v_{k}(t)) - {\dot{y}}_{k}(t)\right] dt, \end{aligned}$$
(18)

where \(\lambda _{k}(t)(k=1, \cdots , n)\) are the Lagrange multipliers. It gives that

$$\begin{aligned} \frac{\partial L}{\partial w_{ij}^{(1)}}= & {} \frac{\partial J}{\partial w_{ij}^{(1)}} + \sum ^{n}_{k=1} \int ^{{\bar{t}}}_{0} \frac{\partial \left[ -y_{k}(t) + \sin ^{2}(v_{k}(t)) - {\dot{y}}_{k}(t)\right] }{\partial w_{ij}^{(1)}} \cdot \lambda _{k}(t) dt \\= & {} \frac{\partial y_{i}({\bar{t}})}{\partial w_{ij}^{(1)}} \cdot \frac{\partial J}{\partial y_{i}({\bar{t}})} + \int ^{{\bar{t}}}_{0} \left[ \frac{\partial \left[ -y_{i}(t) + \sin ^{2}(v_{i}(t)) \right] }{\partial y_{i}(t)} \cdot \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \right. \\{} & {} \left. + \frac{\partial \left[ -y_{i}(t) + \sin ^{2}(v_{i}(t)) \right] }{\partial w_{ij}^{(1)}} - \frac{\partial {\dot{y}}_{i}(t)}{\partial w_{ij}^{(1)}} \right] \cdot \lambda _{i}(t) dt \\= & {} \frac{\partial y_{i}({\bar{t}})}{\partial w_{ij}^{(1)}} \cdot \frac{\partial J}{\partial y_{i}({\bar{t}})} + \int ^{{\bar{t}}}_{0} \left[ -1 + \sin 2v_{i}(t)\right] \cdot \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) dt \\{} & {} + \int ^{{\bar{t}}}_{0} \sin 2v_{i}(t) \cdot x_{j} \cdot \lambda _{i}(t) dt - \int ^{{\bar{t}}}_{0} \frac{\partial {\dot{y}}_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) dt. \\ \end{aligned}$$

Since

$$\begin{aligned} \int ^{{\bar{t}}}_{0} \frac{\partial {\dot{y}}_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) dt= & {} \int ^{{\bar{t}}}_{0} \frac{\partial dy(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) \\= & {} \int ^{{\bar{t}}}_{0} d\left[ \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \right] \cdot \lambda _{i}(t) \\= & {} \left. \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}(t) \right| ^{{\bar{t}}}_{0} - \int ^{{\bar{t}}}_{0} \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot d \lambda _{i}(t) \\= & {} \frac{\partial y_{i}({\bar{t}})}{\partial w_{ij}^{(1)}} \cdot \lambda _{i}({\bar{t}}) - \int ^{{\bar{t}}}_{0} \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot {\dot{\lambda }}_{i}(t) dt \\ \end{aligned}$$

and

$$\begin{aligned} \frac{\partial J}{\partial y_{i}({\bar{t}})} = \sum ^{r}_{j=1} \frac{\partial J}{\partial z_{j}({\bar{t}})} \cdot \frac{\partial z_{j}({\bar{t}})}{\partial y_{i}({\bar{t}})} = \sum ^{r}_{j=1} \delta _{j}^{z} \cdot w_{ji}^{(2)} \end{aligned}$$

then,

$$\begin{aligned} \frac{\partial L}{\partial w_{ij}^{(1)}}= & {} \frac{\partial y_{i}({\bar{t}})}{\partial w_{ij}^{(1)}} \cdot \left[ \sum ^{r}_{j=1} \delta _{j}^{z} \cdot w_{ji}^{(2)} - \lambda _{i}({\bar{t}}) \right] \\{} & {} + \int ^{{\bar{t}}}_{0} \frac{\partial y_{i}(t)}{\partial w_{ij}^{(1)}} \cdot \left[ \left( -1 + \sin 2v_{i}(t)\right) \cdot \lambda _{i}(t) + {\dot{\lambda }}_{i}(t) \right] dt \\{} & {} + x_{j} \int ^{{\bar{t}}}_{0} \lambda _{i}(t) \cdot \sin 2v_{i}(t) dt. \end{aligned}$$

Since \(\lambda _{k}(t)(k=1, \cdots , n)\) are the Lagrange multipliers, we can take

$$\begin{aligned} \lambda _{i}({\bar{t}}) = \sum ^{r}_{j=1} \delta _{j}^{z} \cdot w_{ji}^{(2)} \end{aligned}$$

and

$$\begin{aligned} {\dot{\lambda }}_{i}(t) = \left[ 1 - \sin 2v_{i}(t)\right] \cdot \lambda _{i}(t) \end{aligned}$$

then, we have

$$\begin{aligned} \frac{\partial L}{\partial w_{ij}^{(1)}} = x_{j} \int ^{{\bar{t}}}_{0} \sin 2v_{i}(t) \cdot \lambda _{i}(t) dt. \end{aligned}$$

Noting that the constraint \(-y_{k}(t) + \sin ^{2}(v_{k}(t)) - {\dot{y}}_{k}(t) = 0\) holds identically along the trajectory, we have \(L = J\), and hence

$$\begin{aligned} \frac{\partial J}{\partial w_{ij}^{(1)}} = \frac{\partial L}{\partial w_{ij}^{(1)}} = x_{j} \int ^{{\bar{t}}}_{0} \sin 2v_{i}(t) \cdot \lambda _{i}(t) dt. \end{aligned}$$

Define gradient functions by

$$\begin{aligned} \xi _{ij}(t) = x_{j} \int ^{{\bar{t}}}_{t} \sin 2v_{i}(\theta ) \cdot \lambda _{i}(\theta ) d\theta \end{aligned}$$

for \(0 \le t \le {\bar{t}}\) and \(1 \le i \le n, 1 \le j \le m\), clearly,

$$\begin{aligned} \xi _{ij}(0) = \frac{\partial L}{\partial w_{ij}^{(1)}}, \quad \xi _{ij}({\bar{t}})=0, \end{aligned}$$

and

$$\begin{aligned} {\dot{\xi }}_{ij}(t) = -x_{j} \cdot \sin 2v_{i}(t) \cdot \lambda _{i}(t). \end{aligned}$$

Similarly, define

$$\begin{aligned} \eta _{i}(t) = \int ^{{\bar{t}}}_{t} \sin 2v_{i}(\theta ) \cdot \lambda _{i}(\theta ) d\theta \end{aligned}$$

for \(0 \le t \le {\bar{t}}\) and \(1 \le i \le n\), it holds that

$$\begin{aligned} \eta _{i}(0) = \frac{\partial L}{\partial b_{i}}, \quad \eta _{i}({\bar{t}})=0, \end{aligned}$$

and

$$\begin{aligned} {\dot{\eta }}_{i}(t) = - \sin 2v_{i}(t) \cdot \lambda _{i}(t). \end{aligned}$$

The functions \(\xi _{ij}(t)\) and \(\eta _{i}(t)\) satisfy the relationship \(\xi _{ij}(t) = x_{j} \cdot \eta _{i}(t)\); in particular, it holds that

$$\begin{aligned} \xi _{ij}(0) = x_{j} \cdot \eta _{i}(0). \end{aligned}$$
(19)

This means that the gradient \(\xi _{ij}(0)\) can be computed by using the gradient \(\eta _{i}(0)\).

From the above analysis, for each i, we can get a three-dimensional differential equation

$$\begin{aligned} \left\{ \begin{array}{lll} {\dot{y}}_{i} = - y_{i} + \sin ^{2}\left( y_{i} + \displaystyle \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right) \\ {\dot{\lambda }}_{i} = \left[ 1 - \sin 2\left( y_{i} + \displaystyle \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right) \right] \cdot \lambda _{i}\\ \dot{\eta _{i}} = - \sin 2\left( y_{i} + \displaystyle \sum ^{m}_{j=1} w^{(1)}_{ij} x_{j} + b_{i} \right) \cdot \lambda _{i} \end{array} \right. \end{aligned}$$
(20)

with the values at time \({\bar{t}}\) given by

$$\begin{aligned} \left\{ \begin{array}{llll} y_{i}({\bar{t}}) = y_{i}({\bar{t}})\\ \lambda _{i}({\bar{t}}) = \displaystyle \sum ^{r}_{j=1} \delta _{j}^{z} \cdot w_{ji}^{(2)}\\ \eta _{i}({\bar{t}}) = 0 \end{array} \right. \end{aligned}$$
(21)

Solving the above three-dimensional ODE backward from time \({\bar{t}}\) to time 0 for each i yields \(\eta _{i}(0)\). Using the gradient relation (19), \(\xi _{ij}(0)\) can then be computed. Then, by using the gradient method, a learning rule is given by

$$\begin{aligned} \left\{ \begin{array}{ccccc} w_{ij}^{(1)} &{} \longleftarrow &{} w_{ij}^{(1)} &{} - &{} \beta \cdot \xi _{ij}(0)\\ \\ b_{i} &{} \longleftarrow &{} b_{i} &{} - &{} \beta \cdot \eta _{i}(0) \end{array} \right. \end{aligned}$$
(22)

where \(\beta\) is a learning rate.

To facilitate the description of the algorithm that follows, we define a three-dimensional ODE, called the inverse ordinary differential equation (invODE):

$$\begin{aligned} \left\{ \begin{array}{lll} {\dot{p}} = - p + \sin ^{2}\left( p + \gamma \right) \\ {\dot{\lambda }} = \lambda \cdot \left[ 1 - \sin 2\left( p + \gamma \right) \right] \\ {\dot{\eta }} = - \lambda \cdot \sin 2\left( p + \gamma \right) \end{array} \right. \end{aligned}$$
(23)

for \(t \ge 0\). In the invODE (23), the first equation is the neuronODE defined in (12), the second equation is the Lagrange multiplier equation, and the third is the gradient equation. It should be noted that most existing inverse ODEs used for constructing learning algorithms are of very high dimension and are thus computationally inefficient.
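The following minimal sketch (plain Python, scalar case, fixed-step Euler integration; the perception input \(\gamma\), time horizon, and step size are placeholder choices) solves the invODE (23) backward from \({\bar{t}}\) to 0 as in (20)-(21). For illustration, the cost is taken to be \(J = y({\bar{t}})\), so the terminal multiplier is 1, and the resulting \(\eta (0)\) is compared against a finite-difference estimate of \(\partial y({\bar{t}})/\partial \gamma\):

```python
import math

gamma, y0, t_bar, dt = 0.3, 0.0, 5.0, 1e-3
steps = int(t_bar / dt)

def forward(g):
    """Euler-integrate the neuronODE dp/dt = -p + sin^2(p + g) up to t_bar."""
    p = y0
    for _ in range(steps):
        p += dt * (-p + math.sin(p + g) ** 2)
    return p

# Backward pass: integrate the invODE (23) from t_bar down to 0.
# Terminal values follow (21) with J = y(t_bar), hence lambda(t_bar) = 1, eta(t_bar) = 0.
p, lam, eta = forward(gamma), 1.0, 0.0
for _ in range(steps):
    v = p + gamma
    dp = -p + math.sin(v) ** 2
    dlam = (1.0 - math.sin(2 * v)) * lam
    deta = -math.sin(2 * v) * lam
    p, lam, eta = p - dt * dp, lam - dt * dlam, eta - dt * deta

h = 1e-5
fd = (forward(gamma + h) - forward(gamma - h)) / (2 * h)   # finite-difference check
print(eta, fd)   # the two gradient estimates agree up to discretization error
```

Because the forward dynamics are contracting, re-integrating the neuronODE backward can amplify numerical error; a smaller step size (or storing the forward trajectory) tightens the agreement.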

As a summary of the above description, a learning algorithm, called neural memory learning algorithm (nmLA), can be given as follows.

Step 1: Input training data set X.

Step 2: Initialize parameters \(W^{(1)}, W^{(2)}, b, \alpha , \beta , y(0)\) and \({\bar{t}}\).

Step 3: Select a mini-batch \(X_{m} \subset X\).

Step 4: For each \(x \in X_{m}\), for each i, solve the corresponding one-dimensional ODE in (11) to get \(y_{i}({\bar{t}})\).

Step 5: Calculate \(\delta ^{z}_{i}\).

Step 6: For each i, solve the three-dimensional invODE to get \(\eta _{i}(0)\) and using (19) to compute \(\xi _{ij}(0)\).

Step 7: Sum up the gradients \(\xi _{ij}(0)\) and \(\eta _{i}(0)\) over the mini-batch, respectively.

Step 8: Use (17) and (22) for updating \(w_{ij}^{(1)}\), \(w_{ij}^{(2)}\) and \(b_{i}\).

Step 9: Back to step 3 until \(w_{ij}^{(1)}\), \(w_{ij}^{(2)}\) and \(b_{i}\) converge.

Step 10: Return \(W^{(1)}, W^{(2)}\) and b.

In the nmLA, the learning rates \(\alpha\) and \(\beta\) control the convergence speed of the connection weights. Choosing these parameters too large or too small can affect the convergence; in practice, they are often chosen by trial and error. The initial value y(0) of the nmODE is suggested to be chosen from the invariant set S, such as \(y(0)=0\), to ensure fast convergence to the attractor \(y^{*}\). While the terminal time \({\bar{t}}\) could also be optimized by learning, the nmODE converges very quickly, so in practice it suffices to fix an appropriate value.

3.5 Biological point of view on nmODE

Mountcastle (Mountcastle 1978, 1997) proposed the innovative idea that cortical columns have uniform architectures and carry out similar functions, suggesting that brain columns share a common cortical algorithm. More recently, Hawkins (Hawkins and Dawkins 2021; Hawkins et al. 2019) proposed another surprising idea that the unit of intelligence in the brain is the cortical column, each of which can learn to model the world by utilizing the endowed reference frame property. Since columns run a common algorithm, it is necessary for multiple columns to work together to make a decision in order to complete a task. Hawkins believed that this process is achieved through voting. He referred to this theory as “A Thousand Brains”. If this hypothesis is accurate, it could contribute to a better understanding of brain intelligence.

It is widely recognized that memory plays a crucial role in intelligence. If we accept that the cortical column is the unit of intelligence, it follows that each column must possess a memory mechanism. On the other hand, a column in the neocortex may contain several classes of neurons, such as pyramidal neurons, stellate neurons, etc. Different classes of neurons may have different functions in a column. There may be a class of neurons that execute the function of memory; we call such neurons memory neurons. Inspired by the hypothesis of Mountcastle and Hawkins that there is a common algorithm for columns, we propose a hypothesis that there is a common algorithm for the memory neurons in a column; furthermore, all the memory neurons obey the same dynamical system regardless of perception input. Since the memory neurons in a column run the same dynamical system independently, a mechanism is needed to combine them so that the column can make a decision; we propose that this mechanism is realized by decision neurons. Figure 7 gives an illustration.

Fig. 7 Hypothesis on columns and neurons: a column can be considered a unit of intelligence that executes a shared algorithm. Decision-making occurs through voting among a group of columns. Within a column, there is a class of memory neurons that operate using a common algorithm. These memory neurons connect to decision neurons, which facilitate the column’s overall decision-making process

From an artificial intelligence perspective, establishing a mathematical model for columns is crucial for solving practical problems. When building the model, some considerations should be kept in mind, such as: (1) the memory neurons in a column should have the same dynamical system and an embedded memory mechanism, (2) each column should be able to receive perception input and deliver stable output, (3) in the absence of perception inputs, the dynamics of a column should remain in a resting state, (4) the math model should have a simple structure with clear dynamics for analysis, and (5) the math model should have sufficiently high nonlinearity to provide powerful representation capability.

The proposed nmODE appears to satisfy all the requirements listed above and can be viewed as a mathematical model of a column. When a perception input x is received by the column, it delivers a stable output \(y^{*}\), which is the global attractor and can be interpreted as the memory associated with the input x. In the absence of perception input, i.e., \(x=0\), the nmODE reduces to \({\dot{y}} = - y + \sin ^{2}\left[ y + b \right]\). It is easy to see that the state then stays in a resting state \(y^{*}\) which satisfies \(y^{*} = \sin ^{2}\left[ y^{*} + b \right]\). If, in addition, there is no bias, i.e., \(b=0\), then \(y^{*}=0\). This property can be interpreted as the column remaining silent in the absence of perception input.

The nmODE is a decoupled system, which implies that different neurons share the same functional form. In fact, the nmODE can be rewritten as

$$\begin{aligned} {\dot{y}}_{i} = -y_{i} + \sin ^{2} \left( y_{i} + \gamma _{i}\right) \end{aligned}$$

where \(\gamma _{i}\) can be considered the perception input to the \(i\)-th neuron. Removing the neuron index and replacing \(y_{i}\) by the one-dimensional variable p, each neuron obeys the dynamics of the neuronODE (12), which can be viewed as the common dynamical system of all memory neurons.

4 The proposed \(\epsilon\)-network

The proposed nmODE is described by a specific ordinary differential equation. In practical applications, this is sometimes inconvenient to compute, in particular on digital computers. In this section, we propose a discrete version of the nmODE, called the \(\epsilon\)-net, together with an efficient learning algorithm.

4.1 The \(\epsilon\)-net Structure

The proposed \(\epsilon\)-net is a discrete version of the nmODE described by an iterative equation as

$$\begin{aligned} y^{l+1} = (1 - \epsilon ) \cdot y^{l} + \epsilon \cdot \sin ^{2}\left[ y^{l} + W^{(1)} x + b \right] \end{aligned}$$
(24)

where \(0< \epsilon < 1\) is a parameter and l denotes the layer number. Given an initial point \(y^{0} \in R^{n}\), the set \(\{\left. y^{l} \right| l \ge 0\}\) forms a trajectory in \(R^{n}\). Digital computation of the \(\epsilon\)-net simply iterates from layer l to layer \(l+1\); Fig. 8 shows an intuitive illustration.

Fig. 8 The proposed \(\epsilon\)-net. The red circles represent the input neurons. The dashed circles denote the virtual neurons with the \(\sin ^{2}\) function as activation function. The remaining circles denote the network inner representation neurons with the linear function as activation function
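A minimal NumPy sketch of the \(\epsilon\)-net iteration (24) (the weights, bias, and input are arbitrary placeholder values; \(\epsilon\) and the number of iterations are arbitrary choices), checking that the iterates settle at the fixed point of equation (8):

```python
import numpy as np

eps = 0.3
W1 = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])   # placeholder learning connections (n x m)
b = np.zeros(5)                                      # placeholder bias
x = np.array([1.0])                                  # placeholder external input

y = np.zeros(5)                                      # initial point y^0
for _ in range(1000):                                # y^{l+1} = (1 - eps) y^l + eps sin^2(y^l + W1 x + b)
    y = (1 - eps) * y + eps * np.sin(y + W1 @ x + b) ** 2

print(np.abs(y - np.sin(y + W1 @ x + b) ** 2).max())   # ~0: the iterates settle at the attractor y*
```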

The \(\epsilon\)-net is in fact a discrete-time dynamical system. As in the nmODE, the equilibrium point \(y^{*}\) is defined by

$$\begin{aligned} y^{*} = \sin ^{2}\left[ y^{*} + W^{(1)} x + b \right] . \end{aligned}$$

Given an input x, clearly, the \(\epsilon\)-net has one and only one equilibrium point \(y^{*}\). The equilibrium point \(y^{*}\) is called a global attractor if, for any initial point \(y^{0} \in R^{n}\), the trajectory \(\{\left. y^{l} \right| l \ge 0\}\) starting from \(y^{0}\) converges to \(y^{*}\), i.e.,

$$\begin{aligned} y^{l} \longrightarrow y^{*} \end{aligned}$$

as \(l \rightarrow +\infty\). If the above convergence holds for all \(y^{0} \in R^{n}\) except for a set of measure zero, we call \(y^{*}\) an almost global attractor.

In Sect. 3, it was proved that the nmODE has one and only one global attractor. A natural question is whether the \(\epsilon\)-net also possesses this property. From the theory of ODEs, the fact that an ODE has the global attractor property does not, in general, guarantee that its discrete version also has this property. For example, the ODE

$$\begin{aligned} {\dot{y}} = -y - y^{3} \end{aligned}$$

has the global attractor property, in fact, \(y^{*}=0\) is the global attractor. However, its discrete version

$$\begin{aligned} y^{l+1} = (1-\epsilon ) y^{l} -\epsilon \left( y^{l} \right) ^3 \end{aligned}$$

does not have this property for any \(\epsilon \in (0, 1)\). In fact, it can be easily checked that, for a given \(\epsilon\), any trajectory starting from a sufficiently large initial value, for example \(\vert y^{0}\vert \ge 2/\sqrt{\epsilon }\), will diverge. Thus, it is necessary to check the global attractor property of the \(\epsilon\)-net. Fortunately, we can prove that the proposed \(\epsilon\)-net does possess an almost global attractor property.
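A quick numerical illustration of the divergence of the naive discretization above (plain Python; the particular values of \(\epsilon\), the initial points, and the step counts are arbitrary, with the diverging case chosen so that \(\vert y^{0}\vert \ge 2/\sqrt{\epsilon }\)):

```python
def iterate(eps, y, steps):
    """Iterate the discretization y <- (1 - eps) * y - eps * y**3."""
    for _ in range(steps):
        y = (1 - eps) * y - eps * y ** 3
    return y

print(iterate(0.01, 3.0, 500))   # small eps: mimics the ODE and decays toward 0
print(iterate(0.5, 3.0, 6))      # eps = 0.5, |y^0| = 3 >= 2 / sqrt(0.5): blows up rapidly
```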

Property 5

Given any input x, the \(\epsilon\)-net has one and only one almost global attractor.

Proof

Given any input x, suppose that \(y^{*}\) is the unique equilibrium of the \(\epsilon\)-net. For any initial \(y^{0}\), we have

$$\begin{aligned} y^{l+1} - y^{*}= & {} (1-\epsilon ) \left[ y^{l} - y^{*}\right] \\{} & {} + \epsilon \cdot \left[ \sin ^{2}\left( y^{l} + W^{(1)} x + b \right) - \sin ^{2}\left( y^{*} + W^{(1)} x + b \right) \right] . \end{aligned}$$

Then, it gives that

$$\begin{aligned} \left| y^{l+1}_{i} - y^{*}_{i}\right|\le & {} (1-\epsilon ) \left| y^{l}_{i} - y^{*}_{i} \right| \\{} & {} + \epsilon \cdot \left| \sin (2\theta _{i}^{l}) \right| \cdot \left| y^{l}_{i} - y^{*}_{i} \right| \\= & {} \delta _{i}^{l} \cdot \left| y^{l}_{i} - y^{*}_{i} \right| \\\le & {} \displaystyle \prod _{k=0}^{l} \delta _{i}^{k} \cdot \left| y^{0}_{i} - y^{*}_{i} \right| , \end{aligned}$$

where \(\delta _{i}^{l} = 1 - \epsilon \left( 1-\left| \sin (2\theta _{i}^{l})\right| \right)\), and \(\theta _{i}^{l}\) is located between \(y^{l}_{i} + \sum ^{m}_{j=1}w_{ij} x_{j} + b_{i}\) and \(y^{*}_{i} + \sum ^{m}_{j=1}w_{ij} x_{j} + b_{i}\). Since \(0 < \delta _{i}^{l} \le 1\), and the measure of the set \(S_{i} = \{l \vert \delta _{i}^{l} = 1, l \ge 1\}\) is zero, thus, almost

$$\begin{aligned} \prod _{k=0}^{l} \delta _{i}^{k} \longrightarrow 0 \end{aligned}$$

as \(l \rightarrow +\infty\), and almost

$$\begin{aligned} y^{l}_{i} \longrightarrow y^{*}_{i}, (i=1, \cdots , n) \end{aligned}$$

as \(l \rightarrow +\infty\). The proof is complete. \(\square\)

The above proof suggests that the \(\epsilon\)-net has an almost global attractor property. We hypothesize that it actually possesses the global attractor property, but we are currently unable to prove this.

Similarly to the nmODE, the \(\epsilon\)-net defines a nonlinear mapping from the input x to the global attractor \(y^{*}\), i.e.,

$$\begin{aligned} \epsilon -\text{ net-mapping }: x \longrightarrow y^{*}. \end{aligned}$$

Figure 9 gives an intuitive illustration for this mapping.

Fig. 9 The \(\epsilon\)-net mapping. It maps the input x to the associated global attractor \(y^{*}\)

Regarding attractors as memories, the \(\epsilon\)-net-mapping establishes a relationship between input x and the stored memory \(y^{*}\).

4.2 The proposed \(\epsilon\)-LA

The \(\epsilon\)-net establishes a nonlinear mapping from the input x to the global attractor \(y^{*}\) as the network’s output. To apply this nonlinear mapping in practice, a learning algorithm must be developed. In this paper, we propose an efficient learning algorithm for the \(\epsilon\)-net, called \(\epsilon\)-LA.

In order to establish a learning algorithm, we define the network output at layer \(l+1\) as

$$\begin{aligned} \left\{ \begin{array}{ll} z^{l+1}_{i} = \displaystyle \sum _{j=1}^{n} w_{ij}^{(2)} y_{j}^{l+1}\\ a^{l+1}_{i} = s\left( z^{l+1}_{i} \right) \\ v_{i}^{l} = y_{i}^{l} + \displaystyle \sum _{j=1}^{m}w^{(1)}_{ij} x_{j} + b_{i} \end{array} \right. \end{aligned}$$
(25)

where \(z^{l+1} \in R^{r}\), \(a^{l+1} \in R^{r}\), \(W^{(2)} = \left( w^{(2)}_{ij}\right) _{r \times n}\) is an \(r \times n\) matrix, and s is an activation function that can be selected according to the application; for example, in classification problems, the softmax activation function is frequently used. We call neurons representing \(a^{l+1}\) decision-making neurons. Figure 10 gives an illustration of the computation.

Fig. 10 The proposed \(\epsilon\)-net learning algorithm. The bold green circles represent decision-making neurons with the softmax function as activation function

Suppose that the network is required to learn a target \(d_{i}\) \((i=1, \cdots , r)\). Using the output \(a_{i}^{l+1}\) together with the learning target \(d_{i}\), a cost function \(J^{l+1}\) can be constructed, for example the well-known cross entropy. Clearly, \(J^{l+1}\) is a function of \(z_{i}^{l+1}\) and \(w^{(2)}_{ij}\). To use the gradient method, we first need to calculate the gradients:

$$\begin{aligned} \frac{\partial J^{l+1}}{\partial w_{ij}^{(2)}}, \quad \frac{\partial J^{l+1}}{\partial w_{ij}^{(1)}}, \quad \frac{\partial J^{l+1}}{\partial b_{i}}. \end{aligned}$$

Define

$$\begin{aligned} \left\{ \begin{array}{ll} \delta _{i}^{l+1}(z) = \displaystyle \frac{\partial J^{l+1}}{\partial z_{i}^{l+1}},\\ \\ \delta _{i}^{l}(v) = \displaystyle \frac{\partial J^{l+1}}{\partial v_{i}^{l}}. \end{array} \right. \end{aligned}$$

The \(\delta _{i}^{l+1}(z)\) can be calculated directly. For example, when \(d_{1}+\cdots + d_{r} = 1\), s is the softmax function, and \(J^{l+1}\) is the cross entropy:

$$\begin{aligned} J^{l+1} = -\sum ^{r}_{i=1} d_{i} \cdot \log \left( a_{i}^{l+1}\right) \end{aligned}$$

then, \(\delta _{i}^{l+1}(z) = a_{i}^{l+1} - d_{i}\).
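A small NumPy check of this well-known softmax/cross-entropy identity (the test values are arbitrary):

```python
import numpy as np

z = np.array([0.2, -1.3, 0.8])    # readout z^{l+1}
d = np.array([0.0, 1.0, 0.0])     # one-hot target, d_1 + ... + d_r = 1
a = np.exp(z) / np.exp(z).sum()   # softmax output a^{l+1}

def J(z):
    """Cross entropy J^{l+1} = -sum_i d_i log a_i as a function of z."""
    a = np.exp(z) / np.exp(z).sum()
    return -(d * np.log(a)).sum()

h = 1e-6
num_grad = np.array([(J(z + h * e) - J(z - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(num_grad, a - d, atol=1e-6))   # True: delta^{l+1}(z) = a - d
```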

Using \(\delta _{i}^{l+1}(z)\), we can calculate the gradient

$$\begin{aligned} \frac{\partial J^{l+1}}{\partial w_{ij}^{(2)}} = \frac{\partial J^{l+1}}{\partial z_{i}^{l+1}} \cdot \frac{\partial z_{i}^{l+1}}{\partial w_{ij}^{(2)}} = \delta _{i}^{l+1}(z) \cdot y_{j}^{l+1}. \end{aligned}$$
(26)

Next, we calculate \(\frac{\partial J^{l+1}}{\partial w_{ij}^{(1)}}\) and \(\frac{\partial J^{l+1}}{\partial b_{i}}\). We have

$$\begin{aligned} \delta _{i}^{l}(v)= & {} \frac{\partial J^{l+1}}{\partial y_{i}^{l+1}} \cdot \frac{\partial y_{i}^{l+1}}{\partial v_{i}^{l}} \nonumber \\= & {} \epsilon \cdot \sin (2v_{i}^{l}) \cdot \frac{\partial J^{l+1}}{\partial y_{i}^{l+1}} \nonumber \\= & {} \epsilon \cdot \sin (2v_{i}^{l}) \cdot \sum ^{r}_{j=1}\frac{\partial J^{l+1}}{\partial z_{j}^{l+1}} \cdot \frac{\partial z_{j}^{l+1}}{\partial y_{i}^{l+1}} \nonumber \\= & {} \epsilon \cdot \sin (2v_{i}^{l}) \cdot \sum ^{r}_{j=1}w^{(2)}_{ji} \cdot \delta _{j}^{l+1}(z). \end{aligned}$$
(27)

Using

$$\begin{aligned} y_{i}^{l+1} = (1-\epsilon ) y_{i}^{l} + \epsilon \cdot \sin ^{2}(v_{i}^{l}) \end{aligned}$$

and noting that \(y_{i}^{0}\) is independent of \(w_{ij}^{(1)}x_{j}\) and \(b_{i}\), it gives that

$$\begin{aligned} \left\{ \begin{array}{llll} \displaystyle \frac{\partial y_{i}^{l+1}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } = (1-\epsilon ) \frac{\partial y_{i}^{l}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } + \epsilon \cdot \sin (2 v_{i}^{l}) \cdot \left[ 1 + \frac{\partial y_{i}^{l}}{\partial \left( w_{ij}^{(1)}x_{j}\right) }\right] \\ \\ \displaystyle \frac{\partial y_{i}^{l+1}}{\partial b_{i}} = (1-\epsilon ) \frac{\partial y_{i}^{l}}{\partial b_{i}} + \epsilon \cdot \sin (2 v_{i}^{l}) \cdot \left[ 1 + \frac{\partial y_{i}^{l}}{\partial b_{i}}\right] \\ \\ \displaystyle \frac{\partial y_{i}^{0}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } = 0\\ \\ \displaystyle \frac{\partial y_{i}^{0}}{\partial b_{i}} = 0. \end{array} \right. \end{aligned}$$
(28)

Using \(\delta _{i}^{l+1}(z)\) and \(\delta _{i}^{l}(v)\), we can calculate the gradient

$$\begin{aligned} \frac{\partial J^{l+1}}{\partial w_{ij}^{(1)}} = \frac{\partial J^{l+1}}{\partial v_{i}^{l}} \cdot \frac{\partial v_{i}^{l}}{\partial w_{ij}^{(1)}} = \delta _{i}^{l}(v) \cdot \frac{\partial v_{i}^{l}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } \cdot x_{j} = \delta _{i}^{l}(v) \cdot \left[ 1 + \frac{\partial y_{i}^{l}}{\partial \left( w_{ij}^{(1)}x_{j}\right) } \right] x_{j}. \end{aligned}$$
(29)

Similarly, we have

$$\begin{aligned} \frac{\partial J^{l+1}}{\partial b_{i}} = \delta _{i}^{l}(v) \cdot \left[ 1 + \frac{\partial y_{i}^{l}}{\partial b_{i}} \right] . \end{aligned}$$
(30)

Then, by using the well-known gradient method, the learning rule is given as

$$\begin{aligned} \left\{ \begin{array}{ccl} w_{ij}^{(2)} &{} \longleftarrow &{} w_{ij}^{(2)} - \alpha \cdot \delta _{i}^{l+1}(z) \cdot y_{j}^{l+1}\\ \\ w_{ij}^{(1)} &{} \longleftarrow &{} w_{ij}^{(1)} - \beta \cdot \delta _{i}^{l}(v) \cdot \left[ x_{j} + \displaystyle \frac{\partial y_{i}^{l}}{\partial w_{ij}^{(1)}}\right] \\ \\ b_{i} &{} \longleftarrow &{} b_{i} - \beta \cdot \delta _{i}^{l}(v) \cdot \left[ 1 + \displaystyle \frac{\partial y_{i}^{l}}{\partial b_{i}}\right] \end{array} \right. \end{aligned}$$
(31)

where \(\alpha\) and \(\beta\) are the learning rates.

Directly solving the iteration equations (24) and (28) is not easy. Fortunately, these equations are decoupled in their variables. Next, we define a mathematical model of a neuron in discrete time as well as an adjoint equation. These equations can be viewed as the basic building blocks for describing the learning algorithm.

The mathematical model of a neuron in discrete time, called the neuronDTE, is defined as

$$\begin{aligned} p^{l+1} = (1-\epsilon ) p^{l} + \epsilon \cdot \sin ^{2} \left( p^{l} + \gamma \right) , \end{aligned}$$
(32)

where \(\gamma\) can be viewed as the perception input to the neuron.

The adjoint equation of neuronDTE, called neuronADE, is defined as

$$\begin{aligned} \left\{ \begin{array}{ll} p^{l+1} = (1-\epsilon ) p^{l} + \epsilon \cdot \sin ^{2} \left( p^{l} + \gamma \right) \\ \\ q^{l+1} = (1-\epsilon ) q^{l} +\epsilon \cdot \sin 2\left( p^{l} + \gamma \right) \cdot \left( 1 + q^{l}\right) , \end{array} \right. \end{aligned}$$
(33)

Clearly, the neuronADE is a two-dimensional equation.
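A minimal sketch of the neuronDTE (32) and its adjoint neuronADE (33) in plain Python (the values of \(\epsilon\), \(\gamma\), and the number of layers are placeholders). Here the second component \(q^{l}\) tracks the sensitivity \(\partial y_{i}^{l} / \partial ( w_{ij}^{(1)} x_{j})\) appearing in (28)-(31), which can be checked against a finite difference of the neuronDTE with respect to \(\gamma\):

```python
import math

eps, gamma, L = 0.1, 0.4, 200

def neuron_dte(g, steps=L, p=0.0):
    """neuronDTE (32): p^{l+1} = (1 - eps) p^l + eps sin^2(p^l + g)."""
    for _ in range(steps):
        p = (1 - eps) * p + eps * math.sin(p + g) ** 2
    return p

# neuronADE (33): iterate p together with its sensitivity q = dp/dgamma (q^0 = 0)
p, q = 0.0, 0.0
for _ in range(L):
    s = math.sin(2 * (p + gamma))
    p, q = (1 - eps) * p + eps * math.sin(p + gamma) ** 2, (1 - eps) * q + eps * s * (1 + q)

h = 1e-6
fd = (neuron_dte(gamma + h) - neuron_dte(gamma - h)) / (2 * h)
print(q, fd)   # the adjoint recursion reproduces the finite-difference sensitivity
```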

From the above analysis, a learning algorithm, called \(\epsilon\)-LA, can be described as:

Step 1: Input training data set X.

Step 2: Initialize parameters \(W^{(1)}\), \(W^{(2)}\), b, \(\alpha\), \(\beta\), \(\epsilon\).

Step 3: Select a mini-batch \(X_{m} \subset X\) and set \(l \leftarrow 0\).

Step 4: For each \(x \in X_{m}\), for each i, calculate \(y_{i}^{l+1}\) by neuronDTE (32) and the network output \(a_{i}^{l+1}\).

Step 5: Calculate \(\delta _{i}^{l+1}\left( z\right)\).

Step 6: Accumulate \(\frac{\partial J^{l+1}}{\partial w^{(2)}_{ij}}\) by (26).

Step 7: Calculate \(\delta ^{l}_{i}(v)\).

Step 8: Accumulate \(\frac{\partial J^{l+1}}{\partial w^{(1)}_{ij}}\) and \(\frac{\partial J^{l+1}}{\partial b_{i}}\) by (29) and (30) with \(q^{l}_{i}\).

Step 9: Calculate \(q^{l+1}\) by neuronADE (33).

Step 10: Update \(w_{ij}^{(2)}\), \(w_{ij}^{(1)}\), and \(b_{i}\) by using (31) with \(\frac{\partial J^{l+1}}{\partial w^{(2)}_{ij}}\), \(\frac{\partial J^{l+1}}{\partial w^{(1)}_{ij}}\) and \(\frac{\partial J^{l+1}}{\partial b_{i}}\).

Step 11: Back to Step 3 with \(l \leftarrow l + 1\) until \(W^{(1)}\), \(W^{(2)}\), and b converge.

The proposed \(\epsilon\)-LA has two obvious advantages: (1) unlike traditional feedforward networks, it does not require the number of network layers to be designed in advance; (2) there is no gradient vanishing problem, since gradient backpropagation happens over only one layer at each step. A detailed implementation of the \(\epsilon\)-LA is given in the Appendix.
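To make the flow of Steps 1-11 concrete, the following is a compact NumPy sketch of the \(\epsilon\)-LA loop on a single training pair (x, d) (random placeholder data and initialization; the layer count, learning rates, and \(\epsilon\) are arbitrary choices, and a batched PyTorch implementation would follow the same structure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, eps, alpha, beta, L = 16, 4, 3, 0.2, 0.05, 0.05, 100

W1 = 0.1 * rng.standard_normal((n, m))   # input -> memory neurons
W2 = 0.1 * rng.standard_normal((r, n))   # memory -> decision-making neurons
b = np.zeros(n)

x = rng.standard_normal(m)               # one training input
d = np.eye(r)[1]                         # its one-hot target

y = np.zeros(n)                          # y^0
q = np.zeros(n)                          # q^0 = dy^0/d(w x + b) = 0
for l in range(L):
    v = y + W1 @ x + b                                      # v^l in (25)
    y_next = (1 - eps) * y + eps * np.sin(v) ** 2           # epsilon-net step (24)
    z = W2 @ y_next
    a = np.exp(z) / np.exp(z).sum()                         # softmax output a^{l+1}
    delta_z = a - d                                         # delta^{l+1}(z) for cross entropy
    delta_v = eps * np.sin(2 * v) * (W2.T @ delta_z)        # delta^l(v) as in (27)
    W2 -= alpha * np.outer(delta_z, y_next)                 # update rules (26), (31)
    W1 -= beta * np.outer(delta_v * (1 + q), x)             # (29), (31)
    b -= beta * delta_v * (1 + q)                           # (30), (31)
    q = (1 - eps) * q + eps * np.sin(2 * v) * (1 + q)       # neuronADE (33)
    y = y_next

z = W2 @ y
print("cross entropy:", -np.log(np.exp(z)[1] / np.exp(z).sum()))   # compare with the initial value log(3)
```

Each pass uses only the current layer's quantities, which is why no gradient has to be propagated back through earlier layers.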

5 Experiments

In this section, we conduct experiments to showcase the effectiveness of the proposed theory. We first present results on the widely known MNIST dataset for classification. We also report results on real-world ultrasonic image data for breast cancer classification, where we train multiple models and combine them by voting, and we evaluate the learning ability of \(\epsilon\)-LA on unlimited-length sequence data generated by the Embedded Reber Grammar. The experiments are implemented in PyTorch using NVIDIA RTX 3090 GPUs on an Ubuntu server.

5.1 Experiment on MNIST data

5.1.1 Classification results of nmODE

We first carry out the experiment of the nmODE on the MNIST dataset, a well-known benchmark of handwritten digits from 0 to 9. The training set and the testing set contain 50,000 and 10,000 images, respectively, each of size \(28 \times 28\). Five nmODEs are independently trained by using the nmLA. The accuracy of each network and the voting result on the testing set are summarized in Table 1. As can be seen in the table, all nmODEs achieve competitive accuracy of over 99.52%. By voting among the five models, the accuracy reaches 99.71%.

Table 1 Accuracy of the nmODE on the testing set of MNIST dataset
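
The voting step itself is straightforward; a minimal sketch of majority voting over the per-model predictions is shown below (the tie-breaking rule is an assumption, since the paper does not specify one).

```python
import numpy as np

def majority_vote(prob_list):
    # prob_list: per-model class probabilities, each of shape [n_samples, n_classes].
    votes = np.stack([p.argmax(axis=1) for p in prob_list], axis=1)
    n_classes = prob_list[0].shape[1]
    counts = np.apply_along_axis(lambda v: np.bincount(v, minlength=n_classes), 1, votes)
    return counts.argmax(axis=1)   # ties go to the smallest class index

# Example with five hypothetical model outputs on three samples and ten classes.
rng = np.random.default_rng(0)
predicted_labels = majority_vote([rng.random((3, 10)) for _ in range(5)])
```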

For the classification task on the MNIST dataset, previous neural differential equation models such as SONODEs (Norcliffe et al. 2020), ANODEs (Dupont et al. 2019), and NODEs (Chen et al. 2018) use CNN blocks in their architectures, while the proposed nmODEs take the raw image directly as input. As shown in Table 2, nmODEs achieve the highest accuracy among all the compared neural differential equations without any CNN layers, confirming the effectiveness of nmODEs.

Table 2 nmODE comparison to other neuralODEs on the testing set of MNIST dataset

The number of function evaluations (NFE) is a measure of the computational cost of an optimization algorithm, which directly affects the time and resources required to find an optimal solution. As shown in Table 2, nmODEs achieve the highest accuracy with the lowest NFE among all the compared neural differential equations, which reflects the computational efficiency of nmODEs.

5.1.2 Classification results of \(\epsilon\)-net

We further conduct experiments on the MNIST dataset using the proposed \(\epsilon\)-net. Ten \(\epsilon\)-nets are independently trained by the \(\epsilon\)-LA algorithm. The accuracy of each network and the voting result on the testing set are summarized in Table 3. As can be seen in the table, the majority of the \(\epsilon\)-nets achieve competitive accuracy of over 98.4%, with the exception of the fourth model at 98.39%. By voting among the ten models, the accuracy is raised to 98.85%, confirming the effectiveness of the voting mechanism.

Table 3 Accuracy of the \(\epsilon\)-net on the testing set of MNIST dataset

5.1.3 Visualization of attractors

We have also created a visualization of the attractors for the MNIST dataset, which can be seen in Fig. 11. To produce this visualization, we used a network consisting of two nmODEs, with 2048 and 64 memory neurons, respectively. The attractors generated by the second nmODE were reshaped to \(8\times 8\) to make them suitable for visualization. As depicted in Fig. 11, the attractors belonging to the same class are very similar, while those belonging to different classes are markedly distinct. These results demonstrate that the proposed nmODE has effectively captured the essential characteristics of the ten classes in the MNIST dataset.

Fig. 11

Visualization of attractors for the MNIST dataset. The attractors belonging to the same class are very similar, while those belonging to different classes are markedly distinct
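
As a rough illustration of how such a picture can be produced, the sketch below approximates the attractor \(y^{*}\) of an nmODE for a given input by iterating the \(\epsilon\)-discretization to a steady state and reshaping it to \(8\times 8\); the random weights, fixed step size, and iteration count are placeholders rather than the trained network used for Fig. 11.

```python
import numpy as np
import matplotlib.pyplot as plt

def nmode_attractor(x, W1, b, eps=0.1, n_steps=2000):
    # Approximate the attractor of the nmODE for input x by iterating the
    # epsilon-discretization (a simple fixed-step scheme) until it settles.
    y = np.zeros(W1.shape[0])
    for _ in range(n_steps):
        y = (1 - eps) * y + eps * np.sin(y + W1 @ x + b) ** 2
    return y

# Placeholder second-stage nmODE with 64 memory neurons and a random 2048-d input.
rng = np.random.default_rng(0)
W1, b = 0.1 * rng.standard_normal((64, 2048)), np.zeros(64)
y_star = nmode_attractor(rng.random(2048), W1, b)
plt.imshow(y_star.reshape(8, 8), cmap="gray")
plt.title("attractor reshaped to 8 x 8")
plt.show()
```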

In addition to the attractor visualization, we have also included t-SNE results for the MNIST dataset in Fig. 12.

Fig. 12

t-SNE visualization of the MNIST dataset

It is evident that samples belonging to the same class are closely clustered together, while those belonging to different classes are clearly separated.

5.2 Experiment on CIFAR-10 data

We evaluated the proposed nmODE on the CIFAR-10 dataset, which is widely used for computer vision tasks. It comprises 60,000 \(32 \times 32\) color images in 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Each class consists of 6,000 images, of which 5,000 are used for training and 1,000 for testing. The dataset is designed to evaluate image classification algorithms, with the aim of accurately assigning each image to one of the ten classes.

For the classification task on the CIFAR-10 dataset, previous neural differential equation models such as ANODEs (Dupont et al. 2019) use CNN blocks in their architectures to extract features from the \(32 \times 32 \times 3\) RGB images. In contrast, the proposed nmODEs directly flatten each RGB image into a 3072-dimensional vector. Table 4 shows that nmODEs outperform ANODEs, achieving \(70.73\%\) accuracy without the use of any CNN blocks.

Table 4 nmODE comparison to ANODE on the testing set of CIFAR-10 dataset

To further evaluate the proposed nmODE, we conducted experiments on the CIFAR-10 dataset using both the nmODE and ViT-B/16 (Dosovitskiy et al. 2021). The network architecture comprised two parts: an ImageNet-pretrained ViT-B/16 that downscales each \(32 \times 32 \times 3\) image to a 512-dimensional feature vector, and the proposed nmODE. As shown in Table 5, nmODEs outperform ViT-B/16 by achieving \(98.73\%\) accuracy on the CIFAR-10 dataset.

Table 5 nmODE comparison to ViT-B/16 on the testing set of CIFAR-10 dataset

5.3 Experiment on breast cancer classification

We evaluated the proposed nmODE and \(\epsilon\)-net on a large-scale dataset for breast cancer classification. The task was to classify ultrasound images into the medical assessment categories of the Breast Imaging-Reporting and Data System (BI-RADS): B1, B2, B3, B4a, B4b, B4c, and B5. The dataset consisted of a large number of training and testing images, as summarized in Table 6. We trained ten nmODEs and \(\epsilon\)-nets independently, each with a different number of hidden neurons.

The network architecture consisted of two parts. The first part was an ImageNet-pretrained EfficientNet-B3 (Tan and Le 2019) that downscaled each \(256 \times 256 \times 3\) image to a 1536-dimensional feature vector. The second part was either the proposed nmODE or the \(\epsilon\)-net.
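
A minimal PyTorch sketch of this two-part architecture is given below, using torchvision's EfficientNet-B3 as the feature extractor and an \(\epsilon\)-net-style head; the hidden size, number of iterations, and linear readout are assumptions made for illustration, and the same pattern applies to the ViT-B/16 setup of Section 5.2.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b3, EfficientNet_B3_Weights

class EpsilonNetHead(nn.Module):
    # Iterates y^{l+1} = (1 - eps) y^l + eps sin^2(y^l + W^(1) x + b) for a fixed
    # number of steps, then applies an assumed linear readout to produce logits.
    def __init__(self, in_dim=1536, hidden=256, n_classes=7, eps=0.1, n_steps=50):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hidden)       # realizes W^(1) x + b
        self.readout = nn.Linear(hidden, n_classes)
        self.eps, self.n_steps = eps, n_steps

    def forward(self, x):
        gamma = self.W1(x)                        # perception input to the memory neurons
        y = torch.zeros_like(gamma)
        for _ in range(self.n_steps):
            y = (1 - self.eps) * y + self.eps * torch.sin(y + gamma) ** 2
        return self.readout(y)

backbone = efficientnet_b3(weights=EfficientNet_B3_Weights.IMAGENET1K_V1)
backbone.classifier = nn.Identity()               # expose the 1536-d feature vector
model = nn.Sequential(backbone, EpsilonNetHead())
logits = model(torch.rand(2, 3, 256, 256))        # BI-RADS logits for two images
```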

The numbers of ultrasound images in the training and testing sets are summarized in Table 6.

Table 6 Number of ultrasound images in the training and testing sets

5.3.1 Classification results

The experimental results for the nmODE and \(\epsilon\)-net are presented in Table 7. As shown in the table, the nmODE achieved an accuracy of \(83.42\%\), and the \(\epsilon\)-net achieved an accuracy of \(80.97\%\). All models had fewer than 1M parameters and performed well, approaching human diagnostic ability.

Table 7 Accuracies of the \(\epsilon\)-net and nmODE on the testing set of Ultrasound dataset

5.3.2 Visualization of attractors

We also visualized the attractors of the \(\epsilon\)-net for the Ultrasound dataset, as depicted in Fig. 13.

Fig. 13

Visualization of attractors for the Ultrasound dataset. Each row represents ten samples belonging to the same class

The network used for visualization consisted of a MobileNetV3 backbone and the proposed \(\epsilon\)-net with 256 hidden neurons. Each attractor \(y^{*}\) was reshaped to \(16\times 16\) for visualization. From the visualization results, it is clear that the attractors for the same class are quite similar, while those belonging to different classes are distinct.

In addition to the visualization of attractors, we also included t-SNE results for the Ultrasound dataset in Fig. 14.

Fig. 14

t-SNE visualization of the Ultrasound dataset

These visualization results demonstrate that the proposed \(\epsilon\)-net has learned the essential characteristics of BI-RADS categories.

5.4 Experiment on sequence learning

In this subsection, we evaluated the proposed \(\epsilon\)-net and \(\epsilon\)-LA on an unlimited-length sequence learning task. The \(\epsilon\)-LA has the advantage of not requiring any prior knowledge of the number of network layers, which allows it to learn sequences of unlimited length. It should be noted that, to enable the \(\epsilon\)-LA to learn sequences, the input x to the \(\epsilon\)-net must vary with l, i.e., \(x=x^{l}\), where \(x^{l}\) is the l-th element of the input sequence. The \(\epsilon\)-net can then be rewritten as:

$$\begin{aligned} y^{l+1} = (1 - \epsilon ) \cdot y^{l} + \epsilon \cdot \sin ^{2}\left[ y^{l} + W^{(1)} x^{l} + b \right] . \end{aligned}$$
(34)

It can be easily shown, following Subsection 4.2, that the \(\epsilon\)-LA still holds in this situation.
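
A minimal sketch of this sequence form is given below, using the 7-dimensional one-hot encoding described later in this subsection; the random weights and the example string are placeholders and are not meant to be a grammatical ERG sample.

```python
import numpy as np

SYMBOLS = ["B", "T", "P", "S", "V", "X", "E"]                 # ERG alphabet
ONE_HOT = {s: np.eye(len(SYMBOLS))[i] for i, s in enumerate(SYMBOLS)}

def epsilon_net_sequence(sequence, W1, b, eps=0.1):
    # Runs Eq. (34): at step l the input x^l is the one-hot code of the l-th symbol.
    y = np.zeros(W1.shape[0])
    states = []
    for sym in sequence:
        y = (1 - eps) * y + eps * np.sin(y + W1 @ ONE_HOT[sym] + b) ** 2
        states.append(y.copy())
    return np.stack(states)

# Toy usage with randomly initialized weights and a short (not necessarily valid) string.
rng = np.random.default_rng(0)
W1, b = 0.1 * rng.standard_normal((16, 7)), np.zeros(16)
hidden_states = epsilon_net_sequence("BTSSXXTVVE", W1, b)
```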

In this study, we used a toy sequence learning benchmark, the Embedded Reber Grammar (ERG) (Hochreiter and Schmidhuber 1997; Wang et al. 2018). The ERG is a sequence generator that follows the grammar rules illustrated in Fig. 15.

Fig. 15

The Embedded Reber Grammar is represented as a state machine, where each node has branching paths with a state transition probability of 0.5. The generated sequence is composed of characters from the set \(\{\)“B”, “T”, “P”, “S”, “V”, “X”, “E”\(\}\)

The \(\epsilon\)-net faced two challenges in learning the ERG-generated sequences. First, a sequence generated by the ERG may be of unlimited length due to the presence of inner loops in the ERG graph. Second, the prediction of the later "T" or "P" (shown in red in Fig. 15) depends on a symbol observed much earlier in the sequence, which requires the ability to learn long-term memory.

Dataset and implementation: In this experiment, 10,000 distinct sequences were randomly generated using the ERG and divided into two sets, with \(70\%\) used for training and the remaining \(30\%\) for testing. Each symbol in a generated sequence was represented by a 7-dimensional one-hot vector. The batch size was set to 64, and the learning rate was set to 0.001.

Experimental results: Various recurrent neural networks, such as LSTM (Hochreiter and Schmidhuber 1997), AMU-RNN (Wang et al. 2018), and SRNN (He et al. 2022), have been applied to this simple sequence learning task. Notably, the \(\epsilon\)-net achieved \(100\%\) accuracy with only 240 trainable parameters, whereas LSTM, AMU-RNN, and SRNN required 264, 883, and 675 parameters, respectively. These findings demonstrate that the nmODE possesses long-term memory ability.

6 Discussions and conclusions

A theory of memory-based neuralODEs for artificial neural networks has been proposed in this paper. It comprises two network models and two learning algorithms. A common problem of using neuralODEs in practical applications is that the computation is very time-consuming, and sometimes even infeasible when the neuralODEs are high-dimensional. Existing neuralODEs usually use CNN blocks to transform high-dimensional data into low-dimensional data before feeding it into the neuralODEs. The proposed nmODE, \(\epsilon\)-net, nmLA, and \(\epsilon\)-LA are built on the blocks of the neuronODE (one-dimensional), invODE (three-dimensional), neuronDTE (one-dimensional), and neuronADE (two-dimensional), respectively. Since all of these blocks are low-dimensional, they can be easily computed. This paper presents all computations in serial form for clarity; nevertheless, all of them can be reformulated for parallel computing. Experiments on classification tasks as well as a simple sequence learning task have been carried out to demonstrate the excellent performance of the proposed theory. The methods of this paper could be further developed to solve other problems in artificial intelligence, such as segmentation and clustering.

The theory presented in this paper could be further generalized in several aspects. Some of the possible generalizations are listed as follows:

  1.

    The neuronODE could be generalized to have higher nonlinearity; for example, one can define the neuronODE as

    $$\begin{aligned} {\dot{p}} = -p + f_{k}\left[ p + f_{k-1} \left[ p + \cdots + f_{1} \left[ p + \gamma \right] \cdots \right] \right] \end{aligned}$$
    (35)

    for \(t \ge 0\) and \(k \ge 1\). An example for \(k=2\) could be

    $$\begin{aligned} {\dot{p}} = -p + \sin ^{2}\left[ p + \cos \left[ p + \gamma \right] \right] . \end{aligned}$$
    (36)

    The newly defined neuronODE can be used as a building block to construct a more general nmODE:

    $$\begin{aligned} {\dot{y}} = - y + f_{k}\left[ y + f_{k-1} \left[ y + \cdots + f_{1} \left[ y + W^{(1)} x + b \right] \cdots \right] \right] . \end{aligned}$$
    (37)

    We denote this model as nmODE\(^{k}\). If \(k=1\) and \(f_{1}(\cdot ) = \sin ^{2}(\cdot )\), it reduces to the nmODE. (A minimal code sketch of nmODE\(^{k}\) is given after this list.)

  2.

    The neuronODE could be used to replace neurons in traditional feedforward neural networks or recurrent neural networks to build more powerful artificial neural networks. An example of a two-dimensional generalization is

    $$\begin{aligned} \left\{ \begin{array}{llll} {\dot{y}}_{1} = -y_{1} + f\left( {\tilde{w}}_{11} y_{1} + {\tilde{w}}_{12}y_{2} + b_{1} \right) \\ \\ {\dot{y}}_{2} = -y_{2} + f\left( {\tilde{w}}_{21} y_{1} + {\tilde{w}}_{22}y_{2} + b_{2} \right) \\ \\ y_{1}(0) = p_{1}(t)\\ \\ y_{2}(0) = p_{2}(t) \end{array} \right. \end{aligned}$$

    together with

    $$\begin{aligned} \left\{ \begin{array}{ll} {\dot{p}}_{1} = -p_{1} + g\left( p_{1} + W_{1}^{T} \cdot x \right) \\ \\ {\dot{p}}_{2} = -p_{2} + g\left( p_{2} + W_{2}^{T} \cdot x \right) \end{array} \right. \end{aligned}$$

    where x represents the external input, and f and g are activation functions. Associated learning algorithms could be developed accordingly.

  3.

    The nmODE takes data through the external input x, which is not suitable for heterogeneous data. If there are two heterogeneous data sets, both x and y(0) could be used as data inputs.

  4.

    The nmODE could be generalized by stacking nmODEs hierarchically to build networks with stronger representation capabilities. In this paper, we use the \(\sin ^{2}(\cdot )\) function as the activation function for the nmODE, but other activation functions can also be used. When building a network by stacking nmODEs, it is recommended to use different activation functions for different nmODEs.

  5.

    The \(\epsilon\)-LA learning algorithm is presented in discrete-time equations and does not require prior knowledge of the number of network layers. It could be generalized to an online learning algorithm in continuous form by using ODEs.

  6.

    The neuronODE proposed in this paper can be generalized to a spike neuron dynamical model, which can be used to build novel spiking neural network models.

  7.

    The neuronODE and the invODE are small ODEs with simple structures; they could be implemented with electronic circuits to speed up network training.

  8.

    The nmODE possesses the capability to preserve long-term memory and could be leveraged for pre-training neural network models.
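
As referenced in item 1 above, a minimal NumPy sketch of the nmODE\(^{k}\) right-hand side (37) is given below; the forward-Euler integration, step size, and random weights are illustrative assumptions.

```python
import numpy as np

def nmode_k_rhs(y, x, W1, b, activations):
    # Right-hand side of Eq. (37): f_1, ..., f_k are nested from the inside out,
    # each level receiving y plus the output of the previous level.
    inner = W1 @ x + b
    for f in activations:
        inner = f(y + inner)
    return -y + inner

# Example with k = 2 matching Eq. (36): f_1 = cos, f_2 = sin^2.
rng = np.random.default_rng(0)
W1, b = 0.1 * rng.standard_normal((8, 4)), np.zeros(8)
y, x = np.zeros(8), rng.random(4)
fs = [np.cos, lambda u: np.sin(u) ** 2]
for _ in range(1000):                       # crude forward-Euler integration of dy/dt
    y = y + 0.01 * nmode_k_rhs(y, x, W1, b, fs)
```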

Many more research topics that could further generalize the theory presented in this paper can be listed. In addition, the mathematical model of a neuron defined in this paper generalizes the traditional neuron model from a simple additive function to a dynamical system. The proposed neuronODE could be integrated into many well-known deep learning architectures, such as GANs, reinforcement learning networks, and transformers, with the expectation of further improving their performance. It seems that we have opened a door for studying the nmODE. We believe that the theory presented in this paper will find more successful practical applications in the future.