
Physics informed token transformer for solving partial differential equations

Cooper Lorsung, Zijie Li and Amir Barati Farimani

Published 21 February 2024 © 2024 The Author(s). Published by IOP Publishing Ltd
Citation: Cooper Lorsung et al 2024 Mach. Learn.: Sci. Technol. 5 015032. DOI: 10.1088/2632-2153/ad27e3


Abstract

Solving partial differential equations (PDEs) is the core of many fields of science and engineering. While classical approaches are often prohibitively slow, machine learning models often fail to incorporate complete system information. Over the past few years, transformers have had a significant impact on the field of Artificial Intelligence and have seen increased usage in PDE applications. However, despite their success, transformers currently lack integration with physics and reasoning. This study aims to address this issue by introducing Physics Informed Token Transformer (PITT). The purpose of PITT is to incorporate the knowledge of physics by embedding PDEs into the learning process. PITT uses an equation tokenization method to learn an analytically-driven numerical update operator. By tokenizing PDEs and embedding partial derivatives, the transformer models become aware of the underlying knowledge behind physical processes. To demonstrate this, PITT is tested on challenging 1D and 2D PDE operator learning tasks. The results show that PITT outperforms popular neural operator models and has the ability to extract physically relevant information from governing equations.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Partial differential equations (PDEs) are ubiquitous in science and engineering applications. While much progress has been made in developing analytical and computational methods to solve the various equations, no complete analytical theory exists, and computational methods are often prohibitively expensive. Recent work has shown the ability to learn analytical solutions using bilinear residual networks [1] and bilinear neural networks [2–4], where an analytical solution is available. Many machine learning approaches have been proposed to improve simulation speed for calculating various fluid properties, where no such analytical solution is known to exist, including discrete mesh optimization [5–8], super resolution on lower resolution simulations [9, 10], and surrogate modeling [11–14]. While mesh optimization generally allows for using traditional numerical solvers, current methods only improve speed or accuracy by a few percent, or require many simulations during training. Methods for super resolution improve speed, but often struggle to generalize to data resolutions not seen in the training data, with more recent work improving generalization capabilities [15]. Surrogate modeling, on the other hand, has shown a good balance between improved performance and generalization. Neural operator learning architectures, specifically, have also shown promise in combining super resolution capability with surrogate modeling due to their inherent discretization invariance [16]. Recently, the attention mechanism has become a popular choice for operator learning.

The attention mechanism first emerged as a promising model for natural language processing tasks [17–20], especially the scaled dot-product attention [18]. Its success has been extended to other areas, including computer vision [21] and biology [22]. It has also inspired a wide array of scientific applications, in particular PDE modeling [23–30]. Kovachki et al [16] propose a kernel integral interpretation of attention. Cao [23] analyzes the theoretical properties of softmax-free dot-product attention (also known as linear attention (LA)) and further proposes two interpretations of attention, such that it can be viewed as the numerical quadrature of a kernel integral operator or as a Petrov–Galerkin projection. OFormer (Operator Transformer) [24] extends the kernel integral formulation of LA by adding relative positional encoding [31] and using cross attention to flexibly handle discretization, and further proposes a latent marching architecture for solving forward time-dependent problems. Guo et al [29] introduce attention as an instance-based learnable kernel for the direct sampling method and demonstrate superiority on boundary value inverse problems. Learning operators with coupled attention [32] uses attention weights to learn correlations in the output domain and enables sample-efficient training of the model. The general neural operator transformer for operator learning [25] proposes a heterogeneous attention architecture that stacks multiple cross-attention layers and uses a geometric gating mechanism to adaptively aggregate features from query points. Additionally, encoding physics-informed inductive biases has also been of great interest because it allows incorporation of additional system knowledge, making the learning task easier. One strategy to encode the parameters of different instances for parametric PDEs is to add a conditioning module to the model [33, 34]. Another approach is to embed governing equations into the loss function, known as physics-informed neural networks (PINNs) [35]. PINNs have shown promise in physics-based tasks, but have some downsides: namely, they generalize poorly and are difficult to train. Complex training strategies have been developed to account for these deficiencies [36].

While many existing works are successful in their own right, none so far have incorporated entire analytical governing equations. In this work we introduce an equation embedding strategy as well as an attention-based architecture, the Physics Informed Token Transformer (PITT), which performs neural operator learning with a physics-based inductive bias taken directly from the governing equations (the main architecture of PITT is shown in figure 1). More specifically, PITT fuses equation knowledge into neural operator learning by introducing a symbolic transformer on top of the neural operator. We demonstrate through a series of challenging benchmarks that PITT outperforms the popular Fourier neural operator (FNO) [13], DeepONet [14], and OFormer [24], and is able to learn physically relevant information from only the governing equations and system specifications.


Figure 1. The Physics Informed Token Transformer (PITT) uses standard multi head self-attention to learn a latent embedding of the governing equations. This latent embedding is then used to perform numerical updates using linear attention blocks. The equation embedding acts as an analytically-driven correction to an underlying data-driven neural operator.


2. Methods

In this work, we aim to learn the operator $\mathcal{G}_{\theta}: \mathcal{A} \to \mathcal{U}$, where $\mathcal{A}$ is our input function space, $\mathcal{U}$ is our solution function space, and θ are the learnable model parameters. We use a combination of novel equation tokenization and numerical method-like updates to learn model operators $\mathcal{G}_{\theta}$. Our novel equation tokenization and embedding method is described first, followed by a detailed explanation of the numerical update scheme.

2.1. Equation tokenization

In order to utilize the text view of our data, the equations must be tokenized as input to our transformer. Following Lampe and Charton [37], each equation is parsed and split into its constituent symbols. The tokens are given in table 1.

Table 1. Collection of all tokens used in tokenizing governing equations, sampled values, and system parameters.

Category | Available tokens
Equation | $($, $)$, $\partial$, $\Sigma$, $j$, $A_j$, $l_j$, $\omega_j$, $\phi_j$, $\sin$, $t$, $u$, $x$, $y$, $+$, $-$, $*$, $/$
Boundary conditions | Neumann, Dirichlet, None
Numerical | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, $10\,\hat{}\,$, E, e
Delimiter | , (comma), . (decimal point)
Separator | &
2D equations | $\nabla$, $=$, $\Delta$, $\cdot$ (dot product)

Delimiter marks are also added: decimal points for numerical values, commas for separating numerical values, and ampersands for separating equations, sampled values, and boundary conditions. The 1D equations are tokenized so that the governing equation, forcing term, initial condition, sampled values, and output simulation time are all separated, because each component controls distinct properties of the system. The 2D equations are tokenized so that the governing equations remain intact, because some of the governing equations, such as the continuity equation, are self-contained. All of the tokens are then compiled into a single list, and each token in a tokenized equation is represented by the index at which it occurs in this list (a sketch of this mapping is given below).

After each equation has been tokenized, the target time value is appended in tokenized form to the equation, and the total equation is padded with a placeholder token so that each text embedding is the same length. Sampled values are truncated at 15 digits of precision. Data handling code is adapted from PDEBench [38].
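As a rough illustration of the tokenization, indexing, and padding described above, the following minimal Python sketch builds a vocabulary from the tokens in table 1 and maps a symbol sequence to padded indices. The vocabulary ordering, the padding token, and the sample expression are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of the token-to-index mapping described above.
# Vocabulary ordering, the pad token, and the sample expression are
# illustrative assumptions, not the released implementation.
VOCAB = [
    "(", ")", "partial", "Sigma", "j", "A_j", "l_j", "omega_j", "phi_j",
    "sin", "t", "u", "x", "y", "+", "-", "*", "/",          # equation tokens
    "Neumann", "Dirichlet", "None",                          # boundary conditions
    "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
    "10^", "E", "e",                                         # numerical tokens
    ",", ".", "&",                                           # delimiters and separator
    "nabla", "=", "Delta", "dot",                            # 2D equation tokens
    "<pad>",
]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(symbols, max_len=500):
    """Map parsed equation symbols to vocabulary indices and pad to a fixed length."""
    ids = [TOKEN_TO_ID[s] for s in symbols]
    return ids + [TOKEN_TO_ID["<pad>"]] * (max_len - len(ids))

# e.g. the fragment "partial u / partial t & Dirichlet" becomes integer indices
print(tokenize(["partial", "u", "/", "partial", "t", "&", "Dirichlet"], max_len=10))
```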

2.2. PITT

PITT utilizes tokenized equation information to construct an update operator $F_\mathrm{P}$, similar to numerical integration techniques: $x_{t+1} = x_t + F_{\mathrm{P}}(x_t)$. As seen in figure 1, PITT takes in the numerical values and grid spacing, similar to operator learning architectures such as FNO, as well as the tokenized equation and the explicit time differential between simulation steps. The tokenized equation is passed through a multi-head attention block, seen in figure 1(a); in our case we use self-attention (SA) [23]. The tokens are shifted and scaled to be between −1 and 1 upon input, which significantly boosts performance. This latent equation representation is then used to construct the keys and queries for a subsequent multi-head attention block that is used in conjunction with output from the underlying neural operator to construct the update values for the final input frame. The time difference between steps is encoded, allowing the use of arbitrary timesteps. Intuitively, we can view the model as using a neural operator to pass through the previous state as well as to calculate the update, as in numerical methods. The tokenized information is then used to construct an analytically driven update operator that acts as a correction to the neural operator state update. This intuitive understanding of PITT is explored with our 1D benchmarks.

Two different embedding methods are used for the tokenized equations. In the first method, the token attention block first embeds the tokens, T, as keys, queries, and values with learnable weight matrices: $T_1 = W_{T1} T$, $T_2 = W_{T2} T$, $T_3 = W_{T3} T$. While this approach introduces unconventional correlations between tokens, only numerical values are modified in many of our experiments, and so the correlation between numerical values is useful. The second method uses standard fixed positional encoding [18] and lookup table embedding; this standard approach does not introduce unconventional correlations between numerical values. Dropout and multi-head SA are then used to compute a hidden representation: $T_\mathrm{h} = \mathrm{Dropout}(\mathrm{SA}(T_1, T_2, T_3))$. We use a single layer of self-attention for the tokens. The update attention blocks seen in figure 1(b) then use the token attention block output as queries and keys, and the neural operator output as values, embedding them with trainable matrices as $V_0 = W_X X_0$, $T_{h1} = W_{Th1} T_\mathrm{h}$, $T_{h2} = W_{Th2} T_\mathrm{h}$. The output is passed through a fully connected projection layer to match the target output dimension. This update scheme mimics numerical methods and is given in algorithm 1.
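Before turning to the update scheme, here is a compact sketch of the two token-embedding options. The scaling formula, tensor shapes, and module structure are illustrative assumptions rather than the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn

class ScaledValueEmbedding(nn.Module):
    """First method (sketch): token indices scaled to [-1, 1] are projected to
    queries/keys/values with learnable weight matrices."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(1, dim)
        self.w_k = nn.Linear(1, dim)
        self.w_v = nn.Linear(1, dim)

    def forward(self, token_ids, vocab_size):
        # shift and scale integer token ids to [-1, 1] before projecting
        t = (2.0 * token_ids.float() / (vocab_size - 1) - 1.0).unsqueeze(-1)
        return self.w_q(t), self.w_k(t), self.w_v(t)

class LookupEmbedding(nn.Module):
    """Second method (sketch): lookup-table embedding plus fixed sinusoidal
    positional encoding, as in [18]. Assumes an even embedding dimension."""
    def __init__(self, vocab_size, dim, max_len=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer indices into the vocabulary
        return self.embed(token_ids) + self.pe[: token_ids.shape[1]]
```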

Algorithm 1. PITT numerical update scheme.
Require: $V_0$, $T_{h1}$, $T_{h2}$, time $t$, $L$ layers
    for $l = 1,2,\ldots,L$ do
   $X_l \gets \mathrm{Dropout}(\mathrm{LA}(T_{h1}, T_{h2}, V_{l-1}))$
   $t_l \gets \mathrm{MLP}\left(\frac{l\cdot t}{L}\right)$
   $V_l \gets V_{l-1} + \mathrm{MLP}(\left[X_{l}, t_{l}\right])$
    end for

A standard, fully connected multi-layer perceptron is used to calculate the update after concatenating the attention output with an embedding of the fractional timestep. This block uses softmax-free LA [23], computed as $\mathbf{z} = Q\left(\tilde{K}^T\tilde{V}\right)/n$, where $\tilde{K}$ and $\tilde{V}$ denote instance-normalized keys and values. Note that the target time t is incremented fractionally to more closely model numerical method updates. Using multiple update layers and a fractional timestep is useful for long target times, such as in steady-state or fixed-future type experiments. Using a single update layer works well with small timesteps, such as predicting the next simulation step.
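A minimal PyTorch-style sketch of the update loop in algorithm 1 is shown below. Tensor shapes, layer sizes, the dropout rate, and the normalization details are illustrative assumptions, and the token and field sequences are assumed to have been projected to a common length; this is not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Softmax-free linear attention, z = Q (K~^T V~) / n, with instance-normalized
    keys and values. Assumes k and v share the same sequence length."""
    n = k.shape[1]
    k = F.instance_norm(k.transpose(1, 2)).transpose(1, 2)
    v = F.instance_norm(v.transpose(1, 2)).transpose(1, 2)
    return q @ (k.transpose(1, 2) @ v) / n

class PITTUpdate(nn.Module):
    """Sketch of the numerical update scheme in algorithm 1."""
    def __init__(self, dim, n_layers):
        super().__init__()
        self.n_layers = n_layers
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))
        self.update_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.dropout = nn.Dropout(0.1)

    def forward(self, v0, t_h1, t_h2, t):
        # v0: embedded neural-operator output; t_h1, t_h2: latent equation
        # embeddings used as queries and keys; t: target time difference
        v = v0
        for l in range(1, self.n_layers + 1):
            x = self.dropout(linear_attention(t_h1, t_h2, v))
            # embed the fractional timestep l*t/L and concatenate with the attention output
            frac_t = torch.full_like(v[..., :1], l * t / self.n_layers)
            v = v + self.update_mlp(torch.cat([x, self.time_mlp(frac_t)], dim=-1))
        return v

# toy usage with batch size 2, 64 positions, embedding dimension 32
upd = PITTUpdate(dim=32, n_layers=4)
out = upd(torch.randn(2, 64, 32), torch.randn(2, 64, 32), torch.randn(2, 64, 32), t=0.25)
```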

3. Data generation

In order to properly assess performance, multiple data sets that represent distinct challenges are used. In the 1D case, we have the Heat equation, which is a linear parabolic equation, the KdV equation, which is a nonlinear hyperbolic equation, and Burgers' equation. In 2D, we have the Navier–Stokes equations and the steady-state Poisson equation. Many parameters and forcing functions are sampled in order to generate large data sets.

3.1. Heat, Burgers', KdV equations

Following the setup from Brandstetter et al [39], we generate the 1D data by sampling a large number of parameters, which allows us to generate many different initial conditions and forcing terms for each equation. In our case, J = 5 and L = 16,

$\left[\partial_t u + \partial_x\left(\alpha u^2 - \beta\, \partial_x u + \gamma\, \partial_{xx} u\right)\right](t, x) = \delta(t, x), \qquad (1)$

where the forcing term is given by $\delta\left(t, x\right) = \sum_{j = 1}^J A_j \sin\left(\omega_j t + (2\pi l_j x)/L + \phi_j\right)$, and the initial condition is the forcing term at time t = 0: $u\left(0, x\right) = \delta\left(0, x\right)$. The parameters in the forcing term are sampled as follows: $A_j \sim \mathcal{U}(-0.5, 0.5)$, $\omega_j \sim \mathcal{U}(-0.4, 0.4)$, $l_j \sim \{1, 2, 3\}$, $\phi_j \sim \mathcal{U}(0, 2\pi)$. The parameters $(\alpha, \beta, \gamma)$ of equation (1) can be set to recover different, famous equations: when α = 0 and γ = 0 we have the Heat equation, when only γ = 0 we have Burgers' equation, and when β = 0 we have the KdV equation. Each equation has at least one parameter that we modify in order to generate large data sets.
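A minimal NumPy sketch of this forcing-term sampling is given below; the grid resolution, random seed, and function names are illustrative choices, not the authors' data generation code.

```python
import numpy as np

rng = np.random.default_rng(0)
J, L = 5, 16  # number of modes and domain length, as stated above

# sample the forcing-term parameters as described in the text
A = rng.uniform(-0.5, 0.5, J)
omega = rng.uniform(-0.4, 0.4, J)
l = rng.choice([1, 2, 3], J)
phi = rng.uniform(0.0, 2.0 * np.pi, J)

def delta(t, x):
    """Forcing term delta(t, x) = sum_j A_j sin(omega_j t + 2 pi l_j x / L + phi_j)."""
    x = np.asarray(x)[..., None]  # broadcast the spatial grid over the J modes
    return np.sum(A * np.sin(omega * t + 2.0 * np.pi * l * x / L + phi), axis=-1)

x_grid = np.linspace(0.0, L, 100)
u0 = delta(0.0, x_grid)  # initial condition u(0, x) = delta(0, x)
```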

For the Heat and Burgers' equations, we used diffusion values of $\beta \in \left\{0.01, 0.05, 0.1, 0.2, 0.5, 1 \right\}$. For the Heat equation, we generated 10 000 simulations for each β value, for 60 000 total samples. For Burgers' equation, we additionally used advection values of $\alpha \in \left\{0.01, 0.05, 0.1, 0.2, 0.5, 1 \right\}$, and generated 2500 simulations for each combination of values, for 90 000 total simulations. For the KdV equation, we used an advection value of α = 0.01, with $\gamma \in \left\{2,4,6,8,10,12\right\}$, and generated 2500 simulations for each parameter combination, for 15 000 total simulations. The tokenized 1D equations are padded to a length of 500; the tokenizations are long here because of the many sampled values.

3.2. Navier–Stokes equation

In 2D, we use the incompressible, viscous Navier–Stokes equations in vorticity form, given in equation (2); data generation code was adapted from Li et al [13],

$\partial_t w(x, t) + u(x, t) \cdot \nabla w(x, t) = \nu\, \Delta w(x, t) + f(x), \quad \nabla \cdot u(x, t) = 0, \quad w(x, 0) = w_0(x), \qquad (2)$

where $u(x,t)$ is the velocity field, $w(x,t) = \nabla \times u(x,t)$ is the vorticity, $w_0(x)$ is the initial vorticity, f(x) is the forcing term, and ν is the viscosity. We use viscosities $\nu\in\{10^{-9}, 2\cdot 10^{-9}, 3\cdot 10^{-9}, \ldots, 10^{-8}, 2\cdot 10^{-8}, \ldots, 10^{-5} \}$ and forcing term amplitudes $A \in \left\{0.001, 0.002, 0.003, \ldots, 0.01\right\}$, for 370 total parameter combinations. 120 frames are saved over 30 s of simulation time. The initial vorticity is sampled from a Gaussian random field. For each combination of ν and A, one random initialization was used for the next-step and rollout experiments, and five random initializations were used for the fixed-future experiments. The tokenized equations are padded to a length of 100. Simulations are run on a $1\times1$ unit cell with periodic boundary conditions. The space is discretized with a $256\times256$ grid for numerical stability, which is evenly downsampled to $64\times64$ during training and testing.
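For concreteness, the 370 viscosity and amplitude combinations can be enumerated with a short sketch (variable names are illustrative):

```python
import itertools

# viscosities 1e-9, 2e-9, ..., 9e-9, 1e-8, ..., 9e-6, plus 1e-5: 37 values in total
viscosities = [m * 10.0 ** e for e in range(-9, -5) for m in range(1, 10)] + [1e-5]
amplitudes = [a / 1000 for a in range(1, 11)]  # 0.001, 0.002, ..., 0.01

combos = list(itertools.product(viscosities, amplitudes))
print(len(viscosities), len(amplitudes), len(combos))  # 37 10 370
```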

3.3. Steady-state Poisson equation

The last benchmark we perform is on the steady-state Poisson equation given in equation (3),

$\Delta u(x, y) = g(x, y), \qquad (3)$

where $u(x,y)$ is the electric potential, $-\nabla u(x,y)$ is the electric field, and $g(x,y)$ contains boundary condition and charge information. The simulation cell is discretized with 100 points in the horizontal direction and 60 points in the vertical direction. Capacitor plates are added with various widths, x and y positions, and charges. An example of input and target electric field magnitude is given in figure 2.


Figure 2. Example setup for the 2D Poisson equation. (a) Input boundary conditions and geometry. (b) Target electric field output.


This represents a substantially different task compared to the previous benchmarks. Due to the large difference between initial and final states, models must learn to extract significantly more information from the provided input. This benchmark also allows for easily testing how well models are able to learn Neumann boundary conditions and generalize to different combinations of boundary conditions. In two dimensions, we have four different boundaries on the simulation cell. Each boundary takes either Dirichlet or Neumann boundary conditions, allowing for 16 different combinations. In this case, since steady state is reached at infinite time, we pass the same time of 1 for each sample into the explicitly time-dependent models. The tokenized equation and system parameters are padded to a length of 100. Code is adapted from Zaman [40] for this case.
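As a small illustration, the 16 boundary-condition combinations can be enumerated as follows (the edge names are hypothetical labels, not identifiers from the data set):

```python
import itertools

edges = ["left", "right", "top", "bottom"]
bc_types = ["Dirichlet", "Neumann"]

# every assignment of a boundary-condition type to the four simulation-cell boundaries
combinations = list(itertools.product(bc_types, repeat=len(edges)))
print(len(combinations))  # 16
print(dict(zip(edges, combinations[0])))  # e.g. the all-Dirichlet case
```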

4. Results

We now compare PITT with both embedding methods against FNO, DeepONet, and OFormer on our various data sets. $\dagger$ indicates our novel embedding method and * indicates standard embedding. All experiments were run with five random splits of the data. Reported results and shaded regions in plots are the mean and one standard deviation of each result, respectively. Experiments were run with a 60-20-20 train-validation-test split. Early stopping is also used, where the epoch with the lowest validation loss is used for evaluation. Note that the parameter count represents the total number of parameters; in some cases PITT variants use a smaller underlying neural operator and have a lower parameter count than the baseline model. Hyperparameters for each experiment are given in the appendix.

4.1. 1D next-step prediction

Our 1D models are trained by using 10 frames of the simulation to predict the next frame. The data is generated for four seconds, with 100 timesteps and 100 grid points between 0 and 16; the final time is T = 4 s. Specifically, the task is to learn the operator $\mathcal{G}_{\theta}:a(\cdot,t_i)|_{i\in[n-9, n]} \to u(\cdot,t_j)|_{j = n+1}$ where $n \in [10, 100]$. A total of 1000 sampled equations were used in the training set, with 90 frames for each equation. Data was split such that samples from the same equation and forcing term did not appear in both the training and test sets. Results are in table 2. We see that PITT significantly outperforms all of the baseline models across all equations for both embedding methods. Although the lower error often resulted in unstable autoregressive rollout, PITT variants also outperformed their baseline counterparts when simply trained to minimum error. Additionally, PITT is able to improve performance with fewer parameters than FNO, and with a comparable number of parameters to both OFormer and DeepONet. Notably, PITT uses a single attention head and a single multi-head attention block for the multi-head and LA blocks in this experiment.
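The next-step pairs can be constructed from each stored trajectory with a simple sliding window, sketched below; the array shapes and zero-based indexing are illustrative, not the released data pipeline.

```python
import numpy as np

def next_step_pairs(trajectory, history=10):
    """Build (10-frame input window, next frame) pairs from one trajectory.

    trajectory: array of shape (n_timesteps, n_points).
    """
    inputs, targets = [], []
    for n in range(history, trajectory.shape[0]):
        inputs.append(trajectory[n - history:n])  # the 10 frames preceding index n
        targets.append(trajectory[n])             # the frame to predict
    return np.stack(inputs), np.stack(targets)

traj = np.random.rand(100, 100)   # 100 timesteps, 100 grid points
x, y = next_step_pairs(traj)
print(x.shape, y.shape)           # (90, 10, 100) (90, 100)
```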

Table 2. Mean absolute error (MAE) $\times 10^{-3}$ for 1D benchmarks. Bold indicates best performance.

Model | Parameter count | Heat | Burgers' | KdV
FNO | 2.4 M | 4.80 ± 0.18 | 8.22 ± 0.37 | 11.28 ± 0.43
PITT FNO$^\dagger$ (Ours) | 0.2 M | 0.38 ± 0.02 | 0.23 ± 0.06 | 8.77 ± 0.20
PITT FNO$^{*}$ (Ours) | 0.4 M | 0.38 ± 0.01 | 0.66 ± 0.07 | 8.68 ± 0.21
OFormer | 3.0 M | 1.44 ± 0.17 | 4.32 ± 0.35 | 4.36 ± 0.21
PITT OFormer$^\dagger$ (Ours) | 0.4 M | 0.06 ± 0.03 | 0.22 ± 0.02 | 0.46 ± 0.03
PITT OFormer$^{*}$ (Ours) | 0.5 M | 0.23 ± 0.13 | 0.24 ± 0.04 | 0.47 ± 0.02
DeepONet | 0.2 M | 0.68 ± 0.06 | 2.14 ± 0.17 | 9.22 ± 0.31
PITT DeepONet$^\dagger$ (Ours) | 0.2 M | 0.03 ± 0.01 | 0.24 ± 0.05 | 1.78 ± 0.10
PITT DeepONet$^{*}$ (Ours) | 0.3 M | 0.02 ± 0.01 | 0.21 ± 0.06 | 8.25 ± 0.28

The effect of the neural operator and token transformer modules in PITT can be easily decomposed and analyzed by returning the passthrough and update separately, instead of their sum (figure 1(c)). Using the pretrained PITT FNO from above, a sample is predicted for the 1D Heat equation. We see the decomposition in figure 3 and figure 6 in the appendix.


Figure 3. PITT FNO prediction decomposition for 1D Heat equation. Left: the FNO module of PITT predicts a large change to the final frame of input data. Middle: the numerical update block corrects the FNO output. Right: the combination of FNO and numerical update block output very accurately predicts the next step.


Interestingly, the underlying FNO has learned to overestimate the passthrough of the data in both cases. The token attention and numerical update modules have learned a correction to the FNO output, as expected.

4.2. 1D fixed-future prediction

In this 1D benchmark, each model is trained on all three equations simultaneously, and performance is compared against training on single equations. Results are shown in table 3. The first 10 frames of each equation are used as input to predict the last frame of each simulation. In total, 5000 samples from each equation were used for both single-equation and multiple-equation training. Models trained on the combined data sets are then tested on data from each equation individually. For PITT FNO and PITT OFormer, we see that training on the combined equations using our novel embedding method achieves the best performance across all data sets. Additionally, for PITT FNO and PITT DeepONet, training using our standard embedding method achieves the best performance across all data sets. This shows PITT is able to improve neural operator generalization across different systems. Interestingly, we also see improvement in FNO and OFormer when training using the combined data sets.

Table 3. Mean absolute error (MAE) $\times 10^{-3}$ for 1D fixed-future benchmarks. Bold indicates best performance.

Model | Parameter count | Heat | Burgers' | KdV
FNO | 2.4 M | 0.439 ± 0.005 | 0.528 ± 0.019 | 0.404 ± 0.004
FNO Multi | 2.4 M | 0.239 ± 0.002 | 0.285 ± 0.001 | 0.329 ± 0.002
PITT FNO$^\dagger$ (Ours) | 0.2 M | 0.177 ± 0.002 | 0.211 ± 0.005 | 0.220 ± 0.007
PITT FNO$^\dagger$ Multi (Ours) | 0.2 M | 0.120 ± 0.002 | 0.133 ± 0.002 | 0.165 ± 0.005
PITT FNO$^{*}$ (Ours) | 0.3 M | 0.158 ± 0.003 | 0.205 ± 0.003 | 0.194 ± 0.005
PITT FNO$^{*}$ Multi (Ours) | 0.3 M | 0.124 ± 0.005 | 0.135 ± 0.019 | 0.166 ± 0.004
OFormer | 3.0 M | 0.154 ± 0.003 | 0.192 ± 0.004 | 0.244 ± 0.004
OFormer Multi | 3.0 M | 0.150 ± 0.003 | 0.166 ± 0.002 | 0.210 ± 0.002
PITT OFormer$^\dagger$ (Ours) | 0.2 M | 0.202 ± 0.008 | 0.222 ± 0.007 | 0.233 ± 0.004
PITT OFormer$^\dagger$ Multi (Ours) | 0.2 M | 0.142 ± 0.004 | 0.160 ± 0.005 | 0.191 ± 0.006
PITT OFormer$^{*}$ (Ours) | 0.3 M | 0.201 ± 0.006 | 0.228 ± 0.008 | 0.232 ± 0.006
PITT OFormer$^{*}$ Multi (Ours) | 0.3 M | 0.154 ± 0.003 | 0.170 ± 0.004 | 0.200 ± 0.004
DeepONet | 0.2 M | 0.240 ± 0.003 | 0.420 ± 0.008 | 0.519 ± 0.008
DeepONet Multi | 0.2 M | 0.608 ± 0.009 | 0.609 ± 0.006 | 0.749 ± 0.014
PITT DeepONet$^\dagger$ (Ours) | 0.3 M | 0.185 ± 0.002 | 0.355 ± 0.005 | 0.488 ± 0.007
PITT DeepONet$^\dagger$ Multi (Ours) | 0.3 M | 0.214 ± 0.009 | 0.330 ± 0.006 | 0.488 ± 0.007
PITT DeepONet$^{*}$ (Ours) | 0.4 M | 0.195 ± 0.006 | 0.320 ± 0.017 | 0.482 ± 0.009
PITT DeepONet$^{*}$ Multi (Ours) | 0.4 M | 0.187 ± 0.003 | 0.270 ± 0.008 | 0.481 ± 0.008

4.3. 2D benchmarks

The 2D benchmarks provide a wider array of settings and tests for each model. In the next-step training and rollout testing experiment, we used 200 equations, a single random initialization for each equation, and the entire 121-step trajectory for the data set. The final time is T = 30 s. Similar to the 1D case, we are learning the operator $\mathcal{G}_{\theta}:a(\cdot,t_i)|_{i = n} \to u(\cdot,t_j)|_{j = n+1}$ where $n \in [0, 119]$. Results are given in table 4. This benchmark is especially challenging for two reasons. First, there are viscosity and forcing term amplitude combinations in the test set that the model has not trained on. Second, rollout is done starting from only the initial condition, and models are trained to predict the next step using a single snapshot. This limits the time evolution information available to models during training. Although the baseline models perform comparably to PITT variants in terms of error, we note that PITT shows improved accuracy for all variants, and in many cases lower error led to unstable rollout, as in the 1D cases. Despite this, PITT has much better rollout error accumulation, seen in table 6. Further analysis of PITT FNO attention maps from this experiment is given in the appendix in figures 7(a), (b), 8(a) and (b). The attention maps show PITT FNO is able to extract physically relevant information from the governing equations.

Table 4. Mean absolute error (MAE) ${\times} 10^{-3}$ for 2D benchmarks. Bold indicates best performance. Although PITT variants have overlapping error bars with the base model in the Navier–Stokes benchmark, the PITT variant had lower error on all but one random split of the data for PITT FNO, and every random split for PITT DeepONet.

Model | Parameter count | Navier–Stokes | Poisson
FNO | 2.1 M/8.5 M | 5.24 ± 0.30 | 9.79 ± 0.12
PITT FNO$^\dagger$ (Ours) | 1.0 M/4.2 M | 5.07 ± 0.30 | 1.15 ± 0.17
PITT FNO$^{*}$ (Ours) | 1.7 M/4.0 M | 5.18 ± 0.29 | 0.85 ± 0.05
OFormer | 1.0 M/0.2 M | 10.07 ± 0.94 | 9.98 ± 0.11
PITT OFormer$^\dagger$ (Ours) | 0.9 M/2.0 M | 14.63 ± 3.42 | 0.69 ± 0.38
PITT OFormer$^{*}$ (Ours) | 1.2 M/2.2 M | 20.54 ± 0.94 | 0.33 ± 0.02
DeepONet | 0.3 M/0.4 M | 7.06 ± 0.32 | 25.20 ± 0.22
PITT DeepONet$^\dagger$ (Ours) | 1.7 M/3.2 M | 7.01 ± 0.31 | 1.50 ± 1.40
PITT DeepONet$^{*}$ (Ours) | 1.5 M/2.5 M | 7.01 ± 0.32 | 0.53 ± 0.04

For the steady-state Poisson equation, for a given set of boundary conditions we learn the operator $\mathcal{G}_{\theta}:a \to u$, with boundary conditions $u(x) = g(x)\ \forall x \in \partial \Omega_0$ and $\hat{\mathbf{n}} \cdot \nabla u(x) = f(x)\ \forall x \in \partial \Omega_1$. The primary challenge here is in learning the effect of the boundary conditions. Dirichlet boundary conditions are constant, only requiring passing through initial values at the boundary for accurate prediction, but Neumann boundary conditions lead to boundary values that must be learned from the system. Standard neural operators do not offer a way to easily encode this information without modifying the initial conditions, while PITT uses a text encoding of each boundary condition, as outlined in section 2.1. PITT is able to learn boundary conditions through the text embedding and performs approximately an order of magnitude better, with the standard embedding improving over our novel embedding by an average of over 50%. 5000 samples were used during training with random data splitting. All combinations of boundary conditions appear in both the train and test sets. Results are given in table 4. Prediction error plots for our models on this data set are given in the appendix in figures 18 and 19.

Lastly, similar to experiments in both [13] and [24], we use our models to predict a fixed, future timestep from the first 10 s of data. Including the initial condition, we use 41 frames to predict a single, future frame. In this case, we predict the system state at 20 and 30 s in two separate experiments, that is, we learn the operator $\mathcal{G}_{\theta}:u(\cdot,t)|_{t\in[0, 10]} \to u(\cdot,t)|_{t = 20,30}$. We shuffle the data such that forcing term amplitude and viscosity combinations appear in both the training and test sets, but initial conditions do not appear in both. Our setup is more difficult than in previous works because we are using multiple forcing term amplitudes and viscosities. The results are given in table 5, where we see PITT variants outperform the baseline model for both embedding methods. Example predictions are given in the appendix in figures 16 and 17.

Table 5. Mean absolute error (MAE) ${\times} 10^{-2}$ for 2D fixed-future benchmarks. Bold indicates best performance.

Model | Parameter count | T = 20 | T = 30
FNO | 0.3 M | 4.44 ± 0.05 | 8.11 ± 0.08
PITT FNO$^\dagger$ (Ours) | 0.3 M | 4.06 ± 0.13 | 7.26 ± 0.16
PITT FNO$^{*}$ (Ours) | 1.6 M | 4.02 ± 0.03 | 7.46 ± 0.09
OFormer | 0.3 M | 5.91 ± 0.16 | 8.83 ± 0.15
PITT OFormer$^\dagger$ (Ours) | 0.5 M | 5.64 ± 0.16 | 8.38 ± 0.07
PITT OFormer$^{*}$ (Ours) | 1.6 M | 5.75 ± 0.20 | 8.54 ± 0.09
DeepONet | 0.3 M | 10.28 ± 0.11 | 14.69 ± 0.19
PITT DeepONet$^\dagger$ (Ours) | 0.5 M | 8.96 ± 0.10 | 12.35 ± 0.09
PITT DeepONet$^{*}$ (Ours) | 1.1 M | 8.52 ± 0.08 | 11.33 ± 0.12

4.4. Rollout

An important test of viability for operator learning models as surrogate models is how error accumulates over time. In real-world predictions, we often must autoregressively predict into the future, where training data is not available. OFormer results are not presented here due to instability in autoregressive rollout. We see in table 6 that our PITT variants show significantly lower final error at large rollout times for all time-dependent data sets, with the exception of PITT DeepONet using our novel embedding method when compared to the baseline model on KdV. Error accumulation is shown in figure 4 for standard embedding, where PITT shows both lower final error and improved total error accumulation. The novel embedding error accumulation plot is given in the appendix in figure 9. In these experiments, we used the models trained in the next-step fashion from our 1D and 2D next-step benchmarks. We start with the first 10 frames from each trajectory in the test set for the 1D data sets, and with only the initial condition for the 2D test data set, and autoregressively predict the entire rollout.
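A minimal sketch of the autoregressive rollout loop for the 1D models is shown below; the model interface (a 10-frame window mapped to the next frame) is an assumption for illustration, not the released evaluation code.

```python
import torch

@torch.no_grad()
def rollout_1d(model, initial_frames, n_steps):
    """Autoregressively predict n_steps frames starting from the first 10 frames.

    initial_frames: tensor of shape (batch, 10, n_points); `model` is assumed to
    map a 10-frame window to the next frame of shape (batch, n_points).
    """
    window = initial_frames.clone()
    predictions = []
    for _ in range(n_steps):
        next_frame = model(window)                 # predict one step ahead
        predictions.append(next_frame)
        # slide the window: drop the oldest frame, append the new prediction
        window = torch.cat([window[:, 1:], next_frame.unsqueeze(1)], dim=1)
    return torch.stack(predictions, dim=1)         # (batch, n_steps, n_points)
```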


Figure 4. Error accumulation for rollout experiments using standard embedding.


Table 6. Final mean absolute error (MAE) for rollout experiments. Bold indicates best performance when comparing base models to their PITT version. OFormer is omitted due to instability during rollout. Standard embedding PITT DeepONet is bolded here because it outperforms DeepONet for every random split of the data.

Model | Parameter count | 1D Heat | 1D Burgers' | 1D KdV | 2D NS
FNO | 2.4 M/2.1 M | 0.810 ± 0.042 | 1.063 ± 0.133 | 1.718 ± 0.114 | 0.125 ± 0.006
PITT FNO$^\dagger$ (Ours) | 0.2 M/1.0 M | 0.483 ± 0.015 | 0.351 ± 0.19 | 0.555 ± 0.050 | 0.065 ± 0.003
PITT FNO$^{*}$ (Ours) | 0.2 M/1.0 M | 0.511 ± 0.013 | 0.570 ± 0.020 | 0.529 ± 0.008 | 0.073 ± 0.004
OFormer | N/A | N/A | N/A | N/A | N/A
PITT OFormer$^\dagger$ (Ours) | N/A | N/A | N/A | N/A | N/A
PITT OFormer$^{*}$ (Ours) | N/A | N/A | N/A | N/A | N/A
DeepONet | 0.2 M/0.8 M | 0.562 ± 0.011 | 0.607 ± 0.0121 | 0.533 ± 0.010 | 0.179 ± 0.005
PITT DeepONet$^\dagger$ (Ours) | 0.4 M/1.7 M | 0.404 ± 0.100 | 0.536 ± 0.105 | 0.699 ± 0.048 | 0.154 ± 0.008
PITT DeepONet$^{*}$ (Ours) | 0.3 M/1.7 M | 0.217 ± 0.0138 | 0.484 ± 0.066 | 0.526 ± 0.011 | 0.157 ± 0.012

A visualization of 2D rollout for our novel embedding method is given in figure 5. At long rollout times, especially T = 25 s and T = 30 s, PITT FNO is able to accurately predict the large-scale features, along with some of the smaller-scale features. FNO, on the other hand, has begun to predict noticeably different features from the ground truth, and does not match small-scale features well. Similarly, PITT DeepONet is able to approximately match large-scale features in magnitude (lighter color), whereas DeepONet noticeably differs even at large scales. 1D rollout comparison plots are given in the appendix.


Figure 5. Rollout results for 2D Navier Stokes using our novel embedding method.


5. Conclusion

This work introduces a novel transformer-based architecture, PITT, that learns analytically-driven numerical update operators from governing equations. A novel equation embedding method is developed and compared against standard positional encoding and embedding. PITT is able to learn physically relevant information from tokenized equations and outperforms baseline neural operators on a wide variety of challenging 1D and 2D benchmarks. We have also found that our baseline models and their PITT variants, with both embedding strategies, have lower time-to-solution than the numerical methods used for data generation. Details of the timing experiments and results are given in the appendix in tables 27 and 28 for our 1D next-step and fixed-future experiments, respectively. Future work includes benchmarking on 3D systems and developing more effective tokenization and more efficient embedding, since our novel method uses a naive approach that introduces unconventional correlations between tokens, while standard positional encoding and embedding does not exploit the useful correlations between numerical tokens. Additionally, the current experiments have redundancy in the equations, as only system parameters such as viscosity vary; testing on multiple systems simultaneously would serve as a test of PITT's generalization capability. In addition, other works [13] have used recurrent rollout prediction as well as training on rollout trajectories, which we have not currently evaluated. These strategies could be employed to help stabilize rollout predictions.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1953222.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/BaratiLab/PhysicsInformedTokenTransformer/tree/master [41].

Supplementary material

Supplementary material contains training hyperparameters, analysis of PITT attention maps, and further exploration of results.

Conflict of interest

The authors have no conflicts to disclose.

Author contributions

Cooper Lorsung: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Data Curation, Writing—Original Draft. Zijie Li: Conceptualization, Writing—Original Draft, Writing—Review & Editing. Amir Barati Farimani: Conceptualization, Resources, Writing—Review & Editing, Supervision, Funding acquisition.
