Continuous 3D Myocardial Motion Tracking via Echocardiography

Chengkang Shen    Hao Zhu    You Zhou    Yu Liu    Si Yi    Lili Dong    Weipeng Zhao    David J. Brady    Xun Cao    Zhan Ma    and Yi Lin This work was supported by the NSFC (No. 62331005, 62071219, 62025108) and the Science and Technology Commission of Shanghai Municipality (STCSM) (No. 20Y11909800).Chengkang Shen, Hao Zhu, Si Yi, Xun Cao and Zhan Ma are with the School of Electronic Science and Engineering, Nanjing University, Nanjing, 210023, China. You Zhou is with the Medical School, Nanjing University, Nanjing, 210093, China.Yu Liu, Lili Dong and Weipeng Zhao are with the Department of Echocardiography of Zhongshan Hospital, Fudan University, Shanghai, 200032, China.David J. Brady is with the College of Optical Sciences, University of Arizona, Tucson, AZ 85721, USA.Yi Lin is with the Department of Cardiovascular Surgery of Zhongshan Hospital, Fudan University, Shanghai, 200032, China.Chengkang Shen, Hao Zhu and You Zhou contributed equally to this work.Corresponding authors: Yi Lin, Zhan Ma, and Xun Cao. (e-mail: [email protected]; [email protected]; [email protected])This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Abstract

Myocardial motion tracking stands as an essential clinical tool in the prevention and detection of cardiovascular diseases (CVDs), the foremost cause of death globally. However, current techniques suffer from incomplete and inaccurate motion estimation of the myocardium in both spatial and temporal dimensions, hindering the early identification of myocardial dysfunction. To address these challenges, this paper introduces the Neural Cardiac Motion Field (NeuralCMF). NeuralCMF leverages implicit neural representation (INR) to model the 3D structure and the comprehensive 6D forward/backward motion of the heart. This method surpasses pixel-wise limitations by offering the capability to continuously query the precise shape and motion of the myocardium at any specific point throughout the cardiac cycle, enhancing the detailed analysis of cardiac dynamics beyond traditional speckle tracking. Notably, NeuralCMF operates without the need for paired datasets, and its optimization is self-supervised through the physics knowledge priors in both space and time dimensions, ensuring compatibility with both 2D and 3D echocardiogram video inputs. Experimental validations across three representative datasets support the robustness and innovative nature of the NeuralCMF, marking significant advantages over existing state-of-the-art methods in cardiac imaging and motion tracking. Code is available at: https://njuvision.github.io/NeuralCMF.

{IEEEkeywords}

Myocardial motion tracking, Echocardiogram videos, Implicit neural representation, Cardiac imaging

Refer to caption
Figure 1: We introduce a self-supervised method for estimating 3D motion trajectories throughout the cardiac cycle at any given point all at once. The figure illustrates the motion estimation of the 3D heart from end-diastole (ED) to end-systole (ES), representing the heart’s contraction phase. The estimated motion trajectories are displayed on the right with each point’s trajectory depicted in a different color for clarity. Though we have chosen to show only sparse trajectories of the left ventricular (LV) myocardium, our method has the ability to compute motion for all points across the entire 3D heart.

1 Introduction

\IEEEPARstart

Cardiovascular diseases (CVDs) stand as the primary global cause of death [1]. While CVDs encompass a spectrum of pathological disorders, including coronary atherosclerosis, valvular insufficiency, and inflammatory myocarditis, each impairs the heart’s motion and pumping function [2]. Therefore, accurate tracking of myocardial motion is pivotal for detecting CVDs. For instance, motion tracking is critical for early intervention in cardiotoxicity by detecting subclinical cardiac dysfunction post-cancer treatment [3], identifying regional contractility inconsistencies in ischemic diseases for coronary artery disease detection [4], and assessing risks and treatment success in valvular heart diseases [5, 6]. It also aids in the early detection of systolic and diastolic dysfunction in cardiomyopathies, forecasting cardiovascular events and patient outcomes [7, 8], optimizing cardiac resynchronization therapy [9], understanding hemodynamics in pulmonary hypertension and congenital heart diseases [10, 11], and managing atrial fibrillation [12, 13]. Overall, cardiac motion tracking is indispensable across a wide range of heart diseases for its diagnostic, prognostic, and therapeutic guidance capabilities. Currently, two primary modalities for myocardial motion tracking exist, namely echocardiography and cardiac magnetic resonance (CMR). Considering factors such as economics, prevalence, operational complexity, and temporal resolution, echocardiography is favored over CMR [14, 15, 16]. Consequently, developing and implementing an accurate and robust 3D myocardial motion tracking method based on echocardiography is essential in modern cardiology practice.

While numerous methods have been proposed to track myocardial motion in echocardiogram video, several challenges persist in existing solutions. In speckle-based tracking approaches [17, 18, 19, 16] which aim to identify consistent intensity patterns across successive frames by optimizing their similarities, capturing motion trajectories over extended temporal frames remains a formidable task due to the inherent limitation of estimating motions frame by frame. Surface-based tracking techniques [20, 21, 22, 23, 24, 25], on the other hand, concentrate on capturing similar curvature or shape characteristics of the chamber’s boundaries across sequential frames. However, the necessity for interpolating motion vectors for points within the boundary introduces inaccuracies, particularly in myocardium regions that are not fully infarcted. For instance, in cases of cardiac sarcoidosis [26, 27], where motion across the entire myocardium lacks uniformity. More recently, deep learning-based tracking methods have achieved remarkable success in analyzing deformations between images [28, 29, 30]. However, they face limitations in predicting the complete cardiac cycle’s motion and suffer from the lack of large-scale paired datasets. In summary, current myocardial motion tracking techniques with echocardiogram video input encounter three primary challenges: (1) accurate and complete estimation of myocardial motion throughout the entire cardiac cycle, (2) tracking any points in the 3D heart without resorting to interpolation, and (3) operating without the need for additional paired training datasets.

In this manuscript, we present Neural Cardiac Motion Field (NeuralCMF), an innovative self-supervised technique designed to simultaneously model the 3D cardiac structure and capture the 6D forward/backward motions of the myocardium. In contrast to prior methods that primarily focus on comparing neighboring frames pairwise, NeuralCMF takes a unique approach by modeling the global motion of all pixels throughout the entire video as a continuous implicit neural function, which has a coordinate input and the corresponding shape/motion output. Consequently, this enables us to precisely query the motion trajectory for any pixel spanning the complete cardiac cycle, as demonstrated in Fig. 1. Rather than depending on an extensive paired dataset for supervision, NeuralCMF employs a physics-informed self-supervision mechanism. This mechanism involves comparing the loss between the network output after physical imaging operators and the input videos during optimization (see Fig. 2). As a result, NeuralCMF effectively overcomes the dataset dependency commonly found in current deep learning-based methods, eliminating the limitations posed by data bias and enhancing its applicability and generalizability across a broad spectrum of ultrasound devices and patient populations. Furthermore, as the physical imaging operators can be modified according to 2D echocardiogram (2DE) or 3D echocardiogram (3DE) videos, NeuralCMF can be seamlessly integrated with both multi-view 2DE and 3DE video inputs. To the best of our knowledge, this universality for estimating 3D myocardial motion using echocardiogram videos from both 2D and 3D sources is a first in the community. The quantitative and qualitative experiments conducted on three representative datasets, spanning both 2DE and 3DE videos, show that the performance of our approach consistently outperforms the existing state-of-the-art methods [31, 30]. We believe the methodology presented here may also be applicable across more imaging modalities beyond ultrasonic imaging for analyzing the motion of organs other than the heart.

The contributions of this paper are summarized below:

  • We present NeuralCMF, a novel self-supervised method that simultaneously models the 3D structure of the heart and the 6D forward/backward motion of the myocardium. This method employs a personalized training strategy, where the network is specifically trained using the heart video data of an individual. With this approach, NeuralCMF is capable of estimating the 3D motion of any specific point in the heart throughout the cardiac cycle.

  • The capability of NeuralCMF to estimate dense 3D motion from multi-view 2DE images represents both innovation and practicality, rendering it particularly valuable in diverse critical care environments such as emergency departments and preoperative clinics.

  • To evaluate the performance and validity of our method, we have established unique multi-view 2DE and 3DE video datasets for cardiac imaging comprising data from 127 volunteers, covering both healthy individuals and patients with coronary heart disease (CHD).

  • NeuralCMF provides state-of-the-art performance in quantitative and qualitative assessments tested on three separate datasets: the open-source STRAUS [32], along with our specially constructed multi-view 2DE and 3DE video datasets.

2 Related Work

2.1 Cardiac Motion Tracking Methods

Motion tracking is crucial for precise strain calculations, and its objective is to map the displacement between consecutive images in a sequence. This section reviews existing motion tracking algorithms.

2.1.1 Intensity-based Methods

Speckle-tracking methods identify consistent intensity patterns across several consecutive time frames by maximizing the similarity of these patterns [18, 19]. These algorithms operate under the premise that there is only a small displacement motion between frames and that the intensity of a specific point within a moving object will remain unchanged. Nevertheless, the precision of speckle tracking tends to diminish when the temporal intervals between frames become substantial.

2.1.2 Feature-based Methods

Feature-based algorithms utilize image attributes such as surfaces, texture, or curvature, allowing consistent tracking over time [20, 22, 24]. These methods generate motion fields by matching the key points across frames based on features. The point matching algorithms include iterative closest point matching [33] and robust point matching [34]. Although efficient, these approaches are limited by the dependency on accurate segmentation and often produce sparse motion fields that need additional interpolation to generate motion in the desired locations.

2.1.3 Deep Learning-based Methods

State-of-the-art methods are driven by deep learning, gaining traction for their computational efficiency and pattern recognition capabilities. In the realm of supervised learning, a variety of methods have emerged: Rohé et al. [35] developed a method that leveraged a convolutional neural network (CNN) guided by the mesh segmentation of paired images to compute transformations between pairs of CMR images. Parajuli et al. [21] introduced the flow network tracker (FNT), employing a convolutional autoencoder to extract a point cloud from myocardium segmentation, followed by point tracking across time frames. Wu et al. [36] utilized a deep boltzmann machine (DBM) to capture both local and global heart shape variations, enabling frame-by-frame heart motion tracking in CMR images. However, these supervised methodologies predominantly rely on simulated datasets with known ground-truth deformations or expert-labeled myocardium segmentations, which are difficult to collect in real cases.

Limited ground truth motion labels have increased interest in unsupervised motion tracking. This has led to the evolution of methods employing diverse U-Net architectures [37] for cardiac motion tracking [28, 38, 39]. However, these methods have mostly been limited to the analysis of 2D heart images or CMR images, leaving motion tracking on 3DE an enduring challenge. Recent works, such as Ta et al. [29] with a proposed semi-supervised joint network and Ahn et al. [30] with a co-attention spatial transformer network, both for pairs of 3D heart images motion estimation, have shown promising advancements. Yet, these approaches remain limited in concurrently predicting the entire cardiac cycle’s motion, necessitating extensive training datasets and exhibiting susceptibility to data bias.

Deep learning-based methods have shown significant promise in cardiac motion tracking. Nonetheless, no methods can predict complete myocardial motion throughout the cardiac cycle without relying on simulated or expert-labeled datasets, a factor that is crucial for clinical use.

2.2 Implicit Neural Representation

The implicit neural representation (INR) [40], also referred to as neural fields [41], characterizes the function of a signal by mapping the coordinate to its attribute using a neural network. Different from previous deep-learning techniques [42], which are supervised by a large-scale paired dataset, the training of INR follows a self-supervised mechanism. This self-supervised mechanism can be embedded with various differentiable physics operators, making the INR naturally suitable for solving domain-specific inverse problems where large-scale paired datasets are unavailable. As a result, INR has been widely applied in various tasks, such as the neural rendering [41, 43], microscopy imaging [44, 45], material design [46] and partial differential equations solver [47].

Recently, INR has gained prominence in medical imaging. Focusing on the realm of 2D/3D image reconstruction, the prevalent modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound have seen numerous advancements. For instance, in the context of CT image reconstruction, Shen et al. [48] introduced a method that used prior images and INR to reconstruct high-quality 2D CT images from sparse samples. Zha et al. [49] proposed a method for sparse-view cone beam computed tomography (CBCT) reconstruction, using INR to model the attenuation field and optimize it by minimizing discrepancies between actual and synthesized projections. In the MRI reconstruction realm, Wu et al. [50] used high and low-resolution MRI image pairs to represent the 3D brain surface, processed the low-resolution images with a super-resolution CNN, and combined the extracted features with high-resolution coordinates in an MLP decoder to determine 3D voxel intensity. For ultrasound reconstruction, Shen et al. [51] proposed CardiacField to produce a high-fidelity 3D heart and automatically estimate heart function using only low-cost 2DE probes.

While numerous methods have employed INR to address challenges in medical imaging, none have explored its application for cardiac motion analysis in ultrasound modality, which is paramount in the clinical assessment of cardiac function.

3 Methodology

Refer to caption
Figure 2: The figure illustrates the workflow of our method. (a) Inputs consist of 4D space-time coordinates (x,y,z,t)𝑥𝑦𝑧𝑡(x,y,z,t)( italic_x , italic_y , italic_z , italic_t ). Corresponding outputs include the intensity value O(x,y,z,t)𝑂𝑥𝑦𝑧𝑡{O}(x,y,z,t)italic_O ( italic_x , italic_y , italic_z , italic_t ) at specified space-time locations, as well as forward and backward 3D motion vectors M=(mtt+1,mtt1)𝑀subscript𝑚𝑡𝑡1subscript𝑚𝑡𝑡1{\vec{M}}=(\vec{m}_{t\rightarrow t+1},\vec{m}_{t\rightarrow t-1})over→ start_ARG italic_M end_ARG = ( over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT , over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ) that represent the movements of (x,y,z)𝑥𝑦𝑧(x,y,z)( italic_x , italic_y , italic_z ) at successive time frames t+1𝑡1t+1italic_t + 1 and t1𝑡1t-1italic_t - 1, respectively. (b) Our approach can be applied to both 2DE and 3DE videos. (c) The methodology utilizes two main optimization strategies: physical motion consistency loss, ensuring that intensity remains constant within both the heart’s tissues and blood-filled areas throughout the entire cardiac cycle, and physical cardiac cycle consistency loss, which aligns the process with the physiological fact that the heart muscle returns to its initial shape upon completing one full cardiac cycle.

3.1 Overview

Aligned with our aim to attain a comprehensive understanding of cardiac dynamics, we present a self-supervised methodology that utilizes INR to accurately estimate the 3D motion of the myocardium throughout the entire cardiac cycle. The method is applicable to both 2DE and 3DE videos. As depicted in Fig. 2, the inputs of our method are the 4D space-time coordinates (x,y,z,t)superscript𝑥𝑦𝑧𝑡top(x,y,z,t)^{\top}( italic_x , italic_y , italic_z , italic_t ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT in the heart. The corresponding outputs consist of the intensity value O(x,y,z,t)𝑂𝑥𝑦𝑧𝑡{O}(x,y,z,t)italic_O ( italic_x , italic_y , italic_z , italic_t ) at the specified space-time location, along with the forward and backward motion vectors M={mtt+1,mtt1}𝑀subscript𝑚𝑡𝑡1subscript𝑚𝑡𝑡1{\vec{M}}=\{\vec{m}_{t\rightarrow t+1},\vec{m}_{t\rightarrow t-1}\}over→ start_ARG italic_M end_ARG = { over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT , over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT } of the location (x,y,z,t)superscript𝑥𝑦𝑧𝑡top(x,y,z,t)^{\top}( italic_x , italic_y , italic_z , italic_t ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT at successive time frames t+1𝑡1t+1italic_t + 1 and t1𝑡1t-1italic_t - 1, respectively.

Our approach extends our previous work, CardiacField [51], by integrating the temporal dimension into the network input, enabling the quantification of the motion of the myocardium throughout the entire cardiac cycle. In the following sections, we will begin by briefly reviewing the core of the CardiacField [51]. Subsequently, we will detail the NeuralCMF and its optimization strategy.

3.2 Background

Our previous study, CardiacField [51], focuses on reconstructing the 3D heart from a multi-view 2DE video captured from different angles. To achieve this, it represents a static 3D heart as a continuous function using the INR, denoted by Fθstaticsuperscriptsubscript𝐹𝜃𝑠𝑡𝑎𝑡𝑖𝑐F_{\theta}^{static}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t italic_a italic_t italic_i italic_c end_POSTSUPERSCRIPT. This function takes 3D positions X=(x,y,z)𝑋superscript𝑥𝑦𝑧top\vec{X}=(x,y,z)^{\top}over→ start_ARG italic_X end_ARG = ( italic_x , italic_y , italic_z ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT as input and yields the corresponding volume intensity O(X)𝑂𝑋{O}(\vec{X})italic_O ( over→ start_ARG italic_X end_ARG ),

O(X)=Fθstatic(X),𝑂𝑋superscriptsubscript𝐹𝜃𝑠𝑡𝑎𝑡𝑖𝑐𝑋{O}(\vec{X})=F_{\theta}^{static}(\vec{X}),italic_O ( over→ start_ARG italic_X end_ARG ) = italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t italic_a italic_t italic_i italic_c end_POSTSUPERSCRIPT ( over→ start_ARG italic_X end_ARG ) , (1)

where the function Fθstaticsuperscriptsubscript𝐹𝜃𝑠𝑡𝑎𝑡𝑖𝑐F_{\theta}^{static}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t italic_a italic_t italic_i italic_c end_POSTSUPERSCRIPT is modeled using a multi-resolution hash-based MLP network [52, 53]. To supervise the training of the network parameters in CardiacField [51], the physical imaging process of 2DE images is introduced to construct the physics-informed loss function. In detail, each 2DE image can be viewed as applying a virtual slicing operation to the 3D cardiac volume according to the angles of different images, i.e.,

𝐈s=𝒮23O(X),subscript𝐈𝑠superscriptsubscript𝒮23𝑂𝑋\mathbf{I}_{s}=\mathcal{S}_{2}^{3}\circ\mathcal{B}\circ{O}(\vec{X}),bold_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = caligraphic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ∘ caligraphic_B ∘ italic_O ( over→ start_ARG italic_X end_ARG ) , (2)

where 𝒮MNsuperscriptsubscript𝒮𝑀𝑁\mathcal{S}_{M}^{N}caligraphic_S start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT is the slicing operator that reduces an N𝑁Nitalic_N-dimensional function down to an M𝑀Mitalic_M-dimensional one by zeroing out the last NM𝑁𝑀N-Mitalic_N - italic_M dimensions. \mathcal{B}caligraphic_B characterizes the angle of different images and is used to transform the coordinates from the global world coordinate system to the image plane system. Subsequently, the 2DE images {𝐈si}i=1ksuperscriptsubscriptsuperscriptsubscript𝐈𝑠𝑖𝑖1𝑘{\{\mathbf{I}_{s}^{i}\}}_{i=1}^{k}{ bold_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT derived from O(X)𝑂𝑋{O}(\vec{X})italic_O ( over→ start_ARG italic_X end_ARG ), are compared to {𝐈gi}i=1ksuperscriptsubscriptsuperscriptsubscript𝐈𝑔𝑖𝑖1𝑘{\{\mathbf{I}_{g}^{i}\}}_{i=1}^{k}{ bold_I start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT in the input 2DE images using a mean squared error loss.

CardiacField=1KiKvec(𝐈si)vec(𝐈gi)22,subscript𝐶𝑎𝑟𝑑𝑖𝑎𝑐𝐹𝑖𝑒𝑙𝑑1𝐾superscriptsubscript𝑖𝐾subscriptsuperscriptdelimited-∥∥vecsuperscriptsubscript𝐈𝑠𝑖vecsuperscriptsubscript𝐈𝑔𝑖22\mathcal{L}_{CardiacField}=\frac{1}{K}\sum_{i}^{K}\left\lVert\text{vec}(% \mathbf{I}_{s}^{i})-\text{vec}(\mathbf{I}_{g}^{i})\right\rVert^{2}_{2},caligraphic_L start_POSTSUBSCRIPT italic_C italic_a italic_r italic_d italic_i italic_a italic_c italic_F italic_i italic_e italic_l italic_d end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∥ vec ( bold_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) - vec ( bold_I start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , (3)

where K𝐾Kitalic_K refers to the number of input 2DE images, 𝐈sisuperscriptsubscript𝐈𝑠𝑖\mathbf{I}_{s}^{i}bold_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT and 𝐈gisuperscriptsubscript𝐈𝑔𝑖\mathbf{I}_{g}^{i}bold_I start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT denote the i𝑖iitalic_i-th synthesized image from the network and the corresponding ground truth input image. The 2subscriptdelimited-∥∥2\lVert\cdot\rVert_{2}∥ ⋅ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT denotes the L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT-norm of the vector or matrix. The vec()vec\text{vec}(\cdot)vec ( ⋅ ) denotes the vectorization of a matrix. Finally, the loss CardiacFieldsubscript𝐶𝑎𝑟𝑑𝑖𝑎𝑐𝐹𝑖𝑒𝑙𝑑\mathcal{L}_{CardiacField}caligraphic_L start_POSTSUBSCRIPT italic_C italic_a italic_r italic_d italic_i italic_a italic_c italic_F italic_i italic_e italic_l italic_d end_POSTSUBSCRIPT is used to update the network parameters θ𝜃\thetaitalic_θ in Fθstaticsuperscriptsubscript𝐹𝜃𝑠𝑡𝑎𝑡𝑖𝑐F_{\theta}^{static}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t italic_a italic_t italic_i italic_c end_POSTSUPERSCRIPT and the angle parameters in \mathcal{B}caligraphic_B following the backward gradient flow. Once the training of Fθstaticsuperscriptsubscript𝐹𝜃𝑠𝑡𝑎𝑡𝑖𝑐F_{\theta}^{static}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t italic_a italic_t italic_i italic_c end_POSTSUPERSCRIPT converges, the 3D structure of the heart could be reconstructed by feeding the coordinate X𝑋\vec{X}over→ start_ARG italic_X end_ARG to the Fθstaticsuperscriptsubscript𝐹𝜃𝑠𝑡𝑎𝑡𝑖𝑐F_{\theta}^{static}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t italic_a italic_t italic_i italic_c end_POSTSUPERSCRIPT one by one. As a result, CardiacField could be used to reconstruct the 3D heart from 2DE video without attaching any other position sensors.

3.3 Neural Cardiac Motion Field

In order to quantify the deformation of the myocardium, we extend the static 3D heart representation in CardiacField [51] to dynamic reconstruction and formulate the dense 3D motion field of the myocardium using the INR. Given a 3D position X𝑋\vec{X}over→ start_ARG italic_X end_ARG at a specific time t𝑡titalic_t, our model FθMotionsuperscriptsubscript𝐹𝜃𝑀𝑜𝑡𝑖𝑜𝑛F_{\theta}^{Motion}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M italic_o italic_t italic_i italic_o italic_n end_POSTSUPERSCRIPT predicts not only the volume intensity O(Xt,t)𝑂subscript𝑋𝑡𝑡{O}(\vec{X}_{t},t)italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) but also the forward and backward motion vectors. These vectors are expressed as M(Xt,t)={mtt+1(Xt,t),mtt1(Xt,t)}𝑀subscript𝑋𝑡𝑡subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡{\vec{M}(\vec{X}_{t},t)}=\{\vec{m}_{t\rightarrow t+1}(\vec{X}_{t},t),\vec{m}_{% t\rightarrow t-1}(\vec{X}_{t},t)\}over→ start_ARG italic_M end_ARG ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) = { over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) , over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) }. Specifically, the forward 3D offset vector mtt+1(Xt,t)subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\vec{m}_{t\rightarrow t+1}(\vec{X}_{t},t)over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) is represented as (xf,yf,zf)superscriptsubscript𝑥𝑓subscript𝑦𝑓subscript𝑧𝑓top(x_{f},y_{f},z_{f})^{{\top}}( italic_x start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, detailing the displacement from time frame t𝑡titalic_t to t+1𝑡1t+1italic_t + 1. Similarly, the backward 3D offset vector mtt1(Xt,t)subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\vec{m}_{t\rightarrow t-1}(\vec{X}_{t},t)over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) is denoted as (xb,yb,zb)superscriptsubscript𝑥𝑏subscript𝑦𝑏subscript𝑧𝑏top(x_{b},y_{b},z_{b})^{{\top}}( italic_x start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, which encapsulates the displacement from time frame t𝑡titalic_t to t1𝑡1t-1italic_t - 1. These vectors provide a comprehensive description of the motion in 6D, accounting for spatial displacement in three dimensions over two time intervals. Accordingly, the dynamic model is mathematically defined as follows,

{O(Xt,t),M(Xt,t)}=FθMotion(Xt,t).𝑂subscript𝑋𝑡𝑡𝑀subscript𝑋𝑡𝑡superscriptsubscript𝐹𝜃𝑀𝑜𝑡𝑖𝑜𝑛subscript𝑋𝑡𝑡\{{O}(\vec{X}_{t},t),{\vec{M}(\vec{X}_{t},t)}\}=F_{\theta}^{Motion}(\vec{X}_{t% },t).{ italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) , over→ start_ARG italic_M end_ARG ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) } = italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M italic_o italic_t italic_i italic_o italic_n end_POSTSUPERSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) . (4)

As depicted in Fig. 2, the network FθMotionsuperscriptsubscript𝐹𝜃𝑀𝑜𝑡𝑖𝑜𝑛F_{\theta}^{Motion}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M italic_o italic_t italic_i italic_o italic_n end_POSTSUPERSCRIPT consists of a static and dynamic stream. The static stream takes the 3D position vector Xtsubscript𝑋𝑡\vec{X}_{t}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT as its input, while the dynamic stream’s input consists of the concatenation of the 3D position vector Xtsubscript𝑋𝑡\vec{X}_{t}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and time t𝑡titalic_t. The networks within the static Fθ1stsuperscriptsubscript𝐹subscript𝜃1𝑠𝑡F_{\theta_{1}}^{st}italic_F start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t end_POSTSUPERSCRIPT and dynamic Fθ2dysuperscriptsubscript𝐹subscript𝜃2𝑑𝑦F_{\theta_{2}}^{dy}italic_F start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d italic_y end_POSTSUPERSCRIPT streams are formulated using the SIREN network [40]. Each network has 4 layers, with 64 neurons in each hidden layer. The outputs of the two streams are static and dynamic feature vectors with length 32, respectively. The static feature vector captures the heart’s shape in the first frame. The dynamic feature vectors are essential to capture the transient changes of the myocardium that occur over time. These feature vectors are then concatenated and fed into a feature aggregation network Fθ3Agsuperscriptsubscript𝐹subscript𝜃3𝐴𝑔F_{\theta_{3}}^{Ag}italic_F start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_A italic_g end_POSTSUPERSCRIPT, comprising a single layer with 64 neurons. The output 𝐇𝐇\mathbf{H}bold_H of the feature aggregation network is further fed into three distinct layers (Lθ4subscript𝐿subscript𝜃4L_{\theta_{4}}italic_L start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT end_POSTSUBSCRIPT, Lθ5subscript𝐿subscript𝜃5L_{\theta_{5}}italic_L start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT end_POSTSUBSCRIPT, and Lθ6subscript𝐿subscript𝜃6L_{\theta_{6}}italic_L start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT end_POSTSUBSCRIPT) to obtain the volume intensity O(Xt,t)𝑂subscript𝑋𝑡𝑡{O}(\vec{X}_{t},t)italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) and motion vectors mtt+1(Xt,t)subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\vec{m}_{t\rightarrow t+1}(\vec{X}_{t},t)over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) and mtt1(Xt,t)subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\vec{m}_{t\rightarrow t-1}(\vec{X}_{t},t)over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ),

𝐇=Fθ3Ag(Concat(Fθ1st(Xt),Fθ2dy(Xt,t)))O(Xt,t)=σ(Lθ4(𝐇))mtt+1(Xt,t)=tanh(Lθ5(𝐇))mtt1(Xt,t)=tanh(Lθ6(𝐇)),\begin{aligned} \mathbf{H}=F_{\theta_{3}}^{Ag}(\text{Concat}&(F_{\theta_{1}}^{% st}(\vec{X}_{t}),F_{\theta_{2}}^{dy}(\vec{X}_{t},t)))\\ {O}(\vec{X}_{t},t)&=\sigma(L_{\theta_{4}}(\mathbf{H}))\\ \vec{m}_{t\rightarrow t+1}(\vec{X}_{t},t)&=\tanh(L_{\theta_{5}}(\mathbf{H}))\\ \vec{m}_{t\rightarrow t-1}(\vec{X}_{t},t)&=\tanh(L_{\theta_{6}}(\mathbf{H}))% \end{aligned},start_ROW start_CELL bold_H = italic_F start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_A italic_g end_POSTSUPERSCRIPT ( Concat end_CELL start_CELL ( italic_F start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t end_POSTSUPERSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , italic_F start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d italic_y end_POSTSUPERSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ) ) end_CELL end_ROW start_ROW start_CELL italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) end_CELL start_CELL = italic_σ ( italic_L start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_H ) ) end_CELL end_ROW start_ROW start_CELL over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) end_CELL start_CELL = roman_tanh ( italic_L start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_H ) ) end_CELL end_ROW start_ROW start_CELL over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) end_CELL start_CELL = roman_tanh ( italic_L start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_H ) ) end_CELL end_ROW , (5)

where θisubscript𝜃𝑖\theta_{i}italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the parameters of the corresponding neural network. σ𝜎\sigmaitalic_σ and tanh\tanhroman_tanh refer to the Sigmoid and Tanh activation functions, respectively.

3.4 Physics-informed Optimization

To supervise the training of the network parameters, a physics-informed loss function is proposed,

=α1imageStaticPrior+motion+cycleMotionPrior+α3regsubscriptsubscript𝛼1subscript𝑖𝑚𝑎𝑔𝑒𝑆𝑡𝑎𝑡𝑖𝑐𝑃𝑟𝑖𝑜𝑟subscriptsubscript𝑚𝑜𝑡𝑖𝑜𝑛subscript𝑐𝑦𝑐𝑙𝑒𝑀𝑜𝑡𝑖𝑜𝑛𝑃𝑟𝑖𝑜𝑟subscript𝛼3subscript𝑟𝑒𝑔\mathcal{L}=\underbrace{\alpha_{1}\mathcal{L}_{image}}_{Static\>Prior}+% \underbrace{\mathcal{L}_{motion}+\mathcal{L}_{cycle}}_{Motion\>Prior}+{\alpha_% {3}\mathcal{L}_{reg}}caligraphic_L = under⏟ start_ARG italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT end_ARG start_POSTSUBSCRIPT italic_S italic_t italic_a italic_t italic_i italic_c italic_P italic_r italic_i italic_o italic_r end_POSTSUBSCRIPT + under⏟ start_ARG caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT end_ARG start_POSTSUBSCRIPT italic_M italic_o italic_t italic_i italic_o italic_n italic_P italic_r italic_i italic_o italic_r end_POSTSUBSCRIPT + italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT (6)

where imagesubscript𝑖𝑚𝑎𝑔𝑒\mathcal{L}_{image}caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT and motionsubscript𝑚𝑜𝑡𝑖𝑜𝑛\mathcal{L}_{motion}caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT, cyclesubscript𝑐𝑦𝑐𝑙𝑒\mathcal{L}_{cycle}caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT respectively model the static and dynamic physical knowledge priors, regsubscript𝑟𝑒𝑔\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT regularizes the loss function. The terms α1subscript𝛼1\alpha_{1}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and α3subscript𝛼3\alpha_{3}italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT represent the weight of the physical imaging loss and regularization loss. Subsequent sections will provide a detailed elaboration of different components in the loss function.

3.4.1 Physical Imaging Loss

The physical imaging loss focuses on modeling the static 3D structure information in NeuralCMF according to the input echocardiogram videos. Because the physical imaging process of the 3DE varies from the 2DE, the formula for \mathcal{L}caligraphic_L varies accordingly:

imagesubscript𝑖𝑚𝑎𝑔𝑒\displaystyle\mathcal{L}_{image}caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT (7)
=\displaystyle== {t=0T1XtO(Xt,t)𝐈g(Xt,t)223D  Echo.t=0T1CardiacField(Eqn.3)2D  Echo.,\displaystyle\left\{\begin{array}[]{ll}\sum_{t=0}^{T-1}\sum_{\vec{X}_{t}}\left% \lVert O(\vec{X}_{t},t)-\mathbf{I}_{g}(\vec{X}_{t},t)\right\lVert_{2}^{2}&% \text{3D\>\>Echo.}\\ \sum_{t=0}^{T-1}\mathcal{L}_{CardiacField}\>(\text{Eqn.}\>\ref{eqn:loss_phy_% loss})&\text{2D\>\>Echo.}\end{array}\right.,{ start_ARRAY start_ROW start_CELL ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) - bold_I start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL 3D Echo. end_CELL end_ROW start_ROW start_CELL ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_C italic_a italic_r italic_d italic_i italic_a italic_c italic_F italic_i italic_e italic_l italic_d end_POSTSUBSCRIPT ( Eqn. ) end_CELL start_CELL 2D Echo. end_CELL end_ROW end_ARRAY ,

where 𝐈gsubscript𝐈𝑔\mathbf{I}_{g}bold_I start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT refers to the input 3DE or 2DE videos, Xtsubscript𝑋𝑡\vec{X}_{t}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is a 3D point at time t𝑡titalic_t and its definition domain is determined by the resolution of input videos, e.g., Xt{0,1,,159}3subscript𝑋𝑡superscript011593\vec{X}_{t}\in\{0,1,...,159\}^{3}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ { 0 , 1 , … , 159 } start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT for an input 3DE video with resolution 1603superscript1603160^{3}160 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. When a 3DE video is given, the 3D structure of the heart can be supervised directly by comparing the network output O(Xt,t)𝑂subscript𝑋𝑡𝑡O(\vec{X}_{t},t)italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) with the input video. However, when a 2DE video is given, the slicing operation (Eqn. 2) in 2D echocardiography should be involved to generate a 2D image for comparison with the input 2D image. Note that the parameters for characterizing the angles of different 2D images are also optimized automatically in the training process.

3.4.2 Physical Motion Consistency Loss

In echocardiogram videos, the intensity remains consistent in both the heart’s tissues and blood-filled areas throughout the entire cardiac cycle [54]. Taking inspiration from this ‘consistent intensity’ prior, we propose the physical motion consistency loss, ensuring the continuity of intensity for any given point within the 3D heart over time. Given a point X𝑋\vec{X}over→ start_ARG italic_X end_ARG in the 3D heart at time t𝑡titalic_t, i.e., Xtsubscript𝑋𝑡\vec{X}_{t}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, its corresponding point at time t+1𝑡1t+1italic_t + 1 and t1𝑡1t-1italic_t - 1 can be calculated by,

Xt+1subscript𝑋𝑡1\displaystyle\vec{X}_{t+1}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT =Xt+mtt+1(Xt,t)absentsubscript𝑋𝑡subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\displaystyle=\vec{X}_{t}+\vec{m}_{t\rightarrow t+1}(\vec{X}_{t},t)= over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) (8)
Xt1subscript𝑋𝑡1\displaystyle\vec{X}_{t-1}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT =Xt+mtt1(Xt,t).absentsubscript𝑋𝑡subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\displaystyle=\vec{X}_{t}+\vec{m}_{t\rightarrow t-1}(\vec{X}_{t},t).= over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) .

Then, these new positions are fed into the FθMotionsuperscriptsubscript𝐹𝜃𝑀𝑜𝑡𝑖𝑜𝑛F_{\theta}^{Motion}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M italic_o italic_t italic_i italic_o italic_n end_POSTSUPERSCRIPT to obtain the corresponding intensity values, forward and backward motions. Following the ‘consistent intensity’ prior, the physical motion consistency loss could be mathematically modeled as follows,

motionsubscript𝑚𝑜𝑡𝑖𝑜𝑛\displaystyle\mathcal{L}_{motion}caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT (9)
=\displaystyle== α1t=0T1Xti{±1}(O(Xt+i,t+i)O(Xt,t)22\displaystyle\alpha_{1}\sum_{t=0}^{T-1}\sum_{\vec{X}_{t}}\sum_{{i\in\{\pm 1\}}% }(\left\lVert{O}(\vec{X}_{t+i},t+i)-{O}(\vec{X}_{t},t)\right\rVert^{2}_{2}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ { ± 1 } end_POSTSUBSCRIPT ( ∥ italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + italic_i end_POSTSUBSCRIPT , italic_t + italic_i ) - italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT
+\displaystyle++ α2Xt+i+mtti(Xt+i,t+i)Xt22),\displaystyle{\alpha_{2}\left\lVert\vec{X}_{t+i}+\vec{m}_{t\rightarrow t-i}(% \vec{X}_{t+i},t+i)-\vec{X}_{t}\right\rVert^{2}_{2}}),italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + italic_i end_POSTSUBSCRIPT + over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - italic_i end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + italic_i end_POSTSUBSCRIPT , italic_t + italic_i ) - over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ,

where T𝑇Titalic_T denotes the total frames in an entire cardiac cycle.

3.4.3 Physical Cardiac Cycle Consistency Loss

To enhance the precision of our model and align it with the physiological realities of cardiac function, we incorporate a physical cardiac cycle consistency loss. This loss reflects the fundamental cyclical behavior of cardiac motion throughout a single heart cycle, recognizing that the heart muscle returns to its initial shape after completing one cardiac cycle. The cycle consistency involves two primary constraints:

  • Intensity constraint. The intensity of a particular point at the end of a cardiac cycle should closely align with its intensity at the beginning of the cycle. This intensity consistency condition reinforces the model’s alignment with physical principles, ensuring the intensity is consistent across a full cycle.

  • Motion constraint. The heart’s cyclic motion dictates that the cumulative forward or backward motions over the course of one cardiac cycle should sum to zero. This constraint ensures that the cardiac model adheres to the biological principle that the heart returns to its original shape after a complete cycle.

Mathematically, these two constraints can be formulated as,

cyclesubscript𝑐𝑦𝑐𝑙𝑒\displaystyle\mathcal{L}_{cycle}caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT =α1t=0T1XtO(Xt+T,t+T)O(Xt,t)22absentsubscript𝛼1superscriptsubscript𝑡0𝑇1subscriptsubscript𝑋𝑡subscriptsuperscriptdelimited-∥∥𝑂subscript𝑋𝑡𝑇𝑡𝑇𝑂subscript𝑋𝑡𝑡22\displaystyle={\alpha_{1}}\sum_{t=0}^{T-1}\sum_{\vec{X}_{t}}\left\lVert{O}(% \vec{X}_{t+T},t+T)-{O}(\vec{X}_{t},t)\right\rVert^{2}_{2}= italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + italic_T end_POSTSUBSCRIPT , italic_t + italic_T ) - italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT (10)
+α2i{±1}t=0T1mtt+i(Xt,t)22.subscript𝛼2subscript𝑖plus-or-minus1subscriptsuperscriptdelimited-∥∥superscriptsubscript𝑡0𝑇1subscript𝑚𝑡𝑡𝑖subscript𝑋𝑡𝑡22\displaystyle+\alpha_{2}\sum_{i\in\{\pm 1\}}\left\lVert\sum_{t=0}^{T-1}\vec{m}% _{t\rightarrow t+i}(\vec{X}_{t},t)\right\rVert^{2}_{2}.+ italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ { ± 1 } end_POSTSUBSCRIPT ∥ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + italic_i end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT .

3.4.4 Regularization Loss

To reduce the noisy mismatch caused by the low signal-to-noise ratio in the input echocardiogram videos, two regularization terms are introduced in our model. The first term enforces consistency between the predicted forward and backward 3D motion vectors at each time step, ensuring that the forward motion vector mtt+1(Xt,t)subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\vec{m}_{t\rightarrow t+1}(\vec{X}_{t},t)over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ), is congruent with the corresponding backward motion vector, mtt1(Xt,t)subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\vec{m}_{t\rightarrow t-1}(\vec{X}_{t},t)over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ).Additionally, Eqns. 9, the physical motion consistency loss, complements this by aligning the mtt1(Xt,t)subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\vec{m}_{t\rightarrow t-1}(\vec{X}_{t},t)over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) and mtt+1(Xt1,t1)subscript𝑚𝑡𝑡1subscript𝑋𝑡1𝑡1\vec{m}_{t\rightarrow t+1}(\vec{X}_{t-1},t-1)over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , italic_t - 1 ) to have equivalent magnitudes and opposite directions. Overall, these provisions ensure that the forward motion at each time step have the same magnitude and direction. This consistency contributes to smoother forward motion across consecutive time steps. Concurrently, the purpose of the second term is to limit the magnitude of the 3D motion vectors at each time step, effectively preventing the occurrence of excessively large motions. By combining these two terms, the model effectively eliminates unpredictable and excessive movements, ensuring the motion is fluid, consistent, and smooth. The regularization loss is thus formulated as follows,

reg=t=0T1Xt(mtt+1(Xt,t)+mtt1(Xt,t)1\displaystyle\mathcal{L}_{reg}=\sum_{t=0}^{T-1}\sum_{\vec{X}_{t}}(\left\lVert% \vec{m}_{t\rightarrow t+1}(\vec{X}_{t},t)+\vec{m}_{t\rightarrow t-1}(\vec{X}_{% t},t)\right\rVert_{1}caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( ∥ over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) + over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (11)
+mtt+1(Xt,t)1+mtt1(Xt,t)1).\displaystyle+\left\lVert\vec{m}_{t\rightarrow t+1}(\vec{X}_{t},t)\right\rVert% _{1}+\left\lVert\vec{m}_{t\rightarrow t-1}(\vec{X}_{t},t)\right\rVert_{1}).+ ∥ over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ∥ over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t - 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) .

The algorithm presented in Algorithm 1 outlines the detailed pipeline of the proposed NeuralCMF. This algorithm aims to reconstruct the 3D structure and motion vectors from input multi-view 2DE or 3DE videos, all without requiring paired datasets. Initially, a batch of 4D coordinates is randomly selected and normalized (line 5). Subsequently, the intensity and motion vectors are obtained by inputting these coordinates into the respective neural networks (lines 6-9). Following this, the intensity and motion vectors are computed after moving forward and backward through one cardiac cycle (lines 10-14). In the fourth step, various loss functions are computed based on the definitions provided in Eqns. 7 through 11 (lines 15). Finally, the combined loss function (line 16) supervises the update of the network parameters (line 17). These steps are iterated thousands of times. Upon convergence of training, the reconstructed motion vectors at any given space-time point can be obtained by inputting the coordinates into the network (line 6).

Input: Spatial-temporal coordinates set {(Xt,t)}subscript𝑋𝑡𝑡\{(\vec{X}_{t},t)\}{ ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) }. Multi-view 2DE/3DE video within one cardiac cycle.
1 Initialize the parameters θ𝜃\thetaitalic_θ of Fθmotionsuperscriptsubscript𝐹𝜃𝑚𝑜𝑡𝑖𝑜𝑛F_{\theta}^{motion}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUPERSCRIPT.
2 Normalize coordinates within the range of [0,1]01[0,1][ 0 , 1 ].
3 for each training iteration do
4       Randomly select a batch of (Xt,t)subscript𝑋𝑡𝑡(\vec{X}_{t},t)( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ).
5       Compute O(Xt,t),M(Xt,t)𝑂subscript𝑋𝑡𝑡𝑀subscript𝑋𝑡𝑡{O}(\vec{X}_{t},t),{\vec{M}(\vec{X}_{t},t)}italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) , over→ start_ARG italic_M end_ARG ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) by Eqn. 4.
6       Compute Xt+1subscript𝑋𝑡1\vec{X}_{t+1}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT and Xt1subscript𝑋𝑡1\vec{X}_{t-1}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT by Eqn. 8.
7       Compute O(Xt+1,t+1),M(Xt+1,t+1)𝑂subscript𝑋𝑡1𝑡1𝑀subscript𝑋𝑡1𝑡1{O}(\vec{X}_{t+1},t+1),{\vec{M}(\vec{X}_{t+1},t+1)}italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_t + 1 ) , over→ start_ARG italic_M end_ARG ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_t + 1 ) by Eqn. 4.
8       Compute O(Xt1,t1),M(Xt1,t1)𝑂subscript𝑋𝑡1𝑡1𝑀subscript𝑋𝑡1𝑡1{O}(\vec{X}_{t-1},t-1),{\vec{M}(\vec{X}_{t-1},t-1)}italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , italic_t - 1 ) , over→ start_ARG italic_M end_ARG ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , italic_t - 1 ) by Eqn. 4.
9       for i from 1 to T do
10             Compute O(Xt+i,t+i),M(Xt+i,t+i)𝑂subscript𝑋𝑡𝑖𝑡𝑖𝑀subscript𝑋𝑡𝑖𝑡𝑖{O}(\vec{X}_{t+i},t+i),{\vec{M}(\vec{X}_{t+i},t+i)}italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + italic_i end_POSTSUBSCRIPT , italic_t + italic_i ) , over→ start_ARG italic_M end_ARG ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t + italic_i end_POSTSUBSCRIPT , italic_t + italic_i ) by Eqn. 4.
11             Compute O(Xti,ti),M(Xti,ti)𝑂subscript𝑋𝑡𝑖𝑡𝑖𝑀subscript𝑋𝑡𝑖𝑡𝑖{O}(\vec{X}_{t-i},t-i),{\vec{M}(\vec{X}_{t-i},t-i)}italic_O ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t - italic_i end_POSTSUBSCRIPT , italic_t - italic_i ) , over→ start_ARG italic_M end_ARG ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t - italic_i end_POSTSUBSCRIPT , italic_t - italic_i ) by Eqn. 4.
12             Compute X(t+i)+1subscript𝑋𝑡𝑖1\vec{X}_{(t+i)+1}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT ( italic_t + italic_i ) + 1 end_POSTSUBSCRIPT and X(ti)1subscript𝑋𝑡𝑖1\vec{X}_{(t-i)-1}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT ( italic_t - italic_i ) - 1 end_POSTSUBSCRIPT by Eqn. 8.
13            
14       end for
15      Compute each loss term using Eqns. 7, 9, 10 and 11.
16       =image+motion+cycle+regsubscript𝑖𝑚𝑎𝑔𝑒subscript𝑚𝑜𝑡𝑖𝑜𝑛subscript𝑐𝑦𝑐𝑙𝑒subscript𝑟𝑒𝑔\mathcal{L}=\mathcal{L}_{image}+\mathcal{L}_{motion}+\mathcal{L}_{cycle}+% \mathcal{L}_{reg}caligraphic_L = caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT.
17       θ=argminθsuperscript𝜃subscript𝜃\theta^{*}=\arg\min_{\theta}\mathcal{L}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_min start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT caligraphic_L.
18 end for
Algorithm 1 Training Details of NeuralCMF

4 Experiments

4.1 Datasets

We employ three distinct echocardiogram video datasets to evaluate the proposed NeuralCMF, each contributing unique characteristics and challenges.

Refer to caption
Figure 3: This figure presents the qualitative results of point tracking using our method, VoxelMorph [31], and Co-AttentionSTN [30] on the STRAUS [32], 2DE, and 3DE video datasets. The query points that need to be tracked during the ED phase are displayed in the first column. The following three columns visually demonstrate the movement of these points across the full cardiac cycle. In the 2DE video datasets, only our results are showcased since other methods are incompatible with multi-view 2DE datasets.

4.1.1 STRAUS Datasets

The open-source 3D strain assessment in ultrasound (STRAUS) dataset [32] consists of 8 distinct volumetric sequences, each corresponding to a specific physiological condition. These include one healthy sequence, four ischemic cases, and three simulations of dilated cardiomyopathy. The ischemic cases are characterized by occlusions in various coronary arteries: the proximal or distal parts of the left anterior descending coronary artery (LADprox and LADdist, respectively), the left circumflex coronary artery (LCX), and the right coronary artery (RCA). The simulations of dilated cardiomyopathy consist of one case with a synchronous activation pattern (sync) and two cases of dyssynchrony due to left bundle branch block (LBBBsmall and LBBBlarge). These dyssynchronous cases are further distinguished by a progressively longer delay in activating the septum and lateral wall. Importantly, this dataset is accompanied by ground truth motion vectors, which have been derived from the movement of the left ventricular (LV) mesh. These vectors are instrumental in modeling the heart’s motion, providing a valuable reference for analyzing and validating computational approaches. The 3D video in each data has a resolution 224×176×208224176208224\times 176\times 208224 × 176 × 208 with 34343434 frames. This comprehensive dataset facilitates a robust examination of various heart conditions and serves as a valuable resource for analyzing and validating cardiac motion modeling.

4.1.2 2DE and 3DE Video Datasets

This dataset contains 127 video pairs from different volunteers, featuring both 2DE and corresponding 3DE videos for each individual. The 2DE video dataset was acquired using commercial PHILIPS EPIQ 7C/IE ELITE machines with S5-1 2DE probes and SIEMENS ACUSON SC2000 PRIME machines with a 4V1c 2DE probe. During the imaging process, sonographers performed 360-degree rotations around the apex of the heart for each individual volunteer. Subsequently, the 2DE videos were synchronized based on concurrently recorded ECG signals. Any extraneous elements, such as text, electrocardiogram information, or other unrelated details outside the scanning sector, were cropped and masked. The images were subsequently resized to 160×160160160160\times 160160 × 160-pixel dimensions utilizing a bicubic interpolation filter and consist of 36 frames. In addition to the 2DE videos, the 3DE video dataset was acquired for each volunteer using commercial PHILIPS EPIQ 7C machines equipped with an X5-1 3DE probe. These 3D datasets offer a resolution of 160×160×160160160160160\times 160\times 160160 × 160 × 160 pixels and contain 12 frames. These 3D images were captured using a standard four-chamber apical view.

4.2 Implementation Details

The proposed NeuralCMF is implemented using the PyTorch framework. We use the FusedAdam optimizer, which is an implementation of the Adam optimizer [55] that fuses multiple CUDA kernels and uses half-precision arithmetic to improve GPU memory usage and performance. The learning rate, initialized as 0.0001, is decayed according to the cosine annealing schedule. We trained our model for 10000100001000010000 iterations with 8192 points sampled in each iteration, and α1=1subscript𝛼11\alpha_{1}=1italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 1, α2=0.1subscript𝛼20.1\alpha_{2}=0.1italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.1, α3=0.01subscript𝛼30.01\alpha_{3}=0.01italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = 0.01. For every individual point, the position coordinate X𝑋\vec{X}over→ start_ARG italic_X end_ARG is normalized to the range [0,1]01\left[0,1\right][ 0 , 1 ]. The time t𝑡titalic_t is subjected to repetition with the period of the cardiac cycle to align our implementation with the cyclic nature of the heartbeat,

tnorm=tmodTT.subscript𝑡𝑛𝑜𝑟𝑚modulo𝑡𝑇𝑇t_{norm}={\frac{t\mod T}{T}}.italic_t start_POSTSUBSCRIPT italic_n italic_o italic_r italic_m end_POSTSUBSCRIPT = divide start_ARG italic_t roman_mod italic_T end_ARG start_ARG italic_T end_ARG . (12)

For both the STRAUS dataset and the 3DE video dataset, NeuralCMF is designed to take 3D video sequences as input. We sample the dataset, extracting 4 images per cardiac cycle. With regard to the 2DE video datasets, NeuralCMF takes the 2D images as input, and our model processes 200 different view sequences.

4.3 Results

Table 1: Quantitative comparisons on STRAUS, 2DE, and 3DE video datasets. For both the 3DE and 2DE video datasets, the ground truth motion is unavailable, making it impossible to compute the MTE and Cosine similarity. In the 2DE video datasets, only our results are showcased since other methods are incompatible with multi-view 2DE datasets.
Datasets Methods Median tracking error (mm) ↓ Cosine similarity ↑ DICE ↑ HD (mm) ↓ Jaccard index ↑
STRAUS (LV Area) VoxelMorph [31] 3.72±0.32plus-or-minus3.720.323.72\pm 0.323.72 ± 0.32 0.68±0.04plus-or-minus0.680.040.68\pm 0.040.68 ± 0.04 0.74±0.05plus-or-minus0.740.050.74\pm 0.050.74 ± 0.05 3.08±0.12plus-or-minus3.080.123.08\pm 0.123.08 ± 0.12 0.58±0.06plus-or-minus0.580.060.58\pm 0.060.58 ± 0.06
Co-AttentionSTN [30] 4.41±0.48plus-or-minus4.410.484.41\pm 0.484.41 ± 0.48 0.69±0.08plus-or-minus0.690.080.69\pm 0.080.69 ± 0.08 0.71±0.04plus-or-minus0.710.040.71\pm 0.040.71 ± 0.04 3.30±0.10plus-or-minus3.300.103.30\pm 0.103.30 ± 0.10 0.56±0.05plus-or-minus0.560.050.56\pm 0.050.56 ± 0.05
Ours 2.55±0.07plus-or-minus2.550.07\textbf{2.55}\pm\textbf{0.07}2.55 ± 0.07 0.85±0.04plus-or-minus0.850.04\textbf{0.85}\pm\textbf{0.04}0.85 ± 0.04 0.82±0.02plus-or-minus0.820.02\textbf{0.82}\pm\textbf{0.02}0.82 ± 0.02 3.02±0.01plus-or-minus3.020.01\textbf{3.02}\pm\textbf{0.01}3.02 ± 0.01 0.69±0.03plus-or-minus0.690.03\textbf{0.69}\pm\textbf{0.03}0.69 ± 0.03
3D Echo (LV Area) VoxelMorph [31] - - 0.69±0.06plus-or-minus0.690.060.69\pm 0.060.69 ± 0.06 3.11±0.14plus-or-minus3.110.143.11\pm 0.143.11 ± 0.14 0.63±0.17plus-or-minus0.630.170.63\pm 0.170.63 ± 0.17
Co-AttentionSTN [30] - - 0.65±0.10plus-or-minus0.650.100.65\pm 0.100.65 ± 0.10 3.20±0.23plus-or-minus3.200.233.20\pm 0.233.20 ± 0.23 0.61±0.09plus-or-minus0.610.090.61\pm 0.090.61 ± 0.09
Ours - - 0.77±0.07plus-or-minus0.770.07\textbf{0.77}\pm\textbf{0.07}0.77 ± 0.07 3.04±0.05plus-or-minus3.040.05\textbf{3.04}\pm\textbf{0.05}3.04 ± 0.05 0.68±0.03plus-or-minus0.680.03\textbf{0.68}\pm\textbf{0.03}0.68 ± 0.03
3D Echo (LA Area) Ours - - 0.78±0.04plus-or-minus0.780.04\textbf{0.78}\pm\textbf{0.04}0.78 ± 0.04 3.04±0.03plus-or-minus3.040.03\textbf{3.04}\pm\textbf{0.03}3.04 ± 0.03 0.68±0.05plus-or-minus0.680.05\textbf{0.68}\pm\textbf{0.05}0.68 ± 0.05
(RA Area) Ours - - 0.75±0.06plus-or-minus0.750.06\textbf{0.75}\pm\textbf{0.06}0.75 ± 0.06 3.20±0.19plus-or-minus3.200.19\textbf{3.20}\pm\textbf{0.19}3.20 ± 0.19 0.67±0.07plus-or-minus0.670.07\textbf{0.67}\pm\textbf{0.07}0.67 ± 0.07
2D Echo (LV Area) Ours - - 0.75±0.08plus-or-minus0.750.08\textbf{0.75}\pm\textbf{0.08}0.75 ± 0.08 3.24±0.40plus-or-minus3.240.40\textbf{3.24}\pm\textbf{0.40}3.24 ± 0.40 0.67±0.04plus-or-minus0.670.04\textbf{0.67}\pm\textbf{0.04}0.67 ± 0.04
(LA Area) Ours - - 0.75±0.05plus-or-minus0.750.05\textbf{0.75}\pm\textbf{0.05}0.75 ± 0.05 3.20±0.12plus-or-minus3.200.12\textbf{3.20}\pm\textbf{0.12}3.20 ± 0.12 0.67±0.03plus-or-minus0.670.03\textbf{0.67}\pm\textbf{0.03}0.67 ± 0.03
(RA Area) Ours - - 0.74±0.09plus-or-minus0.740.09\textbf{0.74}\pm\textbf{0.09}0.74 ± 0.09 3.25±0.21plus-or-minus3.250.21\textbf{3.25}\pm\textbf{0.21}3.25 ± 0.21 0.67±0.09plus-or-minus0.670.09\textbf{0.67}\pm\textbf{0.09}0.67 ± 0.09
Refer to caption
Figure 4: This figure illustrates the tracking performance of our method, VoxelMorph, and Co-AttentionSTN for points on the left atrial (LA) myocardium within 3DE video datasets. The initial column highlights the reference points set for tracking during the ED phase. The subsequent three columns visually demonstrate the movement of these points across the full cardiac cycle. It is of significance to note that our method demonstrates superior accuracy in tracking points on the mitral valve annulus, a challenge not achieved by either VoxelMorph or Co-AttentionSTN.

4.3.1 Evaluation Metrics

The experimental investigations conduct on the STRAUS dataset are evaluated using the provided ground-truth motion vectors. This facilitates the calculation of two key metrics: the median tracking error (MTE) and the cosine similarity. The MTE metric, determined by the 2subscript2\mathcal{L}_{2}caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT-norm between the ground truth and predicted motion vectors, evaluates the quantitative accuracy of the tracking, while the cosine similarity serves as a measure of the alignment of the directional components of the motion vectors, with 1 signifying perfect alignment and -1 indicating completely opposite directions. Together, these metrics provide a comprehensive evaluation of both the magnitude and direction of the computed motion fields.

In the case of the 2DE and 3DE video datasets [51], where ground-truth motion vectors are unavailable, we rely on segmentation-based assessments. We employ metrics such as the hausdorff distance (HD), dice similarity Coefficient and jaccard index, comparing them against provided segmentation masks across two phases. The segmentation masks are labeled by an experienced echocardiographer. Using the learned motion vectors, we warp the 3D images from the end-diastolic (ED) phase to the end-systolic (ES) phase. Subsequently, we compare the myocardium segmentation of the LV in the four-chamber apical view from the warp ES phase to the myocardium segmentation in the real ES phase.

The NeuralCMF is evaluated against various state-of-the-art registration/motion tracking algorithms across each dataset for a robust quantitative comparison. Specific comparisons are made with other deep learning-based methods, including VoxelMorph [31] and Co-AttentionSTN [30]. The STRAUS dataset comprises 8 different 3DE sequences, we employ a 5/1/2 split strategy—utilizing 5 sequences for training, 1 for validation, and 2 for testing. Recognizing the constraints posed by the relatively small dataset, we have implemented a strategy to enhance the diversity and relevance of our training set, following the precedent established by [30]. Specifically, we select additional frame pairs from each sequence that exhibited significant variations in the cardiac cycle, indicated by a DICE overlap below 0.7. This method yields a robust dataset of 415 frame pairs for training and 83 for validation. This strategic selection ensures a more equitable comparison across different methods. For our 3D echo datasets, we prepare 594 images for training, 91 for validation, and select sequences from 100 individuals for testing. It is important to note that unlike the above-mentioned other deep learning-based methods, NeuralCMF does not rely on a training dataset. We directly apply our method to the testing dataset to obtain motion results, which are then used for further comparison. This direct application streamlines the process, eliminating the need for a training phase and showcasing NeuralCMF’s efficiency and adaptability in handling motion tracking tasks across datasets.

Refer to caption
Figure 5: This experiment demonstrates that our model offers a more precise motion estimation, with the warped image aligning closely to the actual physiological structure of the heart. The figure illustrates a comparison of our warped image results with those obtained using other deep-learning-based methods. The warping process involves taking a 3D image at the ED phase and transforming it with estimated displacements to create corresponding 3D images at the ES phase. Subsequently, we slice the apical four-chamber view for a detailed comparison. For clarity, the LV myocardium is segmented, and the DICE index is computed between the ground truth and warped ES phase images.

4.3.2 Results on STRAUS and 3DE Video Datasets

In Table 1, we offer quantitative comparisons of our proposed method using the STRAUS and the 3DE video datasets. The STRAUS dataset is limited exclusively to the LV, necessitating a focused comparison on this region across all methods to ensure an equitable evaluation. Regarding the 3DE video datasets, the critical role of the LV in cardiac analyses within clinical settings directs standard 3D ultrasound data collection predominantly towards the LV. Although myocardial data for the LV, left atrium (LA), and right atrium (RA) are complete, the presence of ribs often obstructs ultrasound imaging, frequently resulting in incomplete myocardial data for the right ventricle (RV) [56, 57]. This limitation impacts the effectiveness of myocardial motion tracking in the RV, prompting us to include experimental analyses for the LV, LA, and RA.

In the LV area, our method consistently outperforms competitors like VoxelMorph and Co-AttentionSTN across all evaluation metrics. It is noteworthy that the DICE coefficients for both the LA and RA regions exceed 0.7, indicating high accuracy in tracking. The results from the 3DE Video Datasets are lower than those from the STRAUS datasets. This discrepancy can be attributed to the higher signal-to-noise ratio present in the STRAUS datasets compared to the 3DE Video Datasets. Figure 3 illustrates the tracking results for both the STRAUS and 3D datasets. The images of LV myocardium at different frames are displayed from left to right. In the leftmost image, a total of 2500 points from the myocardium are selected and colored. Then, the corresponding points obtained by different motion tracking algorithms are plotted in the right 3 images. NeuralCMF demonstrates robust and consistent tracking capabilities throughout the cardiac cycle. Specifically, within the STRAUS datasets, tracking points transition from the ED to the ES phase (representing the largest movement in the cardiac cycle). There are two problems in tracking results of VoxelMorph and Co-AttentionSTN. First, the estimated motion displacement is shorter than the ground truth. The range of colored tracking points on the endocardium appears to be larger than the ground truth, as indicated by the yellow arrow in Fig. 3b1-b4. Second, points on the endocardium, which are uniformly sampled in the ED phase, become non-uniform in the ES phase as shown in Fig. 3b3, resulting in a loss tracking of starting points. In contrast, our method aligns myocardium points with the ground truth more precisely than these competing methods.

Regarding the 3DE video datasets, our approach consistently maintains tracking of the point as the myocardium transitions from ED to ES. Conversely, other methods frequently exhibit tracking inconsistencies, leading to point misplacement or loss during this significant stage. As illustrated in Fig. 3b6 and b7, both VoxelMorph and Co-AttentionSTN fail to track points on the interventricular septum (as indicated by the yellow arrow) during the ES phase. Particularly when myocardium movements are more substantial than in the ED phase, VoxelMorph and Co-AttentionSTN tend to lose track. Our method, however, reliably monitors these points across the entire cardiac cycle. Additionally, as shown in Fig. 3c5 - c7, our method outperforms VoxelMorph and Co-AttentionSTN in tracking myocardium movement, even over prolonged durations. This superiority is evident in the last frame of Fig. 3, where our method continues to track the interventricular septum accurately, whereas VoxelMorph and Co-AttentionSTN both lose their tracking points.

In addition to the LV area, we conducted comprehensive experimental analyses for the LA and RA, regions where data acquisition is typically more complete. To demonstrate the broad applicability and effectiveness of our method across different cardiac regions, we incorporated both quantitative (refer to Table 1) and qualitative assessments (see Fig. 4 and Fig. 9). For instance, as illustrated in Fig. 4, we select key points on the LA myocardium, which are displayed in the first column. The subsequent three columns are present the tracking results of various methods during different phases. Notably, VoxelMorph’s results become distorted during heart movement. In contrast, our method’s tracking outcomes remain precise and consistent. It is pivotal to highlight the tracking of the points on the mitral valve annulus. NeuralCMF stably and consistently tracks its movement, while both VoxelMorph and Co-AttentionSTN fail to track the mitral valve annulus. This further elucidates that only our method can accurately capture the longitudinal movement of LA myocardium during the cardiac cycle.

The comparison in Fig. 5 indicates that our model produces results that align more precisely with the heart’s actual structure. Figure 5 offers visual comparisons on the results of the warping process applied to 3D images. Specifically, the 3D images at the end-diastolic (ED) phase are transformed using the estimated motion vectors to create corresponding 3D images at the end-systolic (ES) phase. We then slice the apical four-chamber view from this 3D model for a detailed comparison. To offer a clearer comparison, we segment the myocardium of the LV and compute the DICE index between the ground truth and the warped images at the ES phase. Notably, the DICE score of our method surpasses that of other methods.

4.3.3 Coronary Heart Disease Classification

To explore the additional value our approach brings to identifying disease pathologies, we conduct an experiment aimed at distinguishing between healthy individuals and patients with CHD using the motion data derived from our novel method. We extract 3D motion vectors from the LV area. Following the [58], these vectors are subsequently projected into a 10-dimensional (10-D) space using Principal Component Analysis (PCA), effectively capturing the core motion information. For the task of classification, we utilize the 10 most distinctive PCA components derived from the LV area. Our findings reveal that by using the motion information from the LV, we attain a 90% accuracy rate in distinguishing healthy individuals from CHD patients, utilizing a 10-fold cross-validation method alongside Support Vector Machines (SVM). It’s important to note that the methods we compared against, namely VoxelMorph [31] and Co-AttentionSTN [30], represent the state of the art in registration/motion tracking algorithms, with classification accuracy rates of 75% and 72% respectively. The results are presented in Table 2. The increased accuracy in our approach can be attributed to the more precise tracking of LV motion, which, as a result, enables more accurate classification of healthy individuals versus CHD patients.Additionally, we apply PCA to reduce the motion information of LV to a 2D vector for each subject, facilitating a visual representation of the separation between the two groups. This visualization is presented in Fig. 6. This visualization elucidates the enhanced classification clarity provided by our method.This experiment highlights the capability of our methodology as an effective instrument for improving disease diagnostics, especially for CHD that may elude detection by existing approaches, e.g. chamber size and ejection fraction (EF).

Refer to caption
Figure 6: Our method significantly enhances the clarity of CHD classification by projecting the motion vectors of test images onto a 2D PCA space. Our method achieves a 90% classification accuracy using SVM with 10-fold cross-validation, while the accuracy of VoxelMorph is 75% and Co-AttentionSTN is 72%.
Table 2: Comparison of CHD classification accuracy among different methods.
Methods Classification Accuracy
VoxelMorph [31] 75%
Co-AttentionSTN [30] 72%
Ours 90%

4.3.4 Results on 2DE Video Datasets

Our model transcends mere motion estimation by concurrently reconstructing a dynamic heart from 2DE videos, showcasing its capacity for handling multiple tasks simultaneously. The performance of our approach is validated through a comprehensive assessment that includes both quantitative and qualitative evaluations. The motion estimation results of the 2DE video datasets are thoroughly detailed in Table 1, offering a statistical insight into the model’s precision and consistency in tracking motion. Figure 3 visually represents the point tracking results, enabling a more intuitive understanding of the method’s effectiveness across cardiac cycles. Additionally, our model’s ability to reconstruct the 3D dynamic heart is evaluated by comparing its reconstruction results with the 3D hearts captured using a 3DE probe, as illustrated in Fig. 7. Compared to the 3DE probe, the 3D heart structure represented by our approach closely aligns with that of the 3DE probe. Moreover, our results exhibit a superior signal-to-noise ratio than the 3DE probe. This experiment validates the multitasking ability of our method, demonstrating its robustness in handling complex cardiac imaging tasks.

Refer to caption
Figure 7: We show snapshots of a reconstructed 3D dynamic heart within one cardiac cycle using our proposed method, and compare these with the outputs from images acquired via the 3DE probe.

4.3.5 Ablation Study

In this section, we explore the architecture of our network through a comprehensive ablation study, the findings of which are shown in Table 3. This analysis reveals that models incorporating static features outperform those relying solely on dynamic features. This may be attributed to the static feature captured by static SIREN, which can represent the cardiac structure that does not change over time, thereby providing a consistent reference frame. Such a reference is crucial for the dynamic SIREN, as it facilitates the differentiation of stationary areas from the areas exhibiting motion, enhancing the capability of the model to capture motion dynamics effectively. Furthermore, we conduct an experiment by substituting SIREN with a combination of positional encoding (PE) and MLP. The results presented in Table 3 reveal a notable decrease in performance upon substituting SIREN with PE + MLP. This finding highlights the superior efficacy of the SIREN network compared to the PE + MLP in our architecture. Our decision to model forward and backward motions separately is not solely based on empirical evidence; it also holds significant physical relevance. This differentiation allows us to ensure that the color of any given point in the current frame remains consistent with both its preceding and succeeding frames, thereby preserving continuity across frames. To substantiate this approach, we conduct additional experiments comparing the performance of a model that only includes forward motion against our method, which treats forward and backward motions distinctly. The findings, presented in Table 3, demonstrate a significant degradation in performance when backward motion is merely the inverse of forward motion. Such a decline is attributable to the model’s oversight of backward motion, wherein a point’s color in the current frame is aligned only with the succeeding frame, neglecting the color continuity with the preceding frame.

Table 3: Ablation Study on Network Structure.
Network Structure MTE Cosine Similarity
Only Dynamic Siren 3.07±0.26plus-or-minus3.070.263.07\pm 0.263.07 ± 0.26 0.76±0.03plus-or-minus0.760.030.76\pm 0.030.76 ± 0.03
PE + MLP 31.08±4.19plus-or-minus31.084.1931.08\pm 4.1931.08 ± 4.19 0.33±0.09plus-or-minus0.330.090.33\pm 0.090.33 ± 0.09
Only forward motion 4.93±0.95plus-or-minus4.930.954.93\pm 0.954.93 ± 0.95 0.54±0.06plus-or-minus0.540.060.54\pm 0.060.54 ± 0.06
Ours 2.55±0.07plus-or-minus2.550.07\textbf{2.55}\pm\textbf{0.07}2.55 ± 0.07 0.85±0.04plus-or-minus0.850.04\textbf{0.85}\pm\textbf{0.04}0.85 ± 0.04

To deeply understand each component within our loss function, we conduct a series of ablation studies. These studies analyze the influence of different components, with the performance metrics present in Table 4. Our analysis reveals that the motion cannot be tracked without physical motion consistency loss motionsubscript𝑚𝑜𝑡𝑖𝑜𝑛\mathcal{L}_{motion}caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT and physical cardiac cycle consistency loss cyclesubscript𝑐𝑦𝑐𝑙𝑒\mathcal{L}_{cycle}caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT. This emphasizes the crucial importance of physical loss we proposed in stabilizing the learning process and ensuring convergence to a solution. Additionally, the results from Table 4 provide compelling evidence for the significance of both coordinate and intensity constraints within our loss function. When the coordinate constraint is removed (indicated as wo/coordinate), we observe a drastic increase in MTE and a decrease in cosine similarity, underscoring the importance of spatial constraints in accurate motion tracking. Similarly, the removal of the intensity constraint (indicated as wo/intensity) results in significant degradation of performance, highlighting the necessity of intensity information for precise motion estimation.

To substantiate the effectiveness of regsubscript𝑟𝑒𝑔\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT term in enhancing motion smoothness, we conduct an experiment visualizing displacement changes between consecutive time points. We calculate displacement vectors between successive time steps using our model, both with and without the implementation of regsubscript𝑟𝑒𝑔\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT. The magnitudes of these vectors are then plotted over time, as shown in Fig. 8. The graph illustrates the displacement magnitudes, where each point on the blue line (model with regsubscript𝑟𝑒𝑔\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT) and the red line (model without regsubscript𝑟𝑒𝑔\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT) represents the displacement from one time step to the next. A smoother curve, as exhibited by the blue line, indicates fewer abrupt changes, suggesting enhanced motion smoothness. In contrast, sharper peaks or significant fluctuations on the red line denote more abrupt changes, indicative of less smooth motion. The results confirm that the model incorporating regsubscript𝑟𝑒𝑔\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT achieves a smoother trajectory, as evidenced by the more gradual fluctuations and reduced slope steepness in the blue line compared to the red line. Furthermore, the ablation study present in Table 4 corroborates that the regularization term not only enhances the visual smoothness but also improves the accuracy of motion tracking.

Refer to caption
Figure 8: The magnitudes of displacement between consecutive time steps for two scenarios: the blue line represents the model with regsubscript𝑟𝑒𝑔\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT, and the red line represents the model without it. Each point reflects the displacement from one time step to the next. Lines with larger slopes indicate rapid changes in displacement, suggesting abrupt or significant motion changes and reduced smoothness. The gradual slopes and reduced fluctuations in the blue line demonstrate the smoothness introduced by regsubscript𝑟𝑒𝑔\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT, compared to the sharper peaks and abrupt transitions in the red line. This comparison highlights the regularization strategy’s effectiveness in enhancing cardiac motion smoothness.
Table 4: Systematic Ablation Study on Loss Function Components.
Loss Function MTE Cosine Similarity
imagesubscript𝑖𝑚𝑎𝑔𝑒\mathcal{L}_{image}caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT 106.44±42.87plus-or-minus106.4442.87106.44\pm 42.87106.44 ± 42.87 0.30±0.01plus-or-minus0.300.010.30\pm 0.010.30 ± 0.01
image+motionsubscript𝑖𝑚𝑎𝑔𝑒subscript𝑚𝑜𝑡𝑖𝑜𝑛\mathcal{L}_{image}+\mathcal{L}_{motion}caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT 123.18±23.46plus-or-minus123.1823.46123.18\pm 23.46123.18 ± 23.46 0.31±0.20plus-or-minus0.310.200.31\pm 0.200.31 ± 0.20
image+motion+cyclesubscript𝑖𝑚𝑎𝑔𝑒subscript𝑚𝑜𝑡𝑖𝑜𝑛subscript𝑐𝑦𝑐𝑙𝑒\mathcal{L}_{image}+\mathcal{L}_{motion}+\mathcal{L}_{cycle}caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT 2.93±0.14plus-or-minus2.930.142.93\pm 0.142.93 ± 0.14 0.75±0.05plus-or-minus0.750.050.75\pm 0.050.75 ± 0.05
image+motion+cycle+regsubscript𝑖𝑚𝑎𝑔𝑒subscript𝑚𝑜𝑡𝑖𝑜𝑛subscript𝑐𝑦𝑐𝑙𝑒subscript𝑟𝑒𝑔\mathcal{L}_{image}+\mathcal{L}_{motion}+\mathcal{L}_{cycle}+\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT 2.55±0.07plus-or-minus2.550.07\textbf{2.55}\pm\textbf{0.07}2.55 ± 0.07 0.85±0.04plus-or-minus0.850.04\textbf{0.85}\pm\textbf{0.04}0.85 ± 0.04
image+motion(w/ocoordinate)+cycle(w/ocoordinate)+regsubscript𝑖𝑚𝑎𝑔𝑒subscript𝑚𝑜𝑡𝑖𝑜𝑛𝑤𝑜𝑐𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒subscript𝑐𝑦𝑐𝑙𝑒𝑤𝑜𝑐𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒subscript𝑟𝑒𝑔\mathcal{L}_{image}+\mathcal{L}_{motion}(w/o\ coordinate)+\mathcal{L}_{cycle}(% w/o\ coordinate)+\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT ( italic_w / italic_o italic_c italic_o italic_o italic_r italic_d italic_i italic_n italic_a italic_t italic_e ) + caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT ( italic_w / italic_o italic_c italic_o italic_o italic_r italic_d italic_i italic_n italic_a italic_t italic_e ) + caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT 128.61±87.62plus-or-minus128.6187.62128.61\pm 87.62128.61 ± 87.62 0.05±0.06plus-or-minus0.050.060.05\pm 0.060.05 ± 0.06
image+motion(w/ointensity)+cycle(w/ointensity)+regsubscript𝑖𝑚𝑎𝑔𝑒subscript𝑚𝑜𝑡𝑖𝑜𝑛𝑤𝑜𝑖𝑛𝑡𝑒𝑛𝑠𝑖𝑡𝑦subscript𝑐𝑦𝑐𝑙𝑒𝑤𝑜𝑖𝑛𝑡𝑒𝑛𝑠𝑖𝑡𝑦subscript𝑟𝑒𝑔\mathcal{L}_{image}+\mathcal{L}_{motion}(w/o\ intensity)+\mathcal{L}_{cycle}(w% /o\ intensity)+\mathcal{L}_{reg}caligraphic_L start_POSTSUBSCRIPT italic_i italic_m italic_a italic_g italic_e end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_m italic_o italic_t italic_i italic_o italic_n end_POSTSUBSCRIPT ( italic_w / italic_o italic_i italic_n italic_t italic_e italic_n italic_s italic_i italic_t italic_y ) + caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT ( italic_w / italic_o italic_i italic_n italic_t italic_e italic_n italic_s italic_i italic_t italic_y ) + caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT 5.75±0.87plus-or-minus5.750.875.75\pm 0.875.75 ± 0.87 0.15±0.10plus-or-minus0.150.100.15\pm 0.100.15 ± 0.10

In tuning the hyperparameters for each component of our loss function, we adopt a systematic approach. Our loss function is composed of three parts: intensity loss with hyperparameter α1subscript𝛼1\alpha_{1}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, coordinate loss with hyperparameter α2subscript𝛼2\alpha_{2}italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and regularization loss with hyperparameter α3subscript𝛼3\alpha_{3}italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT. Initially, we set all hyperparameters to 1 as a baseline. To understand the influence of each hyperparameter on motion estimation, we vary them from 0.01 to 2. The results of this hyperparameter tuning are presented in Table 5, 6 and 7. After careful evaluation of the hyperparameters’ impact on our model’s performance, we observe the following trends from three tables. For the intensity loss, the hyperparameter (α1subscript𝛼1\alpha_{1}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT) shows the optimal results at 1. The coordinate loss hyperparameter (α2subscript𝛼2\alpha_{2}italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT) shows optimal results at 0.1. For the regularization loss hyperparameter (α3subscript𝛼3\alpha_{3}italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT), a value of 0.01 yields the best performance. Consequently, we adopt the hyperparameter settings of 1 for intensity, 0.1 for coordinate, and 0.01 for regularization as our final model configuration, denoted as ’Ours’ in the table. This combination results in the most effective performance, achieving an optimal balance across all metrics.

Table 5: Hyperparameter Sensitivity Analysis of α1subscript𝛼1\alpha_{1}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT.
Hyper Parameters Intensity (α1subscript𝛼1\alpha_{1}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT)
MTE Cosine Similarity
0.01 5.74±0.89plus-or-minus5.740.895.74\pm 0.895.74 ± 0.89 0.13±0.01plus-or-minus0.130.010.13\pm 0.010.13 ± 0.01
0.1 5.75±0.85plus-or-minus5.750.855.75\pm 0.855.75 ± 0.85 0.16±0.12plus-or-minus0.160.120.16\pm 0.120.16 ± 0.12
1 5.64±0.89plus-or-minus5.640.89\textbf{5.64}\pm\textbf{0.89}5.64 ± 0.89 0.32±0.03plus-or-minus0.320.03\textbf{0.32}\pm\textbf{0.03}0.32 ± 0.03
2 5.67±0.80plus-or-minus5.670.805.67\pm 0.805.67 ± 0.80 0.29±0.07plus-or-minus0.290.070.29\pm 0.070.29 ± 0.07
Ours 2.55±0.07plus-or-minus2.550.07\textbf{2.55}\pm\textbf{0.07}2.55 ± 0.07 0.85±0.04plus-or-minus0.850.04\textbf{0.85}\pm\textbf{0.04}0.85 ± 0.04
Table 6: Hyperparameter Sensitivity Analysis of α2subscript𝛼2\alpha_{2}italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.
Hyper Parameters Coordinate (α2subscript𝛼2\alpha_{2}italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT)
MTE Cosine Similarity
0.01 5.64±0.81plus-or-minus5.640.815.64\pm 0.815.64 ± 0.81 0.46±0.13plus-or-minus0.460.130.46\pm 0.130.46 ± 0.13
0.1 5.60±0.88plus-or-minus5.600.88\textbf{5.60}\pm\textbf{0.88}5.60 ± 0.88 0.47±0.09plus-or-minus0.470.09\textbf{0.47}\pm\textbf{0.09}0.47 ± 0.09
1 5.64±0.89plus-or-minus5.640.895.64\pm 0.895.64 ± 0.89 0.32±0.03plus-or-minus0.320.030.32\pm 0.030.32 ± 0.03
2 5.75±0.85plus-or-minus5.750.855.75\pm 0.855.75 ± 0.85 0.30±0.22plus-or-minus0.300.220.30\pm 0.220.30 ± 0.22
Ours 2.55±0.07plus-or-minus2.550.07\textbf{2.55}\pm\textbf{0.07}2.55 ± 0.07 0.85±0.04plus-or-minus0.850.04\textbf{0.85}\pm\textbf{0.04}0.85 ± 0.04
Table 7: Hyperparameter Sensitivity Analysis of α3subscript𝛼3\alpha_{3}italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT.
Hyper Parameters Regularization (α3subscript𝛼3\alpha_{3}italic_α start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT)
MTE Cosine Similarity
0.01 3.26±0.25plus-or-minus3.260.25\textbf{3.26}\pm\textbf{0.25}3.26 ± 0.25 0.78±0.04plus-or-minus0.780.04\textbf{0.78}\pm\textbf{0.04}0.78 ± 0.04
0.1 3.98±0.15plus-or-minus3.980.153.98\pm 0.153.98 ± 0.15 0.78±0.10plus-or-minus0.780.100.78\pm 0.100.78 ± 0.10
1 5.64±0.89plus-or-minus5.640.895.64\pm 0.895.64 ± 0.89 0.32±0.03plus-or-minus0.320.030.32\pm 0.030.32 ± 0.03
2 5.78±0.87plus-or-minus5.780.875.78\pm 0.875.78 ± 0.87 0.07±0.07plus-or-minus0.070.070.07\pm 0.070.07 ± 0.07
Ours 2.55±0.07plus-or-minus2.550.07\textbf{2.55}\pm\textbf{0.07}2.55 ± 0.07 0.85±0.04plus-or-minus0.850.04\textbf{0.85}\pm\textbf{0.04}0.85 ± 0.04

We conduct an ablation study to examine how varying the percentage of observed ground truth (GT) pixels influences the performance of our method. We systematically vary the amount of GT pixels used to supervise the network training and evaluate the impact on performance metrics. The results are displayed in Table 8 and provide insightful trends. This table provides information on the impact of varying proportions (20%, 40%, 60%, 80%, and 100%) of GT pixels on MTE and cosine similarity. Data is presented as mean ±plus-or-minus\pm± standard deviation. Analyzing the content of the table, we can observe trends from two dimensions: MTE and cosine similarity. Overall, 100% GT supervision yields the best results, evidenced by the decrease in MTE and the increase in cosine similarity.

Table 8: Ablation Study on the Percentage of GT Pixels.
Percentage of GT Pixels MTE Cosine Similarity
20% 2.94±0.19plus-or-minus2.940.192.94\pm 0.192.94 ± 0.19 0.79±0.04plus-or-minus0.790.040.79\pm 0.040.79 ± 0.04
40% 2.83±0.03plus-or-minus2.830.032.83\pm 0.032.83 ± 0.03 0.80±0.06plus-or-minus0.800.060.80\pm 0.060.80 ± 0.06
60% 2.81±0.24plus-or-minus2.810.242.81\pm 0.242.81 ± 0.24 0.80±0.03plus-or-minus0.800.030.80\pm 0.030.80 ± 0.03
80% 2.75±0.02plus-or-minus2.750.022.75\pm 0.022.75 ± 0.02 0.82±0.05plus-or-minus0.820.050.82\pm 0.050.82 ± 0.05
100% 2.55±0.07plus-or-minus2.550.07\textbf{2.55}\pm\textbf{0.07}2.55 ± 0.07 0.85±0.04plus-or-minus0.850.04\textbf{0.85}\pm\textbf{0.04}0.85 ± 0.04
Refer to caption
Figure 9: Additional visual results of myocardial motion tracking for the LV, LA, and right atrium (RA) (as indicated by the yellow arrow) across diverse individuals in 2DE/3DE video datasets: data of volunteers 5-8 from 3D datasets and 9-12 from multi-view 2DE datasets. Additional results are available on our project’s GitHub repository.
Table 9: Comparison of running times (in minutes) for motion estimation of an individual on the STRAUS dataset.
Methods Pre-Training Motion Estimation
VoxelMorph [31] 54.55 0.24
Co-AttentionSTN [30] 94.13 0.08
Ours (1 GPU) 0 8.58
Ours (4 GPUs) 0 3.83
Refer to caption
Figure 10: The steps to transfer a point on the LV endocardial surface to the 17 American Heart Association (AHA) segments. (a) The process begins with mapping the points on the endocardial surface onto the short-axis plane. (b) Next, we correct this mapping by aligning the apical point to the center and arranging the points uniformly on a disk with a radius of 1. Lastly, we partition the points on the resulting disk in accordance with the AHA segment’s specific disk definition.
Refer to caption
Figure 11: The circumferential, radial, and longitudinal strain curves across the complete cardiac cycle for three volunteers.

4.4 Results on Myocardial Strain Computation

Myocardial strain is a critical parameter in heart biomechanics and can be computed using Lagrangian strain [54]. The process of deriving Lagrangian strain begins with defining the deformation gradient tensor, denoted as 𝐅𝐅\mathbf{F}bold_F, to characterize the displacement of points within the heart during deformation. The expression for the heart deformation gradient tensor 𝐅𝐅\mathbf{F}bold_F is given by,

𝐅=𝟏+𝐦tt+1(Xt,t),𝐅1subscript𝐦𝑡𝑡1subscript𝑋𝑡𝑡\mathbf{F}=\mathbf{1}+\nabla\vec{\mathbf{m}}_{t\rightarrow t+1}(\vec{X}_{t},t),bold_F = bold_1 + ∇ over→ start_ARG bold_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) , (13)

where 𝟏1\mathbf{1}bold_1 represents the identity tensor, and 𝐦tt+1subscript𝐦𝑡𝑡1\nabla\vec{\mathbf{m}}_{t\rightarrow t+1}∇ over→ start_ARG bold_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT is the forward motion field gradient. Our method has significant advantages for calculating the forward motion field gradient because the network’s input is the coordinate and the output is the motion. This setup allows for the straightforward calculation of the motion field gradient using the neural network’s automatic differentiation mechanism. This gradient can be further expressed as,

𝐦tt+1(Xt,t)=[mxxmxymxzmyxmyymyzmzxmzymzz],subscript𝐦𝑡𝑡1subscript𝑋𝑡𝑡matrixsubscript𝑚𝑥𝑥subscript𝑚𝑥𝑦subscript𝑚𝑥𝑧subscript𝑚𝑦𝑥subscript𝑚𝑦𝑦subscript𝑚𝑦𝑧subscript𝑚𝑧𝑥subscript𝑚𝑧𝑦subscript𝑚𝑧𝑧\nabla\vec{\mathbf{m}}_{t\rightarrow t+1}(\vec{X}_{t},t)=\begin{bmatrix}\frac{% \partial m_{x}}{\partial x}&\frac{\partial m_{x}}{\partial y}&\frac{\partial m% _{x}}{\partial z}\\ \frac{\partial m_{y}}{\partial x}&\frac{\partial m_{y}}{\partial y}&\frac{% \partial m_{y}}{\partial z}\\ \frac{\partial m_{z}}{\partial x}&\frac{\partial m_{z}}{\partial y}&\frac{% \partial m_{z}}{\partial z}\end{bmatrix},∇ over→ start_ARG bold_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) = [ start_ARG start_ROW start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_x end_ARG end_CELL start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_y end_ARG end_CELL start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_z end_ARG end_CELL end_ROW start_ROW start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_x end_ARG end_CELL start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_y end_ARG end_CELL start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_z end_ARG end_CELL end_ROW start_ROW start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_x end_ARG end_CELL start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_y end_ARG end_CELL start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_z end_ARG end_CELL end_ROW end_ARG ] , (14)

where the components mxsubscript𝑚𝑥m_{x}italic_m start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, mysubscript𝑚𝑦m_{y}italic_m start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT, and mzsubscript𝑚𝑧m_{z}italic_m start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT represent the components of the forward motion mtt+1(Xt,t)subscript𝑚𝑡𝑡1subscript𝑋𝑡𝑡\vec{{m}}_{t\rightarrow t+1}(\vec{X}_{t},t)over→ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_t → italic_t + 1 end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) in the x𝑥xitalic_x, y𝑦yitalic_y, and z𝑧zitalic_z directions, respectively. Next, we arrive at the Lagrangian strain tensor 𝐄𝐄\mathbf{E}bold_E, computed at point Xtsubscript𝑋𝑡\vec{X}_{t}over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT at time t𝑡titalic_t,

𝐄(Xt,t)=12(𝐅𝐅𝐈).𝐄subscript𝑋𝑡𝑡12superscript𝐅top𝐅𝐈\mathbf{E}(\vec{X}_{t},t)=\frac{1}{2}(\mathbf{F^{\top}}\mathbf{F}-\mathbf{I}).bold_E ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( bold_F start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_F - bold_I ) . (15)

To calculate the strain in a specific direction d𝑑\vec{d}over→ start_ARG italic_d end_ARG of the LV, we project the strain in that direction using,

𝐄d(Xt,t)=d𝐄(Xt,t)d,subscript𝐄𝑑subscript𝑋𝑡𝑡superscript𝑑top𝐄subscript𝑋𝑡𝑡𝑑\mathbf{E}_{\vec{d}}(\vec{X}_{t},t)=\vec{d}^{\top}\mathbf{E}(\vec{X}_{t},t)% \vec{d},bold_E start_POSTSUBSCRIPT over→ start_ARG italic_d end_ARG end_POSTSUBSCRIPT ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) = over→ start_ARG italic_d end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_E ( over→ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) over→ start_ARG italic_d end_ARG , (16)

where d𝑑\vec{d}over→ start_ARG italic_d end_ARG is a 3×1313\times 13 × 1 directional vector.

In our implementation, the longitudinal direction is dl=[0,0,1]Tsubscript𝑑𝑙superscript001𝑇\vec{d}_{l}=\left[0,0,1\right]^{T}over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = [ 0 , 0 , 1 ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. To compute the radial direction drsubscript𝑑𝑟\vec{d}_{r}over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, we first determine the normalized direction desubscript𝑑𝑒\vec{d}_{e}over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT of each point in the LV myocardium to its center, then deduce the radial direction as dr=de(dedl)dlsubscript𝑑𝑟subscript𝑑𝑒subscript𝑑𝑒subscript𝑑𝑙subscript𝑑𝑙\vec{d}_{r}=\vec{d}_{e}-(\vec{d}_{e}\cdot\vec{d}_{l})\vec{d}_{l}over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT - ( over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT. Lastly, the circumferential direction dcsubscript𝑑𝑐\vec{d}_{c}over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT is computed as the cross product of the previous two vectors, i.e., dc=dl×drsubscript𝑑𝑐subscript𝑑𝑙subscript𝑑𝑟\vec{d}_{c}=\vec{d}_{l}\times\vec{d}_{r}over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT × over→ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT.

In the study of myocardial mechanics, a precise visualization of strain across different regions of the heart is essential. To this end, we present a method to visualize the strain across 17 American Heart Association (AHA) segments. This approach comprises the following steps:

  1. 1.

    Sampling Points: First, we sample points on the LV endocardial surface and extract their 3D coordinates. To achieve this, we uniformly slice the 3D heart into 30 2D images at the end-diastolic (ED) phase. These slices are performed along the long axis of the 3D heart.

  2. 2.

    Segmentation of Endocardial Area: Subsequently, we segment the endocardial area of the LV in each slice and determine the 3D position of that region.

  3. 3.

    Classification into AHA Segments: Following the approach proposed by [54], we classify the points on the endocardial surface into 17 AHA segments. The specific steps shown in Fig. 10 include: a). Mapping the points on the endocardial surface onto the short-axis plane. b). Correcting the mapping by aligning the apical point to the center. c). Arranging the points uniformly on a disk with a radius of 1. d). Partitioning the points on the resulting disk according to the disk definition of the AHA segment. e). Obtaining the 3D coordinates of the points in each AHA region on the disk.

  4. 4.

    Visualizing Strain Curves: We visualize strain curves for each anatomical direction. We accomplish this by averaging the strain values of all points in each AHA segment, thereby obtaining the global strain in that segment. The results are then plotted as a function of time.

This analysis is performed on a subset of three patients. The results of the global strain curve in three directions are shown in Fig. 11.

4.5 Complexity

Table 9 presents the running time of our method in comparison to VoxelMorph and Co-AttentionSTN using a single Nvidia A100 GPU with 40GB memory. Despite the longer training periods often associated with supervised learning techniques, they tend to offer shorter inference times per subject. In contrast, our approach may result in longer per-subject testing times.

Our model training involves 10,000 iterations, with 8,192 points sampled in each iteration. To optimize processing efficiency, we employ distributed data parallel (DDP) to distribute these 8,192 points across multiple GPUs. This strategy allows each GPU to process a smaller portion of the total points simultaneously, significantly reducing the processing time required for each training step. We conduct an experiment to train our model using four NVIDIA 3090 GPUs. As shown in Table 9, this approach reduced the motion estimation time to 3.83 minutes.

5 Discussion and Conclusion

The diagnostic, prognostic, and preventive measures associated with CVDs remain essential clinical concerns. Advances in cardiac motion tracking technologies, such as CMR feature tracking and speckle-tracking echocardiography have made deformation imaging a viable, reliable, and valuable clinical tool. Global or segmented strain measurement can objectively quantify myocardial deformation and provide a novel method for characterizing myocardial function, which is replacing the conventional heart function evaluation parameters. Given the ubiquity of echocardiography as a bedside imaging modality, its precision in monitoring myocardial movement is vital for identifying abnormalities in the heart. In this context, we introduce the Neural Cardiac Motion Field (NeuralCMF), which is a self-supervised algorithm for modelling cardiac structures and delineating myocardial dynamics.

Traditional approaches, whether speckle-based or surface-based, face intrinsic limitations. The former faces challenges in tracking extended temporal frames, while the latter must contend with interpolative inaccuracies, especially evident in non-uniform myocardium movement cases. NeuralCMF overcomes these challenges, providing a holistic view of cardiac motion rather than a frame-by-frame analysis. The burgeoning field of deep learning, while transformative, presents challenges, especially in data dependency. However, our NeuralCMF does not need extensive paired datasets. This is a crucial distinction from supervised approaches, as it allows our model to learn directly from the data, adhering to the inherent physical laws governing cardiac motion. The self-supervised nature of our approach depends on these universal physical principles, which all hearts follow, thus ensuring its generalizability. Every human heart exhibits similar fundamental physiological patterns and motions. By aligning our model with these principles, we enable it to predict and understand cardiac dynamics across different individuals. We have rigorously tested our method on 127 individuals, including both healthy subjects and patients with CHD. The fact that our method performs consistently across such a heterogeneous group demonstrates its potential for wide clinical application and its reliability across a broad spectrum of cardiac conditions. We provide additional results of myocardial motion tracking for LV, LA, and RA across diverse individuals in Fig. 9.

One of the remarkable features of NeuralCMF is its adaptability. The model accurately estimates myocardial motion when interpreting multi-view 2DE or 3DE videos or dealing with varied resolutions. Such adaptability, particularly its capacity to integrate with different imaging modalities, makes it a valuable tool for researchers and clinicians. The process of reconstructing 3D motion patterns from multiple 2D images is not only challenging but also immensely practical, especially important in settings like emergency departments, intensive care units, outpatient and preoperative clinics, as well as in medically underserved regions ranging from rural areas to low- and middle-income countries, extending to manned space missions. This methodology allows for the utilization of cost-effective handheld 2DE probes for capturing images from diverse angles. Subsequently, these images can be synthesized to reconstruct the 3D motion of the heart, thereby revealing its dynamic functions with high precision. We provide additional results of multi-view 2DE dataset in Fig. 9 that underscore the efficacy of our motion estimation approach. These results illustrate that our methodology is capable of precisely reconstructing the 3D structure of the heart, alongside capturing the motion patterns.

In this study, we demonstrate the capability to classify CHD by employing 3D motion data extracted from 3D datasets. Looking ahead, we plan to build a model that utilize the 3D motion data extracted from multi-view 2DE to detect CHD. This model aims to be perfectly compatible with handheld 2D ultrasound devices, facilitating rapid and reliable screening for coronary heart disease. Such advancements hold the promise of revolutionizing the approach to cardiac care, especially in underserved regions, by making high-quality diagnostic procedures more accessible and efficient. Although our research primarily concentrates on cardiac imaging, the NeuralCMF framework—holds significant potential for broader applications. Specifically, this technique could be extended to the study of other organs and different imaging modalities, suggesting a wide range of future research directions.

The current computational time per subject presents a limitation for our proposed method, particularly as the study population increases. However, this issue can be effectively addressed through several strategies, categorized into hardware acceleration and algorithmic improvements. For hardware acceleration, we plan to adopt NVIDIA’s tiny-cuda-nn framework instead of PyTorch can drastically reduce processing time, as demonstrated by the Instant-NGP [52]. Additionally, we intend to implement a distributed strategy across multiple GPUs, allowing each GPU to process a smaller portion of the data simultaneously, thereby decreasing the training time per iteration. On the algorithmic front, we aim to leverage meta-learning [59] to learn an optimal initialization of model parameters from a large heart dataset. This approach can enhance both optimization speed and accuracy. Together, these proposed adjustments are expected to effectively reduce overall computational demands, thereby enhancing the feasibility of our method for practical applications in clinical and real-world settings.

References

  • [1] A. N. Nowbar, M. Gitto, J. P. Howard, D. P. Francis, and R. Al-Lamee, “Mortality from ischemic heart disease: Analysis of data from the world health organization and coronary artery disease risk factors from ncd risk factor collaboration,” Circulation: cardiovascular quality and outcomes, vol. 12, no. 6, p. e005375, 2019.
  • [2] C. W. Tsao, A. W. Aday, Z. I. Almarzooq, A. Alonso, A. Z. Beaton, M. S. Bittencourt, A. K. Boehme, A. E. Buxton, A. P. Carson, Y. Commodore-Mensah, et al., “Heart disease and stroke statistics—2022 update: a report from the american heart association,” Circulation, vol. 145, no. 8, pp. e153–e639, 2022.
  • [3] K. W. Zhang, B. S. Finkelman, G. Gulati, H. K. Narayan, J. Upshaw, V. Narayan, T. Plappert, V. Englefield, A. M. Smith, C. Zhang, et al., “Abnormalities in 3-dimensional left ventricular mechanics with anthracycline chemotherapy are associated with systolic and diastolic dysfunction,” JACC: Cardiovascular Imaging, vol. 11, no. 8, pp. 1059–1068, 2018.
  • [4] M. J. van Mourik, D. V. Zaar, M. W. Smulders, J. Heijman, J. Lumens, J. E. Dokter, V. L. Passos, S. Schalla, C. Knackstedt, G. Schummers, et al., “Adding speckle-tracking echocardiography to visual assessment of systolic wall motion abnormalities improves the detection of myocardial infarction,” Journal of the American Society of Echocardiography, vol. 32, no. 1, pp. 65–73, 2019.
  • [5] A. Alashi, A. Mentias, A. Abdallah, K. Feng, A. M. Gillinov, L. L. Rodriguez, D. R. Johnston, L. G. Svensson, Z. B. Popovic, B. P. Griffin, et al., “Incremental prognostic utility of left ventricular global longitudinal strain in asymptomatic patients with significant chronic aortic regurgitation and preserved left ventricular ejection fraction,” JACC: Cardiovascular Imaging, vol. 11, no. 5, pp. 673–682, 2018.
  • [6] H. M. Kim, G.-Y. Cho, I.-C. Hwang, H.-M. Choi, J.-B. Park, Y. E. Yoon, and H.-K. Kim, “Myocardial strain in prediction of outcomes after surgery for severe mitral regurgitation,” JACC: Cardiovascular Imaging, vol. 11, no. 9, pp. 1235–1244, 2018.
  • [7] D. Morris, E. Belyavskiy, R. Aravind-Kumar, M. Kropf, A. Frydas, K. Braunauer, E. Marquez, M. Krisper, R. Lindhorst, E. Osmanoglou, et al., “Potential usefulness and clinical relevance of adding left atrial strain to left atrial volume index in the detection of left ventricular diastolic dysfunction. jacc cardiovasc imaging. 2018; 11 (10): 1405–15,” J. JCMG, vol. 29, 2017.
  • [8] J. J. Park, J.-B. Park, J.-H. Park, and G.-Y. Cho, “Global longitudinal strain to predict mortality in patients with acute heart failure,” Journal of the American College of Cardiology, vol. 71, no. 18, pp. 1947–1957, 2018.
  • [9] N. Risum, C. Jons, N. T. Olsen, T. Fritz-Hansen, N. E. Bruun, M. V. Hojgaard, N. Valeur, M. B. Kronborg, J. Kisslo, and P. Sogaard, “Simple regional strain pattern analysis to predict response to cardiac resynchronization therapy: rationale, initial results, and advantages,” American Heart Journal, vol. 163, no. 4, pp. 697–704, 2012.
  • [10] N. M. Fine, L. Chen, P. M. Bastiansen, R. P. Frantz, P. A. Pellikka, J. K. Oh, and G. C. Kane, “Outcome prediction by quantitative right ventricular function assessment in 575 subjects evaluated for pulmonary hypertension,” Circulation: Cardiovascular Imaging, vol. 6, no. 5, pp. 711–721, 2013.
  • [11] S. M. Chowdhury, Z. M. Hijazi, J. T. Fahey, J. F. Rhodes, S. Kar, R. Makkar, M. Mullen, Q.-L. Cao, and G. S. Shirali, “Speckle-tracking echocardiographic measures of right ventricular function correlate with improvement in exercise function after percutaneous pulmonary valve implantation,” Journal of the American Society of Echocardiography, vol. 28, no. 9, pp. 1036–1044, 2015.
  • [12] M. Leung, P. J. van Rosendael, R. Abou, N. Ajmone Marsan, D. Y. Leung, V. Delgado, and J. J. Bax, “Left atrial function to identify patients with atrial fibrillation at high risk of stroke: new insights from a large registry,” European heart journal, vol. 39, no. 16, pp. 1416–1425, 2018.
  • [13] H. Motoki, K. Negishi, K. Kusunose, Z. B. Popović, M. Bhargava, O. M. Wazni, W. I. Saliba, M. K. Chung, T. H. Marwick, and A. L. Klein, “Global left atrial strain in the prediction of sinus rhythm maintenance after catheter ablation for atrial fibrillation,” Journal of the American Society of Echocardiography, vol. 27, no. 11, pp. 1184–1192, 2014.
  • [14] J. R. Braga, H. Leong-Poi, V. E. Rac, P. C. Austin, H. J. Ross, and D. S. Lee, “Trends in the use of cardiac imaging for patients with heart failure in canada,” JAMA network open, vol. 2, no. 8, pp. e198766–e198766, 2019.
  • [15] S. Liu, Y. Wang, X. Yang, B. Lei, L. Liu, S. X. Li, D. Ni, and T. Wang, “Deep learning in medical ultrasound analysis: a review,” Engineering, vol. 5, no. 2, pp. 261–275, 2019.
  • [16] J.-U. Voigt and M. Cvijic, “2-and 3-dimensional myocardial strain in cardiac health and disease,” JACC: Cardiovascular Imaging, vol. 12, no. 9, pp. 1849–1863, 2019.
  • [17] S. Mondillo, M. Galderisi, D. Mele, M. Cameli, V. S. Lomoriello, V. Zacà, P. Ballo, A. D’Andrea, D. Muraru, M. Losi, et al., “Speckle-tracking echocardiography: a new technique for assessing myocardial function,” Journal of Ultrasound in Medicine, vol. 30, no. 1, pp. 71–83, 2011.
  • [18] M. A. Lubinski, S. Y. Emelianov, and M. O’Donnell, “Speckle tracking methods for ultrasonic elasticity imaging using short-time correlation,” IEEE transactions on ultrasonics, ferroelectrics, and frequency control, vol. 46, no. 1, pp. 82–96, 1999.
  • [19] X. Chen, H. Xie, R. Erkamp, K. Kim, C. Jia, J. Rubin, and M. O’Donnell, “3-d correlation-based speckle tracking,” Ultrasonic Imaging, vol. 27, no. 1, pp. 21–36, 2005.
  • [20] X. Papademetris, A. J. Sinusas, D. P. Dione, R. T. Constable, and J. S. Duncan, “Estimation of 3-d left ventricular deformation from medical images using biomechanical models,” IEEE transactions on medical imaging, vol. 21, no. 7, pp. 786–800, 2002.
  • [21] N. Parajuli, A. Lu, K. Ta, J. Stendahl, N. Boutagy, I. Alkhalil, M. Eberle, G.-S. Jeng, M. Zontak, M. O’Donnell, et al., “Flow network tracking for spatiotemporal and periodic point matching: Applied to cardiac motion analysis,” Medical image analysis, vol. 55, pp. 116–135, 2019.
  • [22] X. Huang, D. P. Dione, C. B. Compas, X. Papademetris, B. A. Lin, A. Bregasi, A. J. Sinusas, L. H. Staib, and J. S. Duncan, “Contour tracking in echocardiographic sequences via sparse representation and dictionary learning,” Medical image analysis, vol. 18, no. 2, pp. 253–271, 2014.
  • [23] N. Parajuli, A. Lu, J. C. Stendahl, M. Zontak, N. Boutagy, M. Eberle, I. Alkhalil, M. O’Donnell, A. J. Sinusas, and J. S. Duncan, “Integrated dynamic shape tracking and rf speckle tracking for cardiac motion analysis,” in MICCAI, pp. 431–438, Springer, 2016.
  • [24] P. Shi, A. J. Sinusas, R. T. Constable, E. Ritman, and J. S. Duncan, “Point-tracked quantitative analysis of left ventricular surface motion from 3-d image sequences,” IEEE transactions on medical imaging, vol. 19, no. 1, pp. 36–50, 2000.
  • [25] N. Lin and J. S. Duncan, “Generalized robust point matching using an extended free-form deformation model: application to cardiac images,” in 2004 2nd IEEE International Symposium on Biomedical Imaging: Nano to Macro (IEEE Cat No. 04EX821), pp. 320–323, IEEE, 2004.
  • [26] R. Blankstein and A. H. Waller, “Evaluation of known or suspected cardiac sarcoidosis,” Circulation: Cardiovascular Imaging, vol. 9, no. 3, p. e000867, 2016.
  • [27] A. Ichinose, H. Otani, M. Oikawa, K. Takase, H. Saito, H. Shimokawa, and S. Takahashi, “Mri of cardiac sarcoidosis: basal and subepicardial localization of myocardial lesions and their effect on left ventricular function,” American Journal of Roentgenology, vol. 191, no. 3, pp. 862–869, 2008.
  • [28] S. S. Ahn, K. Ta, A. Lu, J. C. Stendahl, A. J. Sinusas, and J. S. Duncan, “Unsupervised motion tracking of left ventricle in echocardiography,” in Medical imaging 2020: Ultrasonic imaging and tomography, vol. 11319, pp. 196–202, SPIE, 2020.
  • [29] K. Ta, S. S. Ahn, J. C. Stendahl, A. J. Sinusas, and J. S. Duncan, “A semi-supervised joint network for simultaneous left ventricular motion tracking and segmentation in 4d echocardiography,” in MICCAI, pp. 468–477, Springer, 2020.
  • [30] S. S. Ahn, K. Ta, S. L. Thorn, J. A. Onofrey, I. H. Melvinsdottir, S. Lee, J. Langdon, A. J. Sinusas, and J. S. Duncan, “Co-attention spatial transformer network for unsupervised motion tracking and cardiac strain analysis in 3d echocardiography,” Medical Image Analysis, vol. 84, p. 102711, 2023.
  • [31] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, “Voxelmorph: a learning framework for deformable medical image registration,” IEEE transactions on medical imaging, vol. 38, no. 8, pp. 1788–1800, 2019.
  • [32] M. Alessandrini, M. De Craene, O. Bernard, S. Giffard-Roisin, P. Allain, I. Waechter-Stehle, J. Weese, E. Saloux, H. Delingette, M. Sermesant, et al., “A pipeline for the generation of realistic 3d synthetic echocardiographic sequences: Methodology and open-access database,” IEEE transactions on medical imaging, vol. 34, no. 7, pp. 1436–1451, 2015.
  • [33] P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611, pp. 586–606, Spie, 1992.
  • [34] H. Chui and A. Rangarajan, “A new point matching algorithm for non-rigid registration,” Computer Vision and Image Understanding, vol. 89, no. 2-3, pp. 114–141, 2003.
  • [35] M.-M. Rohé, M. Datar, T. Heimann, M. Sermesant, and X. Pennec, “Svf-net: learning deformable image registration using shape matching,” in MICCAI, pp. 266–274, Springer, 2017.
  • [36] J. Wu, T. R. Mazur, S. Ruan, C. Lian, N. Daniel, H. Lashmett, L. Ochoa, I. Zoberi, M. A. Anastasio, H. M. Gach, et al., “A deep boltzmann machine-driven level set method for heart motion tracking using cine mri images,” Medical image analysis, vol. 47, pp. 68–80, 2018.
  • [37] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, pp. 234–241, Springer, 2015.
  • [38] H. Yu, S. Sun, H. Yu, X. Chen, H. Shi, T. S. Huang, and T. Chen, “Foal: Fast online adaptive learning for cardiac motion estimation,” in IEEE CVPR, pp. 4313–4323, 2020.
  • [39] H. Yu, X. Chen, H. Shi, T. Chen, T. S. Huang, and S. Sun, “Motion pyramid networks for accurate and efficient cardiac motion estimation,” in MICCAI, pp. 436–446, Springer, 2020.
  • [40] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, “Implicit neural representations with periodic activation functions,” NeurIPS, vol. 33, pp. 7462–7473, 2020.
  • [41] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
  • [42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE CVPR, pp. 770–778, 2016.
  • [43] J. Zhu, H. Zhu, Q. Zhang, F. Zhu, Z. Ma, and X. Cao, “Pyramid nerf: Frequency guided fast radiance field optimization,” International Journal of Computer Vision, pp. 1–16, 2023.
  • [44] R. Liu, Y. Sun, J. Zhu, L. Tian, and U. S. Kamilov, “Recovery of continuous 3d refractive index maps from discrete intensity-only measurements using neural fields,” Nature Machine Intelligence, vol. 4, no. 9, pp. 781–791, 2022.
  • [45] H. Zhu, Z. Liu, Y. Zhou, Z. Ma, and X. Cao, “Dnf: diffractive neural field for lensless microscopic imaging,” Optics Express, vol. 30, no. 11, pp. 18168–18178, 2022.
  • [46] Y. Chen, L. Lu, G. E. Karniadakis, and L. Dal Negro, “Physics-informed neural networks for inverse problems in nano-optics and metamaterials,” Optics express, vol. 28, no. 8, pp. 11618–11633, 2020.
  • [47] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, “Physics-informed machine learning,” Nature Reviews Physics, vol. 3, no. 6, pp. 422–440, 2021.
  • [48] L. Shen, J. Pauly, and L. Xing, “Nerp: implicit neural representation learning with prior embedding for sparsely sampled image reconstruction,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [49] R. Zha, Y. Zhang, and H. Li, “Naf: neural attenuation fields for sparse-view cbct reconstruction,” in MICCAI, pp. 442–452, Springer, 2022.
  • [50] Q. Wu, Y. Li, Y. Sun, Y. Zhou, H. Wei, J. Yu, and Y. Zhang, “An arbitrary scale super-resolution approach for 3d mr images via implicit neural representation,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 2, pp. 1004–1015, 2022.
  • [51] C. Shen, H. Zhu, Y. Zhou, Y. Liu, S. Yi, L. Dong, W. Zhao, D. Brady, X. Cao, Z. Ma, and Y. Lin, “Cardiacfield: Computational echocardiography for universal screening,” Research Square, 2023.
  • [52] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics (ToG), vol. 41, no. 4, pp. 1–15, 2022.
  • [53] H. Zhu, S. Xie, Z. Liu, F. Liu, Q. Zhang, Y. Zhou, Y. Lin, Z. Ma, and X. Cao, “Disorder-invariant implicit neural representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [54] M. De Craene, G. Piella, O. Camara, N. Duchateau, E. Silva, A. Doltra, J. D’hooge, J. Brugada, M. Sitges, and A. F. Frangi, “Temporal diffeomorphic free-form deformation: Application to motion and strain estimation from 3d echocardiography,” Medical image analysis, vol. 16, no. 2, pp. 427–450, 2012.
  • [55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [56] E. Ostenfeld, M. Carlsson, K. Shahgaldi, A. Roijer, and J. Holm, “Manual correction of semi-automatic three-dimensional echocardiography is needed for right ventricular assessment in adults; validation with cardiac magnetic resonance,” Cardiovascular Ultrasound, vol. 10, pp. 1–10, 2012.
  • [57] Y. Seo, T. Ishizu, M. Ieda, and N. Ohte, “Right ventricular three-dimensional echocardiography: the current status and future perspectives,” Journal of Echocardiography, vol. 18, pp. 149–159, 2020.
  • [58] J. Krebs, H. Delingette, B. Mailhé, N. Ayache, and T. Mansi, “Learning a probabilistic model for diffeomorphic registration,” IEEE transactions on medical imaging, vol. 38, no. 9, pp. 2165–2176, 2019.
  • [59] M. Tancik, B. Mildenhall, T. Wang, D. Schmidt, P. P. Srinivasan, J. T. Barron, and R. Ng, “Learned initializations for optimizing coordinate-based neural representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2846–2855, 2021.