
Solving deep-learning density functional theory via variational autoencoders


Published 18 July 2024 © 2024 The Author(s). Published by IOP Publishing Ltd
Focus on Quantum Chemistry and Artificial Intelligence – learning from each other
Citation: Emanuele Costa et al 2024 Mach. Learn.: Sci. Technol. 5 035015
DOI: 10.1088/2632-2153/ad611f


Abstract

In recent years, machine learning models, chiefly deep neural networks, have proven well suited to learning accurate energy-density functionals from data. However, problematic instabilities have been shown to occur in the search of ground-state density profiles via energy minimization. Indeed, even small amounts of noise can lead the search astray from realistic profiles, causing the failure of the learned functional and, hence, strong violations of the variational property. In this article, we employ variational autoencoders (VAEs) to build a compressed, flexible, and regular representation of the ground-state density profiles of various quantum models. Performing energy minimization in this compressed space allows us to avoid both numerical instabilities and variational biases due to excessive constraints. Our tests are performed on one-dimensional single-particle models from the literature in the field and, notably, on a three-dimensional disordered potential. In all cases, the ground-state energies are estimated with errors below the chemical accuracy and the density profiles are accurately reproduced without numerical artifacts. Furthermore, we show that it is possible to perform transfer learning, applying pre-trained VAEs to different potentials.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

For a few decades now, density functional theory (DFT) has arguably been the most popular and effective simulation technique for solid-state systems and for chemical compounds [1, 2]. It allows scientists to predict electronic properties at a feasible computational cost, in particular in its orbital-free implementation [3, 4]. However, the exact form of the universal energy-density functional is unknown, and the available approximations often fail in the presence of strong electron correlations [5].

In recent years, machine learning (ML) techniques have been introduced in the framework of DFT [6], addressing continuous-space systems [7-9], as well as lattice [10, 11] and spin models [12]. The main goal is to learn from data more reliable energy-density functionals, potentially adequate also for strongly correlated systems. The envisioned strategy consists of training ML models, e.g. deep neural networks (NNs), exploiting datasets of ground-state energies and density profiles generated via an accurate but computationally expensive method. In principle, this would then allow one to determine the ground-state properties of novel system instances by simply minimizing the deep-learning (DL) functional, leading to a substantial reduction of computational cost. Unfortunately, in actual implementations of energy minimization, severe problems have emerged, in particular when exploiting gradient-descent methods [6, 8, 13-16]. Indeed, even minuscule inaccuracies of the functional get amplified. This leads to the formation of noisy density profiles that cannot be properly processed by the DL functionals, causing large errors and violations of the variational property. A few strategies to circumvent these instabilities have already been proposed. They focus mostly on reducing noise in the gradient via dimensionality reduction [13] or basis truncation [17], on constraining the optimization, on training functionals with derivatives data [14, 18, 19], or on implementing tailored NN architectures that lead to more stable derivatives [15]. These studies addressed mostly low-dimensional single-particle models with random potentials, since these allow one to analyze the above instabilities while nimbly creating data for training and testing due to the affordable computational cost.

This article presents an alternative, particularly versatile, strategy to solve DL-DFTs. It exploits DL also to automatically guide the energy minimization, not only to learn the energy-density functional. Specifically, this strategy involves learning a compressed encoding of virtually all realistic profiles using a variational auto-encoder (VAE). This is achieved by training the VAE to accurately reproduce a dataset of exact ground-state densities. Then, this VAE is combined with a separate NN that maps the density to the corresponding ground-state energy. Finally, automatic differentiation is exploited to perform gradient-descent minimization in the encoded space of the VAE. The overall approach is schematically represented in figure 1. As we numerically demonstrate, our strategy allows exploring realistic profiles, without introducing constraints that prevent the descent from reaching the actual ground state, and at the same time it avoids instabilities and violations of the variational property.


Figure 1. Schematic representation of the $\beta$-VAE, of the DL functional, and of the combination of these architectures to perform a stable minimization of the DL functional. The optimization is performed in the latent space of the VAE such that only the manifold of actual ground-state densities is explored.


Our investigation focuses on three testbed models. The first two are one-dimensional (1D) single-particle models taken from the previous literature on this topic [14, 15]. Notably, the third is a challenging three-dimensional (3D) potential formed via multiple Gaussian bumps with random centers and amplitudes. In all cases, the VAE allows us to accurately solve the DL-DFT, reaching ground-state energies with errors below the chemical accuracy and with accurate density profiles, free of numerical artifacts. We emphasize that the problem addressed here is the instability encountered in the energy minimization. As in all DL-based DFTs, training data must be gathered. However, considering two 3D testbeds with modified features with respect to the training data, we demonstrate that transfer learning is in principle possible, meaning that novel families of potentials can be tackled without necessarily retraining the VAE. To favor future investigations on DL-DFTs for 3D systems, we provide a large dataset of ground-state energies and densities of the 3D Gaussian model at the repository [20].

The rest of the article is organized as follows: section 2 describes the DL-DFT approach, the NN used to learn the energy-density functional, and the VAE used to encode the density profiles. It also explains how to perform energy minimization in the encoded space of the VAE making use of automatic differentiation. In section 3, our three testbed models are described. The results on density encoding and, chiefly, on solving DL-DFT via energy minimization are reported in section 4. In section 5, a summary of our main findings is reported.

2. VAE for DFT

We develop and test the DL-DFT method [8, 9, 19, 21] addressing single-particle Hamiltonians, defined as

$\hat{H} = -\frac{\hbar^2}{2m} \nabla^2 + V(x) \qquad (1)$

where $\hbar$ is the reduced Planck constant, m is the particle mass, V(x) is the external potential, and with x we denote the particle coordinate. Our focus is on 1D and on more challenging 3D models. The chosen energy unit is $E_0 = \frac{\hbar^2}{ma_0^2}$, where a0 is the length unit. If m is identified with the electron mass and a0 with the Bohr radius, the energy unit corresponds to the Hartree energy.

The aim of DFT is to map the ground-state density profile, which in the single-particle case is computed as

$n_{\mathrm{gs}}(x) = \left|\psi_{\mathrm{gs}}(x)\right|^2 \qquad (2)$

where $\psi_{{\mathrm{gs}}}(x)$ is the ground-state wave function, to the ground-state energy $e_{{\mathrm{gs}}}$. The first Hohenberg-Kohn theorem guarantees the existence of this mapping [1]. In practice, it is convenient to separate out the known potential-energy contribution, seeking the universal functional $f_{{\mathrm{gs}}} = e_{{\mathrm{gs}}}-\int \textrm{d} x \, V(x) n(x)$. In the single-particle scenario, this universal functional corresponds to the kinetic-energy functional (see the footnote at the end of the article). The second Hohenberg-Kohn theorem establishes the variational property of the energy functional $E[n] = F[n]+\int \textrm{d} x \, n(x) V(x)$:

$E[n] \geqslant e_{\mathrm{gs}} \qquad (3)$

where the equality holds when $n(x) = n_{{\mathrm{gs}}}(x)$. Many DFT studies introduce the Kohn–Sham formalism [22], which provides a suitable approximation for the kinetic-energy functional. This comes at the cost of introducing a set of orbitals, which are typically found via self-consistent iterations. Here, as in previous studies of DL-DFT approaches [6-9, 13-15, 19], we adopt the computationally cheaper orbital-free approach, where the ground-state properties of the many-body system are obtained via minimization of the energy functional $E[n]$. The long-term ambition is to obtain, via orbital-free DL approaches, accuracy comparable to, if not better than, that of Kohn–Sham schemes at a reduced computational cost.

2.1. NNs

2.1.1. VAE

VAEs [23, 24] are specific instances of autoencoders, typically used for dimensionality reduction and image generation. They are defined by two conditional probabilities. The first is the encoder conditional probability, fixed by the density profile and defined in the latent space

$p_{\phi}(\mathbf{z}|\boldsymbol{n}) = \mathcal{N}\left(\boldsymbol{\mu}_{\phi}[\boldsymbol{n}],\boldsymbol{\sigma}_{\phi}[\boldsymbol{n}]\right) \qquad (4)$

where, in 1D, the discretized density $\boldsymbol{n} = \left(n(x_1),n(x_2),\ldots ,n(x_{N_g})\right)$ is defined over a uniformly spaced grid of Ng points (the generalization to higher dimensions is straightforward). $\boldsymbol{\mu}_{\phi}$ and $\boldsymbol{\sigma}_{\phi}$ are outputs of a NN defined in $\mathbb{R}^{l_d}$, and $\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\sigma})$ indicates the multivariate normal distribution with mean $\boldsymbol{\mu} = \left( \mu_1,\mu_2,\ldots ,\mu_{l_d}\right)$ and standard deviation $\boldsymbol{\sigma} = \left(\sigma_1,\sigma_2,\ldots ,\sigma_{l_d}\right)$. The second conditional probability, namely, the decoder conditional probability, is fixed by the corresponding latent variable z and it is defined in the density profile manifold as

$p_{\theta}(\boldsymbol{n}|\mathbf{z}) = \mathcal{N}\left(D_{\theta}[\mathbf{z}],\mathbf{1}\right) \qquad (5)$

where $D_{\theta}[\mathbf{z}] \in \mathbb{R}^{N_g}$. VAEs are designed with a two-fold goal, namely, accurately reconstructing the ground-state density profiles through the compressed latent space and generating novel realistic density profiles from points sampled in the latent space. This is achieved by appropriately regularizing the latent space. To this end, the loss function is defined as follows:

$L = L_{\mathrm{rec}} + \beta L_{\mathrm{reg}} \qquad (6)$

$L_{\mathrm{rec}}$ is the reconstruction loss

$L_{\mathrm{rec}} = \frac{1}{N_{\mathrm{train}}}\sum_{k=1}^{N_{\mathrm{train}}} \mathbb{E}_{p_{\phi}(\mathbf{z}|\boldsymbol{n}_k)}\left[\left|\boldsymbol{n}_k - D_{\theta}[\mathbf{z}]\right|^2\right] \qquad (7)$

where $\mathbb{E}$ indicates the expectation value with respect to a probability distribution and $N_{\mathrm{train}}$ is the number of training instances. $L_{\mathrm{reg}}$ is the regularization loss

$L_{\mathrm{reg}} = \frac{1}{N_{\mathrm{train}}}\sum_{k=1}^{N_{\mathrm{train}}} \mathrm{KL}\left(p_{\phi}(\mathbf{z}|\boldsymbol{n}_k)\,||\,p(\mathbf{z})\right) \qquad (8)$

where $\mathrm{KL}(p||q)$ denotes the Kullback-Leibler divergence between the probability distributions p and q, and $p(\mathbf{z})$ is the prior probability, which corresponds to the standard normal distribution $\mathcal{N}(\mathbf{0},\mathbf{1})$.

For β = 1 the total loss corresponds to the evidence lower bound [23-25]. In general, β can be treated as a hyperparameter; it controls the interplay between the regularization of the latent space and the reconstruction accuracy [24, 26]. The regularization loss forces the conditional distributions $p_{\phi}(\mathbf{z}|\boldsymbol{n})$ to resemble the standard normal. In this way, the components of the latent variables can be made less entangled [24, 27, 28] and the overlap between distributions corresponding to different inputs can be increased. These effects contribute to a denser and more regular structure of the latent space, but this comes at the cost of lower reconstruction accuracy. As we show in section 4.2, β indeed plays an important role in the convergence and in the stability of the gradient-descent minimization of the DL-DFT. Another important hyperparameter is the latent space dimension ld, which determines whether all relevant information of the density profiles can be extracted. This hyperparameter must be tuned depending on the complexity of the problem, specifically, on the variability of the density profiles corresponding to different Hamiltonian instances. We analyze this effect considering different disordered potentials which lead, depending on the random parameters of the Hamiltonian, either to rather consistent or to quite variable density profiles.
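As an illustration, the following minimal sketch shows how the loss of equations (6)-(8) could be evaluated in PyTorch, assuming a mean-squared-error reconstruction term, a log-variance parameterization of the encoder output, and placeholder encoder and decoder objects; it is a sketch under these assumptions, not the exact implementation used in this work.

import torch

def beta_vae_loss(encoder, decoder, n, beta):
    # encoder returns the mean and log-variance of p_phi(z|n); decoder returns D_theta[z]
    mu, log_var = encoder(n)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps            # reparameterization trick [23]
    n_rec = decoder(z)
    # reconstruction loss (assumed mean-squared error), cf. equation (7)
    l_rec = torch.mean(torch.sum((n - n_rec) ** 2, dim=-1))
    # closed-form KL divergence to the standard normal prior, cf. equation (8)
    l_reg = torch.mean(-0.5 * torch.sum(1.0 + log_var - mu ** 2 - log_var.exp(), dim=-1))
    return l_rec + beta * l_reg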

The encoder network is composed of a series of convolutional blocks. Each block is made of a convolutional layer, a smooth activation function called Softplus [29], an average pooling operation, and a batch normalization [30]. At the end of the series of convolutional blocks, two dense heads, each made of three dense layers with 100, 50, and ld neurons and the Softplus activation function, process the output of the convolutional blocks and return $\boldsymbol{\sigma}_{\phi}[\boldsymbol{n}]$ and $\boldsymbol{\mu}_{\phi}[\boldsymbol{n}]$.

The decoder network mirrors the encoder. The latent variable is processed by a linear operation that returns the input of the subsequent convolutional blocks. The input shape is fixed such that the output of the series of convolutional blocks has the same shape as the input of the encoder. The convolutional blocks are composed of transpose convolutional layers, a Softplus activation function, and batch normalization. The last block features the identity activation function. To take into account the constraints on the density profile, namely, normalization and positivity, the output is processed by a sigmoid layer [31], followed by a normalization operation. The normalization is performed by applying the numerical integration via the sum rule $\int \textrm{d} x \, f(x) \rightarrow \Delta x \sum_{i} f(x_{i})$; the generalization to higher dimensions is straightforward. The same rule is applied to all numerical integrations, and it is verified that Ng is large enough to suppress the effect of the discretization error below the chosen target.
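A minimal sketch of this final decoder stage, assuming the sum-rule integration above with grid spacing Δx and hypothetical tensor shapes, is the following:

import torch

def constrain_density(raw_output, dx):
    # positivity: map the raw decoder output to (0, 1)
    n = torch.sigmoid(raw_output)
    # normalization: divide by the sum-rule estimate of the integral, Δx Σ_i n(x_i)
    norm = n.sum(dim=-1, keepdim=True) * dx
    return n / norm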

The training process is performed using the reparameterization trick described in [23]. The training is performed with the ADAM adaptive stochastic gradient descent optimizer [32] for 1200 epochs, with batch size 100 and a learning rate of 10−4. The training/validation split is $80 \% / 20 \%$ for all the datasets considered. Table 1 summarizes all the hyperparameters adopted for the training.

Table 1. Hyperparameters for VAE.

                        1D Gaussian    1D speckle     3D Gaussian
Epochs                  1200           1200           1200
Learning rate           10−4           10−4           10−4
Batch size              100            100            100
Latent space dim. ld    4, 8           16             32
Conv. channels          60             60             60, 120, 180
Conv. layers            5              5              3
Neurons (dense lay.)    100, 50, ld    100, 50, ld    100, 50, ld
Kernel size             13             13             3
Pooling                 2              2              2
Optimizer               ADAM           ADAM           ADAM
Act. func.              Softplus       Softplus       Softplus

2.1.2. DL-functional

The DL-functional $\widetilde{F}_{\omega} [n_{{\mathrm{gs}},k}]$ is trained to map the density profile n(x) to the corresponding universal functional value $f_{{\mathrm{gs}}}$. We consider a supervised learning approach by using a dataset $\{n_{{\mathrm{gs}},k}, f_{{\mathrm{gs}},k} \}$, where the index k labels Hamiltonian instances. The network parameters ω are optimized by minimizing the mean squared error loss function

$\mathcal{L}(\omega) = \frac{1}{N_{\mathrm{train}}}\sum_{k=1}^{N_{\mathrm{train}}} \left(\widetilde{F}_{\omega}[n_{\mathrm{gs},k}] - f_{\mathrm{gs},k}\right)^2 \qquad (9)$

The NN is composed of a series of convolutional blocks, each including a convolutional layer, a Softplus activation operation [29], and an average pooling. The output of the convolutional part is processed by a linear projector that outputs the universal functional value. As for the VAE, we again adopt the ADAM algorithm for 1200 epochs, with batch size 100 and a learning rate of 10−4. The training/validation split is $80 \% / 20 \%$ for all the datasets considered. Table 2 summarizes all training hyperparameters.

Table 2. Hyperparameters for DL-functional.

                        1D Gaussian    1D speckle     3D Gaussian
Epochs                  1200           1200           1200
Learning rate           10−4           10−4           10−4
Batch size              100            100            100
Conv. channels          60             60             60
Conv. layers            5              5              4
Kernel size             13             13             3
Pooling                 2              2              2
Optimizer               ADAM           ADAM           ADAM
Act. func.              Softplus       Softplus       Softplus
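As an illustration of the 1D architecture summarized in table 2, the following minimal PyTorch sketch maps a discretized density to a single functional value; the padding, flattening, and head details are assumptions rather than the exact implementation.

import torch.nn as nn

class DLFunctional(nn.Module):
    def __init__(self, n_blocks=5, channels=60, kernel=13, grid=256):
        super().__init__()
        blocks, in_ch = [], 1
        for _ in range(n_blocks):
            # convolutional block: convolution, Softplus activation, average pooling
            blocks += [nn.Conv1d(in_ch, channels, kernel, padding=kernel // 2),
                       nn.Softplus(),
                       nn.AvgPool1d(2)]
            in_ch = channels
        self.conv = nn.Sequential(*blocks)
        # linear projector returning the universal functional value
        self.head = nn.Linear(channels * (grid // 2 ** n_blocks), 1)

    def forward(self, n):
        h = self.conv(n.unsqueeze(1))            # (batch, channels, reduced grid)
        return self.head(h.flatten(1)).squeeze(-1)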

The NNs, the training procedures, and the gradient descent method are implemented using the PyTorch library [33] and executed on an NVIDIA RTX A6000 GPU.

2.2. Gradient descent optimization

In the orbital-free DL-DFT framework, the ground-state energy and density of novel Hamiltonian instances are determined by minimizing the DL functional using the gradient descent algorithm. The latter is defined by the following iterative step:

$n_{t+1}(x) = n_t(x) - \eta \left(\frac{\delta \widetilde{F}_{\omega}[n_t]}{\delta n_t(x)} + V(x) - \mu_t\right) \qquad (10)$

where η > 0 determines the step size, the integer $t = 0,1,\ldots ,t_{\mathrm{max}}$ labels the steps, and the coefficient µt, playing the role of a chemical potential, can be adapted, if needed, to ensure the normalization condition:

$\int \textrm{d} x \, n_t(x) = 1 \qquad (11)$

Unfortunately, as discussed in several previous studies [6, 8, 13-15], this minimization often fails in the presence of even minimal inaccuracies of the DL-functional derivative $\delta \widetilde{F}_{\omega}[n]/\delta n$. Indeed, these inaccuracies lead the descent towards nonphysical density profiles, often causing strong violations of the variational property. Some methods have already been implemented to restore stability. They are based, e.g. on the regularization of the functional derivative by linear and non-linear principal component analysis [13, 14], or on tailored NNs that reduce the noise of the functional derivative via an average operation performed on the hidden channels [15]. In this article, an alternative strategy is introduced. The instability is avoided by performing the gradient-descent minimization within the latent manifold generated by a VAE, which is trained to reproduce realistic density profiles without introducing excessively stringent constraints. The ground-state energy is hence expressed as an effective functional of the latent variable z via the combination of the decoder NN, which returns the corresponding density profile $\hat{n}[\mathbf{z}](x) = D_{\theta}[\mathbf{z}]$, and the DL-DFT functional:

$E[\mathbf{z}] = \widetilde{F}_{\omega}\left[\hat{n}[\mathbf{z}]\right] + \int \textrm{d} x \, \hat{n}[\mathbf{z}](x) V(x) \qquad (12)$

This functional has a minimum in the latent configuration corresponding to the ground-state density profile

$e_{\mathrm{gs}} = \min_{\mathbf{z}} E[\mathbf{z}] \qquad (13)$

Clearly, the actual ground state is reached only if it can be generated by decoding one of the points of the latent space. We will show that flexible enough VAEs can be easily implemented, while also avoiding unphysical artifacts that cause instabilities. With the above rearrangement, the gradient-descent algorithm is written as:

$\mathbf{z}_{t+1} = \mathbf{z}_t - \eta \, \nabla_{\mathbf{z}} E[\mathbf{z}_t] \qquad (14)$

where η > 0 determines the step size in the latent space. Equation (14) does not contain the adaptive coefficient µt because the normalization constraint, as well as the positivity constraint, is already embodied in the decoder. The value of η is also set depending on the value of β, since the regularization loss affects the norm of the latent variables. In our study η ranges from $10^{-2}$ to $10^{6}$. The number of gradient-descent steps ranges from $t_{\mathrm{max}} = 9500$ to $t_{\mathrm{max}} = 30000$, depending on the learning rate. Suitable choices for the initial density profile $n_0(x)$ are the average density profile of the training dataset or one randomly chosen configuration from the same dataset. Once the minimization has converged to a latent space point $\mathbf{z}_{\min}$, the ground-state energy and density profile can be estimated as $e_{\min} = E[\mathbf{z}_{\min}]$ and $n_{\min}(x) = \hat{n}[\mathbf{z}_{\min}](x)$, respectively.
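A minimal sketch of this latent-space descent (equation (14)), assuming a decoder that returns a normalized, positive density, a functional network acting on it, and a plain stochastic-gradient-descent optimizer, reads as follows; function names and hyperparameter values are placeholders.

import torch

def minimize_in_latent_space(decoder, functional, V, dx, z0, eta=1e-2, t_max=10000):
    # V: discretized external potential on the grid; dx: volume element of the grid
    z = z0.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=eta)                 # plain gradient descent on z
    for _ in range(t_max):
        n = decoder(z)                                 # density profile n[z](x)
        energy = functional(n) + (n * V).sum() * dx    # E[z] = F[n[z]] + ∫ dx n V
        opt.zero_grad()
        energy.backward()                              # autodiff through decoder + functional
        opt.step()
    return z.detach(), energy.item()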

3. Testbed models and training dataset

In the following, we describe the three testbed models considered in this article.

3.1. 1D Gaussian barrier potential

The first testbed model is a 1D single-particle Hamiltonian with a barrier formed by three Gaussians. It was originally introduced in [14] to study ML-DFTs. The potential is defined as:

$V(x) = \sum_{i=1}^{3} a_i \exp\left(-\frac{(x-b_i)^2}{2 c_i^2}\right) \qquad (15)$

with $a_i,b_i,c_i$ randomly sampled from uniform distributions in the intervals $[1E_0,10E_0]$, $[0.4a_0,0.6a_0]$, and $[0.03a_0,0.1a_0]$, respectively. For this system, we consider a box of size $L = a_0$ and a grid of $N_g = 256$ points, with hard-wall boundary conditions. For this testbed model, the density profiles corresponding to different random parameters display small variations. This is due to the chosen boundary conditions and to the selected ranges of parameters. The training and testing datasets are obtained via exact diagonalization of the Hamiltonian matrix resulting from a nine-point finite-difference discretization. The training dataset includes $N_{{\mathrm{train}}} = 15000$ instances. For the test dataset, we use the parameters from [14].
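To make the setup concrete, the following minimal sketch generates one such instance and solves it numerically; it assumes repulsive Gaussian barriers as described above, a simple three-point Laplacian in place of the nine-point formula used for the actual dataset, and units with $\hbar = m = a_0 = 1$.

import numpy as np

rng = np.random.default_rng(0)
Ng, L = 256, 1.0
x = np.linspace(0.0, L, Ng)
a = rng.uniform(1.0, 10.0, 3)                    # amplitudes a_i
b = rng.uniform(0.4, 0.6, 3)                     # centers b_i
c = rng.uniform(0.03, 0.1, 3)                    # widths c_i
V = np.sum(a[:, None] * np.exp(-(x[None, :] - b[:, None]) ** 2 / (2 * c[:, None] ** 2)), axis=0)

dx = x[1] - x[0]
# kinetic energy via a three-point Laplacian; hard walls are implied by the truncated matrix
T = (np.diag(np.full(Ng, 1.0 / dx ** 2))
     - np.diag(np.full(Ng - 1, 0.5 / dx ** 2), 1)
     - np.diag(np.full(Ng - 1, 0.5 / dx ** 2), -1))
eigvals, eigvecs = np.linalg.eigh(T + np.diag(V))
e_gs = eigvals[0]
n_gs = eigvecs[:, 0] ** 2 / (np.sum(eigvecs[:, 0] ** 2) * dx)   # normalized ground-state density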

3.2. 1D speckle potential

The second testbed is a 1D Hamiltonian with an external potential describing a random optical field. It was previously used in the framework of DL-DFT in [15]. This model represents the effect of disordered potentials used in cold-atom experiments. The potential can be numerically created from a Gaussian random complex field, via the Fourier filter described in [34, 35]. At any point x, the intensity of the potential V follows the probability distribution

$P(V) = \frac{1}{V_0} \exp\left(-\frac{V}{V_0}\right) \qquad (16)$

where $V_0 \geqslant 0$ is the average intensity and $V \geqslant 0$. The size of the speckle grains determines the characteristic disorder correlation length, which we set to coincide with the unit length a0. Specifically, this length scale determines the extent of the two-point autocorrelation function, which reads:

$\overline{V(x^{\prime})V(x^{\prime}+x)} = V_0^2\left[1 + \frac{\sin^2\left(\pi x / a_0\right)}{\left(\pi x / a_0\right)^2}\right] \qquad (17)$

where the bar indicates the average over a large ensemble of disorder realizations or, equivalently, over space in a sufficiently large box. For the training and testing datasets, we fix the disorder strength at the intermediate value $V_0 = 0.25 E_0$, with the system size $L = 14 a_0$, and a grid including $N_g = 256$ points. Periodic boundary conditions are adopted. Again, the ground-state energies and density profiles are determined via an eleven-point finite-difference method [36]. The number of instances for the training dataset is about $N_{{\mathrm{train}}} = 120\,000$ and it is available at the repository in [37]. Due to the large system size compared to the disorder correlation length, combined with the choice of periodic boundary conditions, different instances of the speckle potential typically lead to quite different density profiles. Hence, this potential represents a complementary testbed compared to the 1D Gaussian model.
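A minimal sketch of this construction, assuming a simple top-hat low-pass filter in Fourier space (the exact filter shape and cutoff of [34, 35] may differ), is:

import numpy as np

rng = np.random.default_rng(0)
Ng, L, V0, corr_len = 256, 14.0, 0.25, 1.0
field = rng.normal(size=Ng) + 1j * rng.normal(size=Ng)        # complex Gaussian random field
k = np.fft.fftfreq(Ng, d=L / Ng)                              # spatial frequencies of the grid
window = (np.abs(k) <= 1.0 / (2.0 * corr_len)).astype(float)  # assumed low-pass cutoff
speckle = np.fft.ifft(np.fft.fft(field) * window)
V = np.abs(speckle) ** 2                                      # intensity, exponentially distributed
V *= V0 / V.mean()                                            # fix the average intensity to V0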

3.3. 3D Gaussian potential

The third model is a 3D random potential, which is introduced here as a novel challenging testbed for DL-DFT. It allows the creation of substantially different ground-state density profiles. The potential satisfies periodic boundary conditions in a 3D box of size $L = a_0$. It is defined considering $N_{\mathrm{gauss}} = 10$ scattering centers, with random positions ${\mathbf{g}}_i = (g_{xi},g_{yi},g_{zi})$, where $i = 1,\ldots ,N_{\mathrm{gauss}}$, and the Cartesian coordinates $g_{\gamma i}\in[0,L]$, with $\gamma = x,y,z$, are sampled from a uniform random distribution in the box, namely, $g_{\gamma i} \sim \mathrm{Uniform}([0,L])$. The effect of each scatterer is described by a Gaussian, each multiplied by its own uniform random amplitude $A_i\sim \mathrm{Uniform}([0,2])$ but featuring the same standard deviation σ. Hence, the potential at position ${\mathbf{r}} = (x,y,z)$ is computed as:

$V(\mathbf{r}) = \sum_{i=1}^{N_{\mathrm{gauss}}} A_i \exp\left(-\frac{\left|{\boldsymbol{\Delta}r}_i\right|^2}{2\sigma^2}\right) \qquad (18)$

Here, the distance from the ith scatterer ${\boldsymbol{\Delta}r}_i = (\Delta x_i,\Delta y_i,\Delta z_i)$ is computed adopting the minimum image convention, namely, $\Delta \gamma_i = \delta \gamma_i - L \lfloor{\delta \gamma_i / L}\rceil$, where $\lfloor{\;}\rceil$ denotes the nearest-integer function and $\delta \gamma_i = \gamma - g_{\gamma i}$. This allows complying with periodic boundary conditions. We set $\sigma = a_0/6$. With this choice, periodic images of scatterers beyond the nearest one are irrelevant. The training dataset has $N_{{\mathrm{train}}} = 36\,000$ instances. The ground-state properties are determined via an eleven-point finite-difference formula for the 3D Laplacian [36] with a grid of $N_g = 18$ points per direction. In section 4.3, two modified versions of the random potential defined by (18) are considered to analyze the feasibility of a transfer learning strategy.
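A minimal sketch of how one such random potential can be evaluated on the grid, assuming $a_0 = 1$ and the minimum image convention described above, is:

import numpy as np

rng = np.random.default_rng(0)
L, Ng, n_gauss, sigma = 1.0, 18, 10, 1.0 / 6.0
centers = rng.uniform(0.0, L, size=(n_gauss, 3))     # random scatterer positions g_i
amps = rng.uniform(0.0, 2.0, size=n_gauss)           # random amplitudes A_i
grid = np.linspace(0.0, L, Ng, endpoint=False)
X, Y, Z = np.meshgrid(grid, grid, grid, indexing="ij")
r = np.stack([X, Y, Z], axis=-1)                     # shape (Ng, Ng, Ng, 3)
V = np.zeros((Ng, Ng, Ng))
for A, g in zip(amps, centers):
    d = r - g
    d -= L * np.rint(d / L)                          # minimum image convention
    V += A * np.exp(-np.sum(d ** 2, axis=-1) / (2.0 * sigma ** 2))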

4. Results

4.1. Density profile reconstruction by VAE

The first analysis we perform aims at assessing only the reconstruction accuracy of the VAE. Specifically, we check whether the trained VAE is able to accurately replicate in output the density profiles provided in input. If successful, this test would demonstrate that VAEs allow creating compressed representations of density profiles without losing any relevant information. As illustrative examples, for this preliminary test we mostly focus on the two 1D potentials. Special attention is devoted to the role of the hyperparameter β, which tunes the relative weight attributed to the regularization loss compared to the reconstruction term. The latent space dimension is set at the intermediate values $l_d = 8$ and $l_d = 16$ for the Gaussian and the speckle potentials, respectively. The (larger) choice for the latter is due to the enhanced variability of the corresponding density profiles, compared to those of the Gaussian model. The role of ld is further discussed in section 4.2.

The reconstruction is performed via the combined application of the encoder and the decoder on an input density profile:

$\hat{\boldsymbol{n}} = D_{\theta}\left[\boldsymbol{\mu}_{\phi}[\boldsymbol{n}]\right] \qquad (19)$

To quantify the reconstruction accuracy, we determine the average of the integrated absolute density difference, which we denote with $\overline{|\hat{\boldsymbol{n}}-\boldsymbol{n}|}$; the average is computed over a test set of $N_{{\mathrm{ts}}} = 100$ samples. As shown in figure 2, for both testbeds the reconstruction error rapidly decreases as β is reduced, allowing faithful profiles to be reached in regimes where, as fully discussed in section 4.2 in the framework of gradient-descent optimization, the latent space is still sufficiently regular. It is worth mentioning already here that, if the VAE is not able to faithfully reproduce all ground-state densities, it is not adequate to guide the gradient-descent minimization of the DL functional, since it might prevent reaching the actual ground state of some Hamiltonian instances.
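For reference, a minimal sketch of this error measure (hypothetical tensor shapes, sum-rule integration with grid spacing Δx) is:

import torch

def mean_abs_density_error(n_true, n_rec, dx):
    # integrated absolute difference ∫ dx |n̂(x) - n(x)|, averaged over the test set
    return (torch.abs(n_rec - n_true).sum(dim=-1) * dx).mean()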


Figure 2. Mean reconstruction error $\overline{|\hat{\boldsymbol{n}}-\boldsymbol{n}|}$ as a function of the (adimensional) regularization parameter β for the 1D speckle potential (blue circles) and the 1D Gaussian potential (red squares). The adimensional error measure $\overline{|\hat{\boldsymbol{n}}-\boldsymbol{n}|}$ is the integrated absolute density discrepancy, averaged over $N_{{\mathrm{ts}}} = 100$ test instances. The error bars denote the standard deviations. In both cases, the reconstruction accuracy worsens as β increases.


A visual representation of the reconstruction accuracy for different values of the regularization parameter β is provided in figure 3; there, one representative Hamiltonian instance for each 1D testbed is considered. One notices that large values of β cause substantial distortions in the reconstructed profiles, which indeed do not precisely reproduce all details of the ground truth data. Analogous findings are obtained for the 3D Gaussian potential. For example, for $\beta = 10^{-6}$ and latent space dimension $l_d = 32$, the reconstruction error is as small as $\overline{|\boldsymbol{n}-\hat{\boldsymbol{n}}|} = 0.003(1)$.


Figure 3. Illustrative examples of reconstructed density profiles compared to the ground-truth data n(x) (thick yellow curve), for the 1D Gaussian potential (a) and for the 1D speckle potential (b). Different values of the hyperparameter β are considered. The length and density units are a0 and $1/a_0$, respectively. The corresponding external potentials V(x) are also shown (dotted curves), referred to the right vertical axis in units of E0.


To further characterize the regularity of the VAE, we analyze the density profiles generated while moving along a chosen path in the latent space. Again, this is important to ascertain whether the trained VAE can help performing the gradient-based minimization of DL-based functionals in DFT. Specifically, we consider two representative instances of ground-state profiles of the 1D Gaussian potential, denoted with $\boldsymbol{n}_i$ and $\boldsymbol{n}_f$, while the corresponding representation means generated by the trained encoder are $\boldsymbol{\mu}_{\phi}[\boldsymbol{n}_i]$ and $\boldsymbol{\mu}_{\phi}[\boldsymbol{n}_f]$, respectively. A path connecting the two representations is selected considering the following convex combination

$\boldsymbol{\mu}_{\phi}^{\lambda}[\boldsymbol{n}] = (1-\lambda)\, \boldsymbol{\mu}_{\phi}[\boldsymbol{n}_i] + \lambda\, \boldsymbol{\mu}_{\phi}[\boldsymbol{n}_f] \qquad (20)$

where $\boldsymbol{\mu}_{\phi}^{0}[\boldsymbol{n}] = \boldsymbol{\mu}_{\phi}[\boldsymbol{n}_i]$ and $\boldsymbol{\mu}_{\phi}^{1}[\boldsymbol{n}] = \boldsymbol{\mu}_{\phi}[\boldsymbol{n}_f]$, and $\lambda\in[0,1]$. In figure 4 we show representative reconstructed profiles $\hat{\boldsymbol{n}}^{\lambda} = D_{\theta}[\boldsymbol{\mu}^{\lambda}_{\phi}[\boldsymbol{n}]]$ and the predicted ground-state energies as a function of λ. The key role played by the regularization parameter β is noticeable. For an excessively small value, irregular density profiles featuring nonphysical cusps are generated. For such profiles, the DL functional predicts energies approaching the absolute minimum of the whole training dataset, suggesting that the energy minimization would get stuck in a nonphysical profile. Instead, a sufficiently regularized VAE only generates realistic profiles, with predicted energy values well inside the expected range.
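A minimal sketch of this interpolation-and-decoding procedure, with placeholder encoder and decoder objects and deterministic encoding through the means, is:

import torch

def decode_along_path(encoder, decoder, n_i, n_f, n_points=11):
    mu_i, _ = encoder(n_i)                              # use only the encoder means
    mu_f, _ = encoder(n_f)
    profiles = []
    for lam in torch.linspace(0.0, 1.0, n_points):
        z = (1.0 - lam) * mu_i + lam * mu_f             # convex combination, equation (20)
        profiles.append(decoder(z))
    return profiles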


Figure 4. Illustrative examples of density profiles $n^{\lambda}(x)$, for a 1D Gaussian potential, decoded from points along a path in the latent space parameterized by $\lambda \in [0,1]$. The path is formed by the convex combination connecting the latent-space points corresponding to two ground-state profiles. The regularization parameters are $\beta = 10^{-5}$ (a) and $\beta = 10^{-2}$ (b). The insets show the energies predicted by the DL functional as a function of λ. The horizontal (red) dashed lines indicate the minimum energy over the whole training dataset. The length and density units are a0 and $1/a_0$, respectively.


4.2. Gradient descent results

Hereafter, the VAE trained to encode ground-state density profiles (see section 4.1) is employed to guide the energy minimization of the DL functional, following the approach described in section 2. Specifically, we analyze the accuracy of the gradient descent algorithm, inspecting whether the actual ground state is reached. The goal is to avoid both spurious constraints that would lead to a positive bias and instabilities due to unphysical profiles, which often lead to negative biases, i.e. to violations of the variational property. The roles of the regularization parameter β and of the latent space dimension ld are analyzed.

The first testbeds we discuss are the 1D Gaussian and speckle potentials. Figure 5 displays histograms of energy discrepancies after gradient descent for both the 1D Gaussian case and the 1D speckle case. When the latent space dimension is overestimated ($l_d = 8$ for the Gaussian model), we observe both positive discrepancies in the over-regularized regime (large β) and sizable variational violations in the opposite regime (small β). These effects are more quantitatively analyzed in figure 6, where the average absolute discrepancies are shown as a function of β. Furthermore, illustrative instances of potentials and density profiles are shown in figure 7. One notices that excessively large values of β do not allow reconstructing all details of the density profile, in particular for the speckle potential. This leads to positive energy discrepancies. Too small values introduce numerical artifacts, in particular for the Gaussian model, which cause the DL-functional to provide erroneous outputs and, hence, lead to negative energy discrepancies after the energy minimization. Yet, it is relatively easy to tune β and ld within a very broad range where the energy discrepancies are typically well below the threshold of chemical accuracy, generally set at 1 kcal mol−1. For example, with $l_d = 4$ for the Gaussian model and $l_d = 16$ for the (more variegated) speckle potential, β can be tuned almost at will.


Figure 5. Histograms of the relative energy discrepancies $e_{\min}-e_{{\mathrm{gs}}}$ for a test set of 1D Gaussian potentials (a) and 1D speckle potentials (b), after $t_{\max} = 15000$ steps of the gradient descent optimization. In (a), for β = 1, the excessive regularization of the latent space generates positive deviations. On the other hand, for $\beta = 10^{-5}$ and $l_d = 8$, substantial violations of the variational property occur (blue bins with dot pattern). A more appropriate latent space dimension, $l_d = 4$ for the 1D Gaussian model, solves the issue (empty black bins). For the 1D speckle case (b), reducing β and adopting the latent dimension $l_d = 16$ improves the accuracy without affecting the stability. The energy unit $E_0 = \hbar^2/ma_0^2$ can be identified with the Hartree energy. The vertical (red) lines indicate the chemical accuracy.


Figure 6. Mean energy error $\overline{|e_{\min}-e_{{\mathrm{gs}}}|}$ [(a) and (c)] and mean reconstruction error $\overline{|\boldsymbol{n}_{\min}-\boldsymbol{n}_{{\mathrm{gs}}}|}$ [(b) and (d)] after gradient descent optimization, for the 1D speckle potential [(a) and (b)] and the 1D Gaussian model [(c) and (d)].


Figure 7. Illustrative examples of density profiles n(x) obtained by minimizing the DL functional via gradient descent, for the 1D Gaussian potential (a) and for the 1D speckle potential (b). Different values of the regularization parameter β are considered.


While low-dimensional testbeds for DL-DFT methods have been intensively investigated also in the previous literature, here we extend our analysis to a challenging 3D potential, namely, the Gaussian-scatterer potential defined in section 3.3. In figure 8, the energy discrepancies after the gradient descent are visualized, considering the hyperparameters $\beta = 10^{-6}$ and $l_d = 32$. Remarkably, the average absolute error is as small as $\overline{|e_{\min}-e_{{\mathrm{gs}}}|} = 0.0002(2) \; E_0$. Notice that, again, this is below the chemical accuracy. To also visualize the fidelity of the reconstructed density profiles, we show in figure 9 three slices at different values of the z coordinate. The dimensionless average density error after gradient descent is $\overline{|\boldsymbol{{n}}_{\min}-\boldsymbol{n}_{{\mathrm{gs}}}|} = 0.004(1)$, indicating that the density profiles are accurately reproduced also in this 3D testbed.


Figure 8. Histograms of the relative energy discrepancies $e_{\min}-e_{{\mathrm{gs}}}$ for a test set of 3D Gaussian potentials after gradient-descent optimization. The initial profiles are generated by VAE with $\beta = 10^{-6}$. The vertical (red) lines indicate the threshold of chemical accuracy.


Figure 9. An illustrative example of minimum density profile $n_{\min}(x,y,z)$ (red surfaces in the middle column) at different slices, compared to the corresponding ground state density profile $n_{{\mathrm{gs}}}(x,y,z)$ (blue surface, left column). The corresponding slices of the 3D Gaussian potential $V(x,y,z)$ are also shown (right column, unit of E0).


4.3. Transferability of trained VAEs

Transfer learning is a popular approach in ML, whereby trained NNs are applied to (slightly) different tasks with little or no retraining. In the context of DL-DFT, it is important to assess whether VAEs and DL-functionals trained on a family of model potentials can be applied to novel testbeds with different features. To analyze this transferability, we consider two testbeds. In the first test, the 3D Gaussian potential defined in (18) is modified by doubling the number of random Gaussian scatterers, namely, setting $N_{\mathrm{Gauss}} = 20$ and $\sigma = 2/15$. In the second test, the number of scatterers is halved: $N_{\mathrm{Gauss}} = 5$ and $\sigma = 4/19$. This allows us to create two test datasets of 3D potentials with different features compared to the random instances in the training dataset. Specifically, they feature variations either on shorter or on longer length scales. Gradient-descent minimization is performed using the VAE and DL-functional trained on the original dataset of 3D Gaussian potentials. The results are shown in figure 10. Notably, we find that chemical accuracy is reached in both tests of transfer learning. This indicates that it is possible to train VAEs on a dataset for a certain family of potentials, for which training data might be available or could be more easily produced, and then use them to tackle novel problems for which data cannot be gathered.


Figure 10. Histograms of the relative energy discrepancies $e_{\min}-e_{{\mathrm{gs}}}$ for three sets of 3D Gaussian potentials after gradient-descent optimization. The performance on a set of random potentials sampled using the same parameters used for the training data (empty black bins) is compared with two tests of transfer learning (blue bins with dot pattern and green bins with oblique-line pattern), for which we use different parameters with respect to the original dataset (see section 4.3). The vertical (red) lines indicate the threshold of chemical accuracy.


5. Conclusions

In this investigation, we have demonstrated that β-VAEs are suitable to create compressed and regular representations of the density profiles of rather variegated physical systems. Chiefly, the trained VAEs allowed us to implement an efficient and stable energy minimization of DL energy functionals, thus solving a critical problem in the growing field of DL-based DFT. Our strategy is based on the combination of two deep NNs, namely, the VAE and a separate convolutional model that maps the density profile to the corresponding energy. Automatic algorithmic differentiation is applied to the combined model, making use of this enabling feature provided by modern DL software. In turn, this also allowed us to efficiently perform the gradient-descent optimization, also exploiting the computational performance of graphics processing units. Our numerical analysis focused on three testbed models, including both Hamiltonians that, for different random parameters, lead to rather consistent density profiles, as well as the opposite case where density profiles vary substantially. Notably, we also addressed a 3D Hamiltonian. The role of the regularization parameter β and of the latent space dimension has been analyzed, showing that suitable parameters that avoid both violations of the variational property and positive variational biases can be easily identified. Clearly, to train the VAE, as well as any DL-based energy functional, a sufficiently large amount of data must be available. However, we have found that transfer learning to different model potentials is possible. This indicates that it is possible to use pre-trained VAEs to address novel potentials in a different setup, for which data might not be available. To favor further studies on DL-DFT in 3D, we provide at the repository in [20] a database suitable for training and testing deep NNs for the 3D Gaussian model.

DL approaches are being increasingly adopted with different goals in the framework of DFT (see, e.g. [7, 21, 38-47]). They are mostly used to learn energy-density functionals from data, but also to accelerate the implementation and the solution of DFT with conventional approximations (see, e.g. [48, 49]). For example, in recent preprints [50] normalizing flows have been used to sample the density profiles, allowing the minimization of a conventional orbital-free functional via Monte Carlo sampling. In our study, we combined the VAE with a DL-functional trained from data, avoiding Monte Carlo integration by decoding density profiles from the latent space. From a general perspective, our study highlights the use of VAEs as a computational tool to extract effective variables that describe complex quantum systems in a compressed but essentially lossless manner.

Acknowledgments

We acknowledge useful discussions with R Fazio, S Cantori, and L Brodoloni. This work was supported by the PNRR MUR Project No. PE0000023-NQSTI and by the Italian Ministry of University and Research under the PRIN2022 project 'Hybrid algorithms for quantum simulators' No. 2022H77XB7. We also acknowledge partial contributions from the PRIN-PNRR 2022 MUR project 'UEFA' - P2022NMBAJ. S P acknowledges support from the CINECA awards IsCa6_NEMCAQS and IsCb2_NEMCASRA, for the availability of high-performance computing resources and support. E C acknowledges financial support from the Ministry of Economic Affairs and Digital Transformation of the Spanish Government through the QUANTUM ENIA project call—Quantum Spain project, and by the European Union through the Recovery, Transformation and Resilience Plan—NextGenerationEU within the framework of the Digital Spain 2026 Agenda. We also acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer LUMI, hosted by CSC (Finland) and the LUMI consortium through a EuroHPC Regular Access call.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://doi.org/10.5281/zenodo.10814855.

Footnotes

  • In interacting systems, the unknown functional would also include correlation effects, while the mean-field contribution would be conveniently encoded in the so-called Hartree term.
