Chen 2020
ARTICLE OPEN
The dielectric constant (ϵ) is a critical parameter utilized in the design of polymeric dielectrics for energy storage capacitors,
microelectronic devices, and high-voltage insulations. However, agile discovery of polymer dielectrics with desirable ϵ remains a
challenge, especially for high-energy, high-temperature applications. To aid accelerated polymer dielectrics discovery, we have
developed a machine-learning (ML)-based model to instantly and accurately predict the frequency-dependent ϵ of polymers with
the frequency range spanning 15 orders of magnitude. Our model is trained using a dataset of 1210 experimentally measured
ϵ values at different frequencies, an advanced polymer fingerprinting scheme and the Gaussian process regression algorithm. The
developed ML model is utilized to predict the ϵ of 11,000 synthesizable candidate polymers across the frequency range 60–10¹⁵ Hz,
with the correct inverse ϵ vs. frequency trend recovered throughout. Furthermore, using ϵ and another previously studied key
design property (glass transition temperature, Tg) as screening criteria, we propose five representative polymers with desired ϵ and
Tg for capacitors and microelectronic applications. This work demonstrates the use of surrogate ML models to successfully and
rapidly discover polymers satisfying single or multiple property requirements for specific applications.
npj Computational Materials (2020)6:61 ; https://doi.org/10.1038/s41524-020-0333-6
¹School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, GA 30332, USA.
²Electrical and Computer Engineering, University of Connecticut, 371 Fairfield Way, Storrs, CT 06269, USA.
³Polymer Program, Institute of Material Science, University of Connecticut, 97 North Eagleville Road, Storrs, CT 06269, USA.
⁴Collaboratory for Advanced Computing and Simulations, University of Southern California, Los Angeles, CA 90089-0242, USA. ✉email: [email protected]
Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences
L. Chen et al.
ϵ-prediction models were developed in our previous work32, those are limited by the accuracy of the underlying DFPT dataset,
especially due to the assumption of crystalline polymer structures (as mentioned above). More importantly, those models cannot
predict the complete frequency-dependent ϵ behavior.
In this work, we develop an ML model to predict the frequency-dependent ϵ behavior of polymers, using a dataset of 1210
experimentally measured values at various frequencies (spanning 15 orders of magnitude). This is achieved using a 3-level
hierarchical polymer fingerprinting scheme and the Gaussian process regression (GPR) algorithm to train the model, as shown in
Fig. 1. The resulting ML model can accurately and rapidly predict ϵ of new polymer candidates across a wide range of frequencies, as
validated using the performance on an unseen test set. To better understand the ML models developed and derive simple chemical
trends, we investigate the key chemical features that dominate the ϵ of polymers. Furthermore, to showcase the predictive power
and the usefulness of the developed surrogate models, we computed the frequency-dependent ϵ of a candidate set of 11,000
unseen polymers manually accumulated from various available sources7,21,32–34. Another critical design property (glass transition
temperature, Tg), reflective of the thermal stability of these polymers, was predicted using our previously developed ML
model32. Using these two predicted properties, five representative polymers satisfying specific ϵ and Tg requirements are proposed
for capacitor and microelectronic applications.

RESULTS
Dataset and polymer fingerprints
As illustrated in Fig. 2a, 1210 experimental ϵ values belonging to 738 unique polymers were collected from the literature9,19,21,33,35–42
to train the ML models. These measurements were made at 9 frequency values (i.e., 60, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁹,
and 10¹⁵ Hz), at room temperature and under dry conditions. Here, ϵ values at 10¹⁵ Hz represent the optical frequency region and
were obtained by taking the square of the experimental refractive index. Given the limitation of available experimental values, each
polymer in Fig. 2a has ϵ values available at 1–8 frequency values. Furthermore, this 738-polymer dataset includes 11 elements, i.e.,
C, H, B, O, N, S, P, Si, F, Cl, and Br and various polymer classes, e.g., polycarbonates, polyimide, polyamide, polyolefins, polyvinyl,
polyethers and polyesters. The ϵ distribution as a function of frequency (in Hz) is presented in Fig. 2a, along with the
corresponding polymer count at each frequency. We note that the ϵ dataset ranges from 1.3 to 11 and is slightly unbalanced in
terms of data count at different frequencies. This can be attributed to the difficulties experienced when making empirical
measurements at various frequencies, but we believe that the data diversity is sufficient to build reliable regression models. The
trends in ϵ values for 6 common and diverse polymers highlighted in Fig. 2a signify the importance of polymer chemistry. It is worth
noting that ϵ of polar polymers like PVDF and polyvinyl alcohol (PVA) significantly decreases with an increase in frequency, while
for non-polar polymers, such as polypropylene (PP) and ETFE, ϵ is not sensitive to the applied frequency. Therefore, for the ML
model to capture such trends accurately, it is essential that the dataset is representative and balanced in terms of polymer
chemistry and count, respectively. More details on the ϵ dataset are provided in the “Methods” section.
The next important step towards building accurate and reliable ML models is to generate relevant features that uniquely
represent each polymer and also capture its frequency-dependent ϵ behavior. To capture the polymer chemistry, we
used features from three hierarchical levels, i.e., (1) atomic-level fragments, (2) block-level fragments, and (3) chain-level features. A
total of 411 chemical features were used to numerically fingerprint 738 polymers. Additionally, the frequency in log-scale (log F) was
incorporated as the key feature to capture the frequency-dependent behavior, overall resulting in a 412-dimensional
feature vector. Next, the least absolute shrinkage and selection operator (LASSO) method was adopted for dimensionality
reduction and elimination of irrelevant features. The details on the fingerprinting scheme and the use of the LASSO method are
Fig. 1 Machine-learning workflow. Schematic of the workflow adopted to build general data-driven models of frequency-dependent ϵ for
polymers.
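The last stage of this workflow, a GPR model that takes the log-scale frequency as an input, can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it uses log F as the only feature, fixed (unoptimized) RBF kernel hyperparameters, and hypothetical ϵ values for a single made-up polymer. It only shows how a GPR returns both a prediction and an uncertainty, and why the uncertainty grows at frequencies far from any training point.

```python
import math

def rbf(a, b, length=2.0, variance=1.0):
    """Squared-exponential (RBF) kernel between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return variance * math.exp(-0.5 * d2 / length ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gpr_predict(X, y, x_new, noise=1e-2):
    """Return the GP posterior mean and standard deviation at x_new."""
    mean_y = sum(y) / len(y)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(a, x_new) for a in X]
    alpha = solve(K, [v - mean_y for v in y])
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))

# Hypothetical training data: log10 of the nine measured frequencies and toy eps values.
log_f = [math.log10(60), 2, 3, 4, 5, 6, 7, 9, 15]
eps = [3.4, 3.4, 3.3, 3.3, 3.2, 3.2, 3.1, 3.0, 2.5]
X = [[lf] for lf in log_f]

mean_seen, std_seen = gpr_predict(X, eps, [6.0])   # frequency present in training
mean_gap, std_gap = gpr_predict(X, eps, [12.0])    # frequency with no training data
print(f"10^6 Hz: {mean_seen:.2f} +/- {std_seen:.2f}")
print(f"10^12 Hz: {mean_gap:.2f} +/- {std_gap:.2f}")
```

In a real setting the feature vector would also carry the chemical fingerprint, and the kernel hyperparameters would be fitted rather than fixed.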
Fig. 2 Experimental dielectric constant dataset and the chemical space of training and unseen datasets. a Experimental ϵ as a function of
the frequency (unit, Hz), along with the data count at each frequency. The trends in ϵ values of six representative polymers are also shown
using dashed lines. b Chemical space of the training set (738 polymers) considered in this work (light blue squares), with respect to a larger
unseen dataset of 11,000 polymers (gray circles), illustrated using the first two principal components (PC1 and PC2). A few representative
polymer classes of the training dataset are highlighted with colored symbols.
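As a small illustration of how one data point of this dataset can be encoded, the sketch below appends log F to a chemical fingerprint and derives the optical-frequency ϵ from a refractive index. The function names, the all-zero placeholder fingerprint, and the refractive index of 1.5 are hypothetical and chosen only for the example.

```python
import math

# The nine measurement frequencies (Hz) reported for the dataset.
FREQUENCIES_HZ = [60, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e9, 1e15]

def log_frequency(freq_hz):
    """Frequency feature: base-10 logarithm of the frequency in Hz."""
    return math.log10(freq_hz)

def optical_epsilon(refractive_index):
    """Dielectric constant in the optical region, taken as n squared."""
    return refractive_index ** 2

def make_feature_vector(chemical_features, freq_hz):
    """Append log F to the chemical fingerprint of a polymer."""
    return list(chemical_features) + [log_frequency(freq_hz)]

# Hypothetical polymer: a placeholder fingerprint of 411 zeros and n = 1.5.
fingerprint = [0.0] * 411
vector = make_feature_vector(fingerprint, 1e15)
print(len(vector), vector[-1], optical_epsilon(1.5))
```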
included in the “Methods” section, while the final number of features retained for model development is summarized in
Table 1.

Table 1. Details of ML models. NX is the number of features.
Models | Train-validation-test split | Feature reduction | NX | ML algorithm
a | Polymer-types (738), group-shuffle-split, fivefold | None | 412 | GPR-RBF
  | | LASSO | 57 |
b | Data-points (1210), K-fold, fivefold | None | 412 | GPR-RBF
  | | LASSO | 53 |

To validate the generality, reliability and usefulness of the ML models developed in this work, the frequency-dependent ϵ of an
unseen dataset of 11,000 candidate polymers previously synthesized elsewhere (but for which no dielectric characterization has
been done)7,21,32–34 was predicted. This unseen dataset contains polymers distinct from the training dataset (of 738 polymers), but
is made up of the same 11 elements, i.e., C, H, B, O, N, S, P, Si, F, Cl, and Br. Furthermore, the chemical diversity of this unseen dataset
is quite similar to that of the training dataset (of 738 polymers), as illustrated in Fig. 2b using the first two (PC1 and PC2) components
obtained from the principal component analysis (PCA) on chemical features of all polymers. The similarity of the two datasets
is further discussed using the agglomerative hierarchical clustering analysis in Supplementary Section 1. Note that the training
dataset (light blue squares) spans the chemical space well, indicating that it is representative of the unseen polymer dataset
(gray circles). Several representative polymer classes of the training dataset are also labeled with colored symbols in Fig. 2b.

Frequency-dependent machine-learning models of dielectric constant
Considering that ϵ depends on both polymer-type and the applied frequency, the ML models (using the GPR algorithm) were trained
in two different fashions with varying train-validation-test splits, referred to here as the (1) polymer-types-split (738 polymers) and
(2) data-points-split (1210 points) approach. In the former split, the test set consists of completely different polymers than those in the
training set, resulting in evaluation of ML performance on unseen polymer cases. Both random and stratified sampling
methods were used in the latter to split train-validation-test sets across all polymers and all frequencies, as discussed in
Supplementary Section 2.1. The random sampling method is selected in the present work due to the comparable ML performance of the two
sampling methods. For all models, fivefold cross-validation (CV) was used to avoid overfitting, and two error metrics, namely, root
mean square error (RMSE) and the coefficient of determination (R²), were used to evaluate their performance.
Figure 3a1, b1 show the learning curves of the ML models trained using polymer-types-split and data-points-split methods,
respectively. The average training and test RMSE of ϵ prediction as a function of training set size is plotted, with the error bars
denoting 1σ standard deviation in the reported RMSE values over 50 runs. Results for both cases, i.e., with all 412 features
(GPR-XAll) and with those retained after LASSO dimensionality reduction (GPR-XLASSO), are included. As expected, the test RMSE decreases
with an increase in training set size for all cases. We note that the GPR-XLASSO does a better job of improving the ML performance
when trained using the data-points-split approach in comparison with the polymer-types-split approach. Further, the
polymer-types-split models gave a higher test RMSE of 0.67 using the 90% training set (664 polymers), while a test RMSE of 0.35 was
obtained with the data-points-split models (with 1089 training points). Considering that the ϵ dataset ranges from 1.3 to 11, this
amounts to an error of ≲7%. In addition to the LASSO feature reduction method, the recursive feature elimination (RFE) using the
linear support vector regression algorithm was used in the data-points-split model to backward eliminate irrelevant features. The
corresponding learning curve is shown in Supplementary Fig. 4, revealing that the GPR-XLASSO model provides higher prediction accuracy.
To further validate the generality and accuracy of the two ML models, all frequency-dependent information of five common
polymers, namely, polyethylene terephthalate (PET), polypropylene (PP), polyacrylonitrile (PAN), polyvinyl chloride (PVC) and
PDTC-HK511, was intentionally included in the 10% test set (completely unseen by the 90% train set). These five polymers
were selected based on their difference in polarity, wide range of ϵ values, and larger availability of frequency-dependent data. The
resulting parity plots of ML-predicted vs. experimental ϵ using the GPR-XLASSO models are portrayed in Fig. 3a2, b2. The
error bars in these cases represent the GPR uncertainty and the size of the markers denotes the frequency applied. It can be seen that
the R² for the test set of the polymer-types-split and data-points-split models is 0.74 and 0.92, respectively. The corresponding
frequency-dependent ϵ behavior for PP, PVC, and PAN polymers is shown in Fig. 3a3, b3. The remaining two polymers (PET and
PDTC-HK511) are available in Supplementary Fig. 5. It can be observed that the frequency-dependent ϵ trends for PP and PAN are
predicted fairly well using the polymer-types-split models, although the GPR uncertainties are slightly high due to the absence
of similar polymer chemistry within the training set. This issue is, however, greatly improved in the data-points-split model, wherein
more polymer types (695) are included in the training set as compared to that in the polymer-types-split method (with 664
polymers).
A major benefit of the presented ML models is their ability to predict ϵ across a wide range of frequencies (60–10¹⁵ Hz). In
Fig. 3a3, b3, we also show the ϵ predictions for the three unseen polymers at 10¹² Hz, where empirical data is unavailable. The ML
predictions can be seen to closely follow the available frequency-dependent ϵ trend. We also compare these models with our
previous work utilizing DFPT-based computed ϵ values at THz frequency (denoted as ML-DFPT). As illustrated in Supplementary
Fig. 6, the ML-DFPT predicted ϵ of PET, PP, and PVC are much higher than their corresponding experimental values at 10⁹ Hz,
leading to an incorrect frequency-dependent ϵ trend; the ϵ value should decrease with an increase in frequency. The reason for this
discrepancy is the overestimation of DFPT computed ϵ values, which are computed using unrealistic crystalline structures of
polymers having unreasonably higher densities than the realistic semi-crystalline or amorphous case. On the other hand, the
present ML models utilize information available at different frequencies (both the lower regime and the higher optical region)
to accurately predict the ϵ values at 10¹² Hz.
Overall, Fig. 3 shows that the data-points-split-based ML models perform better than their polymer-types-split-based counterparts
in terms of test RMSE, the error trends in the learning curve, and the prediction capability for five completely unseen polymers. Such
an observation is expected and understandable because of the inclusion of fewer polymer types in the polymer-types-split training set.
Moreover, in the data-points-split approach it is possible that the same polymers with different frequencies are randomly sampled
in the training and the test sets, thus improving the ML performance. From a theoretical standpoint, these two ML models
provide predictive capability of ϵ at two extremes: the data-points-split model is appropriate for polymer cases with some known
frequency-dependent ϵ values, while the polymer-types-split model is applicable for completely new polymers with no ϵ information.
With these systematic and careful studies, we believe that the random data-points-split approach is reliable and appropriate to
Fig. 3 Machine-learning models of dielectric constant. ML models of ϵ based on polymer-types-split a and data-points-split b. a1 and b1 are
learning curves trained using all features (GPR-Xall) and LASSO (GPR-XLASSO) reduced features, with the error bars denoting 1σ standard
deviation in the reported RMSE values over 50 runs. a2 and b2 are parity plots using GPR-XLASSO and the 90% train set, where all
frequency-dependent information of five polymers (PP, PET, PAN, PVC, and PDTC-HK511) was intentionally included in the 10% test set. Symbol sizes
represent the frequency applied. a3 and b3 show Expt. vs ML predicted ϵ of PP, PVC and PAN in a2 and b2, respectively, with frequency = 60,
10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁹, and 10¹⁵ Hz. The remaining two polymers (PET and PDTC-HK511) are available in Supplementary Fig. 5.
Furthermore, the additional ML predicted ϵ values at 10¹² Hz of these three polymers are shown. Error bars in a2, a3, b2, and b3 are predicted
GPR uncertainties.
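The difference between the two splitting strategies compared in Fig. 3 can be sketched in a few lines. This is a simplified stand-in written with the standard library only, not the scikit-learn group-shuffle-split and K-fold utilities named in the paper; the record layout, function names, and the 20 toy polymers are hypothetical.

```python
import random

def polymer_types_split(records, test_fraction=0.1, seed=0):
    """Hold out whole polymers, so no polymer appears in both sets."""
    rng = random.Random(seed)
    polymers = sorted({r["polymer"] for r in records})
    rng.shuffle(polymers)
    n_test = max(1, round(test_fraction * len(polymers)))
    test_ids = set(polymers[:n_test])
    train = [r for r in records if r["polymer"] not in test_ids]
    test = [r for r in records if r["polymer"] in test_ids]
    return train, test

def data_points_split(records, test_fraction=0.1, seed=0):
    """Hold out individual (polymer, frequency) points at random."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

# Toy dataset: 20 hypothetical polymers, each measured at 3 frequencies.
records = [{"polymer": f"P{i}", "log_f": lf}
           for i in range(20) for lf in (2.0, 6.0, 15.0)]

train_g, test_g = polymer_types_split(records)
train_p, test_p = data_points_split(records)

# Polymers seen in both training and test sets under each strategy.
shared_g = {r["polymer"] for r in train_g} & {r["polymer"] for r in test_g}
shared_p = {r["polymer"] for r in train_p} & {r["polymer"] for r in test_p}
print("polymer-types-split, shared polymers:", len(shared_g))
print("data-points-split, shared polymers:", len(shared_p))
```

The first strategy always reports zero shared polymers, which is why it probes performance on entirely new chemistries; the second usually lets a test polymer appear in training at other frequencies.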
be used to train the final predictive model with the entire dataset and CV.

Factors affecting dielectric constant
In addition to building the ML models, it is valuable to analyze the key features that correlate highly with the measured ϵ behavior in
polymers. In the data-points-split approach, 53 features were retained from the initial set of 412 after LASSO-based
dimensionality reduction. Figure 4 summarizes some representative features with strong negative or positive correlation with ϵ, with the
corresponding coefficients available in Supplementary Fig. 7. As expected, there is a negative correlation between log F (frequency
in log-scale) and ϵ with a coefficient of –0.93. Additionally, the presence of certain atomic- and block-level features, including
CH2CH2, CF2CF2, benzene rings, CH3, CF3, (CH3)3, and CH2CH2CH, and chain-level features, such as the high number of 3-vertex
carbon atoms, number of cyclic double bonds and presence of a purely single bond, lead to lower ϵ. The main reason is that
these functional groups introduce zero or negligible net dipole moments but larger free volumes, resulting in small net dipole
density and thus lower ϵelec. In contrast, the presence of polar groups, such as CH2CF2CH2, C–F, C–Cl, –OH, ketone, thioketones,
NH, amide, pyridine, pyrrole, CH2CH2O, and various fragments including NH/amide could strongly enhance the electronic
polarity (ϵelec) of polymers. Consequently, these positive (negative) correlated features can increase (decrease) the total ϵ across the
entire frequency regime by controlling ϵelec. Furthermore, the structural arrangement of these functional groups strongly affects
the polymer ϵ value, e.g., PVDF (CF2CH2CF2CH2) has an ϵ of 9.45 at 100 Hz while ETFE (CH2CH2CF2CF2) has an ϵ of just 2.6. Thus, it was
essential to cover such special sequence-controlled block-level features in our fingerprinting scheme (e.g., CH2CH2CF2 and
CH2CF2CH2) to distinguish polymers. Also, the chain-level features including the topological polar surface area of polar elements
(e.g., O, N, S, F, and Cl) and the number of H-bond acceptors have a positive relationship with ϵ. These features can increase the ionic
(ϵionic) and dipolar (ϵdipolar) parts by strengthening the H-bonding and dipole interactions between polymer chains, thus increasing
the overall ϵ at THz and lower frequency regime. All these findings can be helpful guidelines for rational design of polymers with
desired frequency-dependent ϵ values.
Fig. 4 Representative features affecting dielectric constant. Representative features having strong negative or positive correlations with
ϵ. R represents an arbitrary chemical group of C, O, H, N elements, and log F denotes the log-scale frequency value used as a feature in the
ML model.
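The sign of a feature's correlation with ϵ can be checked with a plain Pearson coefficient, as in the sketch below. This is only one possible reading of "correlation": the coefficients reported in Supplementary Fig. 7 may be defined differently, and the ϵ series used here is hypothetical, shaped like a polar polymer whose ϵ falls with frequency.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical polar-polymer series: eps drops as log F increases.
log_f = [2, 3, 4, 5, 6, 7, 9]
eps = [9.5, 9.2, 8.8, 8.0, 6.9, 5.5, 3.6]

r = pearson(log_f, eps)
print(f"Pearson r between log F and eps: {r:.2f}")
```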
Application-specific polymer design with desired dielectric constant
Next, we move on to apply the developed ML model to discover novel polymers with desired ϵ for capacitors and microelectronic
devices. As illustrated in Fig. 5a, the frequency-dependent ϵ of the 11,000 unseen candidate polymers in Fig. 2b were predicted using
the GPR-XLASSO model trained on the full dataset (1210 points), the data-points-split approach and fivefold CV. We note that
ϵ predictions can be made across a wide range of frequencies (e.g., 60, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, 10¹², and 10¹⁵ Hz),
although no training data is available at THz frequency. The inverse relation of predicted ϵ with frequency for these new
polymers can be observed in Fig. 5a and further validations are shown in Supplementary Fig. 8.
To optimize polymer candidates for capacitor and microelectronic applications, in addition to ϵ, another critical design
property, Tg, is considered. Polymers with high Tg are expected to be thermally stable, which is essential for these two
applications9,43,44. Thus, in Fig. 5a, we also provide ML predicted Tg using our previously developed models32. Based on the past
considerations appropriate for high-temperature energy density capacitors2,3,43,44, Tg ≥ 450 K was used as the first criterion to
discover polymers for high-temperature applications. As mentioned earlier, polymers with high ϵ are required for capacitors;
thus, 85 polymers with ϵ ≥ 5 (at 100 Hz), expected to display high energy density, were selected from Fig. 5a. As insulating films in
microelectronic devices need polymers with low ϵ to decrease the signal-delay time, 191 polymers with ϵ in a range of 2.0–2.5 (at
100 Hz) were identified. For each application, the frequency-dependent ϵ of five representative polymers is shown in Fig. 5b.
The corresponding monomer unit, and the ML-based ϵ (at 100 Hz) and Tg (in K) predictions are summarized in Fig. 6. Here, ID 1–5
represent cases with high ϵ for capacitors and ID 6–10 are polymers with low ϵ for microelectronic devices.
As shown in Fig. 5b, the frequency-dependent ϵ trend of the ten polymers is correctly captured. Moreover, the monomer chemistry
for the selected 5 polymers with high ϵ (ID 1–5) includes either amide, OH or C–Cl groups, agreeing with the positive correlation
trend discussed above (and shown in Fig. 4). Similarly, the presence of CF3 groups and benzene rings greatly decreases the
polymer ϵ, as mentioned earlier and can be seen from the selected list of low ϵ polymers with amide groups (ID 6–8) and OH groups
(ID 9) in Fig. 6. We also note that all of the selected 10 polymers contain rigid benzene rings, resulting in high Tg. Based on the
prediction accuracy reached by our models on the unseen test set, the ability of the model to correctly capture inverse ϵ vs. frequency
behavior, and the chemical arguments made above, we believe that these proposed ten polymers are good candidates for further
experimental validations.

DISCUSSION
Using an experimental ϵ dataset of 738 polymers (or 1210 data-points) at various frequencies, unique 3-level hierarchical polymer
features and the GPR algorithm, we built a single ML model to accurately predict the frequency-dependent ϵ behavior of
polymers. There are several advantages of the ML models presented here. First, they can predict ϵ of polymers across a wide
range of frequencies (60–10¹⁵ Hz, excluding the resonant frequency regions). The single ML model developed here more
accurately captures the inverse relationship between ϵ and frequency, compared with separate ML models for ϵ at different
frequency regimes, as discussed in Supplementary Section 4. As the frequency in log-scale was used as a feature in the single ML
model, the frequency-dependent trend was learned from the training data itself. Furthermore, we found the single ML model to
be more generalizable for new cases, as it was trained using a larger polymer dataset. An additional advantage of having the
frequency in log-scale as a feature is that it allows us to make ϵ predictions at any arbitrary frequency value, which is not possible
with separate ML models. This complete frequency-dependent picture provides comprehensive information to assist rational
design of new polymers. The present ϵ-prediction model is already implemented in our Polymer Genome platform
(http://www.polymergenome.org).
Second, the predicted GPR uncertainty acts as a useful guide to know when the ML predictions can be trusted. The present ML
model is more suitable for homo-polymers containing C, H, B, O, N, S, P, Si, F, Cl, and Br atoms. Also, higher uncertainties can be
expected within the frequency range of 10¹⁰–10¹⁴ Hz owing to the unavailability of training data in this regime. These uncertainties
Fig. 5 panels: a ϵ (ML prediction) vs. Tg (K, ML prediction), with a frequency (Hz) scale and the region Tg ≥ 450 K marked; b ϵ (ML prediction) vs. frequency (Hz) for 10 polymers with Tg ≥ 450 K.
Fig. 5 Machine-learning-predicted dielectric constant of 11,000 unseen polymers. a ML predicted ϵ at various frequencies (i.e., 60, 10², 10³,
10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, 10¹², and 10¹⁵ Hz) for 11,000 unseen polymers from Fig. 2b, along with their ML predicted Tg values. b Ten
representative polymers with high Tg (≥450 K) selected from a, such that five polymers (ID 1–5) have high ϵ (≥5), and remaining five (ID 6–10)
have low ϵ (2–2.5).
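The two-property screening behind this figure amounts to a pair of threshold filters, sketched below. The thresholds (Tg ≥ 450 K, ϵ ≥ 5, and ϵ between 2.0 and 2.5 at 100 Hz) come from the text; the `Candidate` record, the `screen` function, and the four example polymers with their values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    tg_k: float        # predicted glass transition temperature, K
    eps_100hz: float   # predicted dielectric constant at 100 Hz

def screen(candidates, tg_min=450.0):
    """Split thermally stable candidates into capacitor and microelectronics sets."""
    stable = [c for c in candidates if c.tg_k >= tg_min]
    capacitor = [c for c in stable if c.eps_100hz >= 5.0]
    microelectronics = [c for c in stable if 2.0 <= c.eps_100hz <= 2.5]
    return capacitor, microelectronics

# Hypothetical predictions for four made-up polymers.
pool = [
    Candidate("poly-A", tg_k=520.0, eps_100hz=5.6),
    Candidate("poly-B", tg_k=430.0, eps_100hz=6.1),   # fails the Tg criterion
    Candidate("poly-C", tg_k=480.0, eps_100hz=2.3),
    Candidate("poly-D", tg_k=500.0, eps_100hz=3.4),   # meets neither eps window
]

capacitor, microelectronics = screen(pool)
print([c.name for c in capacitor], [c.name for c in microelectronics])
```

Further criteria, such as dielectric loss or breakdown strength, could be added as extra filters in the same way.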
Fig. 6 Details of ten representative polymers. The monomer unit, and the ML predicted Tg and ϵ (at 100 Hz) of ten representative polymers
shown in Fig. 5b. Polymers with ID 1–5 have high ϵ (≥5), while ID 6–10 are polymers with low ϵ (2–2.5). The associated ML prediction
uncertainty is also provided.
can provide useful guidance for next experiments via active learning, with the newly generated data aiding model
improvement45.
Third, key features that strongly affect the polymer ϵ behavior were analyzed, forming a crude first stage criteria to find polymers
with the desired ϵ. To attain high ϵ, common polar groups, including C–F, –OH, C=O and amides, and rigid groups such as
pyridine and pyrrole can be introduced into polymers. On the other hand, the introduction of non-polar groups (e.g., benzene
rings and CH3) or functional groups with low polarization density (e.g., CF3) leads to low ϵ. However, we note that the presence of some
flexible polar groups may induce an unwanted high dielectric loss, which can be further eliminated by introducing additional
screening criteria on other polymer properties, e.g., low dielectric loss and high breakdown strength.
Finally, ϵ and Tg of about 11,000 polymers have been predicted using the ML models developed in this and our previous work32,
respectively, providing a huge pool of polymers for various applications. Using the Tg and ϵ as the screening criteria, 5 high
and 5 low ϵ polymers are proposed for capacitors and microelectronic devices, respectively. While this work initiates a
great opportunity to select polymers satisfying two properties, it can be easily extended to three or more properties.
Although we believe that the developed ML model is fairly accurate and universal, more efforts are envisioned in the future.
First, Fig. 3 shows that a test RMSE of 0.67 and 0.35 is achieved for the polymer-types-split and data-points-split-based ML models
using 90% training set and 10% test set, respectively. Therefore, it is expected that the average RMSE of predicted values for new
cases ranges from 0.35 to 0.67. For polymers in applications requiring a high ϵ of 5–11, even the RMSE of 0.67 leads to an
acceptable relative error of 6–13.4%. For applications requiring polymers with ϵ ranging from 2 to 3.5, the RMSE of 0.35 results in a
relative error of 10–17%, which is slightly high but acceptable. The relative error of some completely unseen polymers may reach
19–33% with respect to the RMSE of 0.67. However, their predicted GPR uncertainties should also be high. Therefore, more
data should be collected from literature either manually or using natural language processing techniques46 to improve the model
performance and dataset diversity. Second, almost no empirical data is available in the THz region. First-principles MD simulations
with the reactive force fields have been recently shown to accurately estimate ϵ values at THz frequencies using amorphous
phases of polymers47. Such a method can successfully overcome the problem of ϵ overestimation introduced because of the
unrealistically higher densities of crystalline polymer models used in the DFPT method. There is a great opportunity to incorporate
theoretical data to fill the empty THz region of our dataset. Third, new polymer features can be included at the morphological-level,
e.g., molecular weights, cross-link and torsion angles, to represent more complicated polymer chemical space. Also, more advanced
feature reduction methods can be developed to replace the present linear LASSO method.

METHODS
Dataset
The experimental ϵ of 738 polymers, measured at room temperature, under dry conditions and at 9 frequency values, i.e., 60, 10², 10³, 10⁴, 10⁵,
10⁶, 10⁷, 10⁹, and 10¹⁵ Hz, were considered in this work. These values were taken from refs. 9,19,21,33,35–42. The ϵ measurements within
the frequency range of 60–10⁹ Hz are commonly made using the impedance analyzer, the precision inductance, and capacitance and
resistance (LCR) meter18,42. ϵ values at 10¹⁵ Hz were obtained by taking the square of the experimental refractive index measured using
refractometers. Since experimental conditions significantly impact the measured ϵ, we collected the data only when the measurements
were made at room temperature (295 ± 5 K) and under dry conditions (with relative humidity <1%). We note that it is
almost impossible to find consistent sample qualities across the literature, with the common variations observed in sample thickness
and different

been synthesized and reported (but for which no dielectric characterization has been done). This dataset is substantially diverse,
containing numerous polymer classes, e.g., polyolefins, polyimides, polyamides, polyvinyls, polyethers, polyesters, polydienes,
polyoxides, and polycarbonates, but not more complex polymers such as copolymers, polymer blends, as well as ladder, cross-linked,
and metal-containing polymers. Because of the evidence of past synthetic work, polymer candidates identified for specific applications
from this candidate list using our model are expected to have good potential to be synthesized (again) and tested. This large dataset,
which contains polymer identities, names/labels, and/or monomer representations, was collected from various available sources,
including published articles, handbooks, and online repositories7,21,32–34.

Feature engineering
To build accurate and reliable ML models, it is important to include relevant features that numerically represent materials and
collectively capture the trends in ϵ values across a wide frequency range and across varying polymer chemistry. Our polymer
fingerprinting scheme is based on a pre-defined list of possible components covering various length scales, including (1) atomic-level
fragments, (2) block-level fragments, and (3) chain-level, i.e., extended features that capture higher level morphological
information in polymers. The atomic-level fragments are specified by the generic label “AiBjCk”, representing an i-fold coordinated A
atom, a j-fold coordinated B atom, and a k-fold coordinated C atom, connected in the specified order. For example, N3-C3-C4 represents
a threefold coordinated N, a threefold coordinated carbon and a fourfold coordinated carbon. The block-level fingerprint components
track the presence of 363 pre-defined building blocks that frequently occur in conventional polymers with some
representative examples being C6H6, C=O, CH2, and CF2. More importantly, a series of triplet-blocks were defined to represent the
specific structural arrangements of functional groups, e.g., CH2CH2CF2 and CH2CF2CH2. The occurrence of each block in the polymer
repeat unit (monomer) normalized by the number of atoms (of the monomer) is used as a block-level fingerprint component. The
chain-level features capture information at the highest length scale, including quantitative structure-property relationship (QSPR)
and morphological features. The QSPR features, e.g., van der Waals surface area, topological polar surface area,
and the fraction of rotatable bonds, were generated using the RDKit library. The morphological features, e.g., the length of the
longest/shortest side chains with/without rings and the shortest topological distance between rings, were developed by us. Using this
fingerprinting scheme, 155 atomic-level, 197 block-level and 59 chain-level features were generated for each of the 738 polymers,
leading to a total of 411 chemical features for each polymer. Additionally, the frequency in log-scale (log F)
was incorporated as a feature in the ML model development process, resulting in a total of 412 features. As per standard ML
practices, all features were scaled from 0 to 1 during the model training.
The least absolute shrinkage and selection operator (LASSO) method was used to retain the relevant features by optimizing the
regularization term to achieve the highest R². Subsequently, the remaining features with non-zero coefficients were used to construct
the ML models. For the LASSO dimensionality reduction scheme, all 412-dimensional features and the entire ϵ dataset were used.
Furthermore, the group-shuffle-split and K-fold libraries implemented in the sklearn python package were respectively
used for the polymer-types-split and the data-points-split approach. The resulting number of features (NX) is summarized in Table 1,
including the frequency feature internally selected by the LASSO method.
To visualize the chemical diversity of the training (738 polymers) and the unseen (11,000 polymers) datasets adopted here, PCA was
performed on the complete chemical features of these two datasets (706 features in total), excluding the frequency feature. The first
two (PC1 and PC2) components are shown in Fig. 2b and used to analyze the similarity of the two datasets with the agglomerative
hierarchical clustering method. As illustrated in Supplementary Fig. 1, the two datasets share 90% of their chemical space,
revealing that the training dataset fairly covers the
order of polymer crystallinity. While such uncertainties are unavoidable in chemical space of the unseen dataset.
experimental datasets, we believe they are acceptable to train reliable ML
models. For cases where multiple data-points were available we used the
average ϵ value. Gaussian process regression
Our developed ML model was used to make prediction for a completely We used the Gaussian process regression (GPR) with the radial basis
unseen dataset of roughly 11,000 homo-polymers that have previously function (RBF) kernel to train the ML models. In this case, the co-variance
function between two materials with features x and x′ is given by

$$k(x, x') = \sigma_f \exp\left(-\frac{1}{2\sigma_l^2}\,\lVert x - x' \rVert^2\right) + \sigma_n^2 \qquad (1)$$

Here, the three hyperparameters σf, σl, and σn represent the variance, the length-scale parameter and the expected noise in the data, respectively. These were determined during model training by maximizing the log-likelihood estimate. Further, as shown in Table 1, group-shuffle-split and K-fold methods with fivefold cross-validation were adopted in the polymer-types-split and the data-points-split models, respectively, to avoid overfitting. The root mean square error (RMSE) and the coefficient of determination (R^2) were used to evaluate the performance of the ML models. Further, learning curves (Fig. 3) were generated by varying the size of the training and the test sets to estimate the prediction errors on unseen data. Model performance (RMSE) was evaluated by averaging over 50 statistical runs with random training and test splits.

18. Wu, C. et al. Dipole-relaxation dynamics in a modified polythiourea with high dielectric constant for energy storage applications. Appl. Phys. Lett. 115, 163901 (2019).
19. Ma, R. et al. Rationally designed polyimides for high-energy density capacitor applications. ACS Appl. Mater. Interfaces 6, 10445–10451 (2014).
20. Ku, C. C. & Liepins, R. Electrical Properties of Polymers (Hanser Publishers, New York, 1987).
21. Bicerano, J. Prediction of polymer properties (CRC Press, 2002).
22. Wang, C. et al. Computational strategies for polymer dielectrics design. Polymer 55, 979–988 (2014).
23. Misra, M., Mannodi-Kanakkithodi, A., Chung, T., Ramprasad, R. & Kumar, S. K. Critical role of morphology on the dielectric constant of semicrystalline polyolefins. J. Chem. Phys. 144, 234905 (2016).
24. Jordan, M. I. & Mitchell, T. M. Machine learning: trends, perspectives, and prospects. Science 349, 255–260 (2015).
25. Ramprasad, R., Batra, R., Pilania, G., Mannodi-Kanakkithodi, A. & Kim, C. Machine learning in materials informatics: recent applications and prospects. NPJ Comput. Mater. 3, 54 (2017).
26. Gaultois, M. W. et al. Data-driven review of thermoelectric materials: Performance and resource considerations. Chem. Mater. 25, 2911–2920 (2013).
27. Chen, L., Tran, H., Batra, R., Kim, C. & Ramprasad, R. Machine learning models for the lattice thermal conductivity prediction of inorganic materials. Comp. Mat. Sci. 170, 109155 (2019).
28. Batra, R., Pilania, G., Uberuaga, B. P. & Ramprasad, R. Multifidelity information fusion with machine learning: a case study of dopant formation energies in hafnia. ACS Appl. Mater. Interfaces 11, 24906–24918 (2019).
29. Chandrasekaran, A. et al. Solving the electronic structure problem with machine learning. NPJ Comput. Mater. 5, 22 (2019).
30. Wu, K. et al. Prediction of polymer properties using infinite chain descriptors (icd) and machine learning: toward optimized dielectric polymeric materials. J. Polym. Sci. Pol. Phys. 54, 2082–2091 (2016).

DATA AVAILABILITY
The dielectric constant dataset will be made available upon reasonable request for academic use.

CODE AVAILABILITY
The codes that support the findings of this study are not publicly available as they are the Intellectual Property of Georgia Tech Research Corporation. However, they may be created using the descriptions provided in ref. 32 <Polymer Genome: A Data-Powered Polymer Informatics Platform for Property Predictions>, and the freely available RDKit and scikit-learn Python modules.
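As an illustration of how the regression step described in the Methods might be re-created, the following is a minimal sketch, not the authors' code. It uses only the Python standard library and shows two pieces of the described workflow: scaling features to the 0 to 1 range, and computing a GPR posterior mean with the RBF covariance of Eq. (1). Several things here are assumptions rather than statements from the paper: the function names are hypothetical, the noise term σn² is added only on the diagonal of the training covariance (the usual GPR convention), the targets are centered on their mean before fitting, and the hyperparameters are fixed by hand instead of being found by maximizing the log-likelihood. In practice, scikit-learn's GaussianProcessRegressor with its RBF and WhiteKernel kernels would replace the hand-written linear algebra.

```python
import math


def min_max_scale(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def rbf_kernel(x, y, sigma_f, sigma_l):
    """Noise-free part of Eq. (1): sigma_f * exp(-||x - y||^2 / (2 sigma_l^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sigma_f * math.exp(-sq_dist / (2.0 * sigma_l ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def gpr_predict_mean(X_train, y_train, X_test,
                     sigma_f=1.0, sigma_l=0.5, sigma_n=0.1):
    """GPR posterior mean with hand-set hyperparameters (an assumption).

    sigma_n^2 is added on the diagonal only, and targets are centered on
    their mean so that predictions far from the data revert to that mean.
    """
    n = len(X_train)
    y_mean = sum(y_train) / n
    K = [
        [rbf_kernel(X_train[i], X_train[j], sigma_f, sigma_l)
         + (sigma_n ** 2 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    alpha = solve(K, [v - y_mean for v in y_train])
    return [
        y_mean + sum(rbf_kernel(x, X_train[i], sigma_f, sigma_l) * alpha[i]
                     for i in range(n))
        for x in X_test
    ]


# Toy, made-up numbers (NOT from the paper's dataset):
# each row is [hypothetical chemical descriptor, log10(frequency / Hz)].
X_raw = [[0.2, 2.0], [0.2, 6.0], [0.8, 2.0], [0.8, 6.0]]
eps = [3.1, 2.9, 4.0, 3.6]
X = min_max_scale(X_raw)
print(gpr_predict_mean(X, eps, [[0.5, 0.5]]))  # query given in scaled units
```

Treating log-frequency as one more scaled feature, as the Methods describe, is what lets a single model return ϵ at any frequency for a given polymer fingerprint.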
Received: 28 January 2020; Accepted: 28 April 2020;

REFERENCES
1. Chu, B. et al. A dielectric polymer with high electric energy density and fast discharge speed. Science 313, 334–336 (2006).
2. Li, Q. et al. Flexible high-temperature dielectric materials from polymer nanocomposites. Nature 523, 576 (2015).
3. Tan, Q., Irwin, P. & Cao, Y. Advanced dielectrics for capacitors. IEEJ Trans. FM 126, 1153–1159 (2006).
4. Sharma, V. et al. Rational design of all organic polymer dielectrics. Nat. Commun. 5, 4845 (2014).
5. Huan, T. D. et al. Advanced polymeric dielectrics for high energy density applications. Prog. Mater. Sci. 83, 236–269 (2016).
6. Mannodi-Kanakkithodi, A. et al. Rational co-design of polymer dielectrics for energy storage. Adv. Mater. 28, 6277–6291 (2016).
7. Huan, T. D. et al. A polymer dataset for accelerated property prediction and design. Sci. Data 3, 160012 (2016).
8. Mannodi-Kanakkithodi, A. et al. Scoping the polymer genome: a roadmap for rational polymer dielectrics design and beyond. Mater. Today 21, 785–796 (2018).
9. Ho, J. S. & Greenbaum, S. G. Polymer capacitor dielectrics for high temperature applications. ACS Appl. Mater. Interfaces 10, 29189–29218 (2018).
10. Dissado, L. A. & Fothergill, J. C. Electrical Degradation and Breakdown in Polymers, Vol. 9 (IET, 1992).
11. Maier, G. Low dielectric constant polymers for microelectronics. Prog. Polym. Sci. 26, 3–65 (2001).
12. Dang, M. T., Hirsch, L. & Wantz, G. P3HT:PCBM, best seller in polymer photovoltaic research. Adv. Mater. 23, 3597–3602 (2011).
13. Facchetti, A. π-conjugated polymers for organic electronics and photovoltaic cell applications. Chem. Mater. 23, 733–758 (2011).
14. Huang, X. & Jiang, P. Core-shell structured high-k polymer nanocomposites for energy storage and dielectric applications. Adv. Mater. 27, 546–554 (2015).
15. Wang, Y. et al. Ultrahigh energy density and greatly enhanced discharged efficiency of sandwich-structured polymer nanocomposites with optimized spatial organization. Nano Energy 44, 364–370 (2018).
16. Smith, O. L. et al. Enhanced permittivity and energy density in neat poly(vinylidene fluoride-trifluoroethylene-chlorotrifluoroethylene) terpolymer films through control of morphology. ACS Appl. Mater. Interfaces 6, 9584–9589 (2014).
17. Nasreen, S. et al. Sn-polyester/polyimide hybrid flexible free-standing film as a tunable dielectric material. Macromol. Rapid Commun. 40, 1800679 (2019).
31. Mannodi-Kanakkithodi, A., Pilania, G., Huan, T. D., Lookman, T. & Ramprasad, R. Machine learning strategy for accelerated design of polymer dielectrics. Sci. Rep. 6, 20952 (2016).
32. Kim, C., Chandrasekaran, A., Huan, T. D., Das, D. & Ramprasad, R. Polymer genome: a data-powered polymer informatics platform for property predictions. J. Phys. Chem. C 122, 17575–17585 (2018).
33. Mark, J. Polymer Data Handbook (Oxford University Press, 1999).
34. Otsuka, S., Kuwajima, I., Hosoya, J., Xu, Y. & Yamazaki, M. In 2011 International Conference on Emerging Intelligent Data and Web Technologies, 22–29 (IEEE, 2011).
35. Baldwin, A. F. et al. Rational design of organotin polyesters. Macromolecules 48, 2422–2428 (2015).
36. Baldwin, A. F. et al. Poly (dimethyltin glutarate) as a prospective material for high dielectric applications. Adv. Mater. 27, 346–351 (2015).
37. Ma, R. et al. Rational design and synthesis of polythioureas as capacitor dielectrics. J. Mater. Chem. A 3, 14845–14852 (2015).
38. Lorenzini, R., Kline, W., Wang, C., Ramprasad, R. & Sotzing, G. The rational design of polyurea & polyurethane dielectric materials. Polymer 54, 3529–3533 (2013).
39. Chisca, S., Sava, I., Musteata, V.-E. & Bruma, M. Dielectric and conduction properties of polyimide films. In 2011 International Semiconductor Conference (CAS), Vol. 2, 253–256 (IEEE, 2011).
40. Mandelcorn, L. & Miller, R. L. High temperature, >200 degrees C, polymer film capacitors. In IEEE 35th International Power Sources Symposium, 369–372 (IEEE, 1992).
41. Pan, J., Li, K., Chuayprakong, S., Hsu, T. & Wang, Q. High-temperature poly (phthalazinone ether ketone) thin films for dielectric energy storage. ACS Appl. Mater. Interfaces 2, 1286–1289 (2010).
42. Li, Z. et al. High energy density and high efficiency all-organic polymers with enhanced dipolar polarization. J. Mater. Chem. A 7, 15026–15030 (2019).
43. Tan, D., Zhang, L., Chen, Q. & Irwin, P. High-temperature capacitor polymer films. J. Electron. Mater. 43, 4569–4575 (2014).
44. Wu, C. et al. Flexible temperature-invariant polymer dielectrics with large bandgap. Adv. Mater. e2000499 (2020).
45. Kim, C., Chandrasekaran, A., Jha, A. & Ramprasad, R. Active-learning and materials design: the example of high glass transition temperature polymers. MRS. Commun. 9, 860–866 (2019).
46. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
47. Fukushima, S. et al. Effects of chemical defects on anisotropic dielectric response of polyethylene. AIP Adv. 9, 045022 (2019).

ACKNOWLEDGEMENTS
This work is supported by the Office of Naval Research through N0014-17-1-2656, a Multi-University Research Initiative (MURI) grant. We thank Dr. Rui Ma, Dr. Gregory M.
Treich and Dr. Shamima Nasreen for collecting, organizing, and providing experimental data from their papers.

AUTHOR CONTRIBUTIONS
L.C. and R.R. initiated this research project; L.C. developed and analyzed the ML models; C.K. contributed to the development of polymer fingerprinting codes; R.B. and R.R. contributed to the model analysis and discussions; J.L., C.W., Z.L., A.D., Y.W., and H.T. contributed to the data collection; all co-authors contributed to the development of the manuscript.

COMPETING INTERESTS
The authors declare no competing interests.

ADDITIONAL INFORMATION
Supplementary information is available for this paper at https://doi.org/10.1038/s41524-020-0333-6.
Correspondence and requests for materials should be addressed to R.R.
Reprints and permission information is available at http://www.nature.com/reprints
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2020
Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences npj Computational Materials (2020) 61