XGB4mcPred: Identification of DNA N4-Methylcytosine Sites in Multiple Species Based on an eXtreme Gradient Boosting Algorithm and DNA Sequence Information
Abstract
1. Introduction
2. Materials and Methods
2.1. Datasets
2.2. Sequence Encoding
2.3. eXtreme Gradient Boosting Algorithm (XGBoost)
2.4. Model Evaluation
3. Results and Discussion
3.1. Nucleotide Composition Analysis
3.2. Determining the Optimal Feature Spaces
3.3. XGBoost Comparison with Other ML Algorithms
3.4. Comparative Analysis on Two Feature Fusion Strategies
3.5. Comparison with State-of-the-Art Predictors
3.6. Validation of Various Methods on Independent Datasets
3.7. Cross-Species Testing between Six Species
3.8. Challenges and Future Work
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Hyperparameter | Range |
---|---|
max_depth | 3–7 |
learning_rate | 0.01–0.2 |
colsample_bytree | 0.3–1 |
subsample | 0.3–1 |
gamma | 0–1.5 |
colsample_bylevel | 0.5–0.9 |
reg_alpha | 1–1.5 |
reg_lambda | 0–1.5 |
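
The ranges in the table above can be written down as a search space and sampled. The following is a minimal sketch, not the authors' tuning procedure: the source does not say whether grid search, random search, or another strategy was used, so uniform random sampling is an assumption here, and `SEARCH_SPACE` and `sample_params` are hypothetical names introduced for illustration. The keys match keyword arguments accepted by `xgboost.XGBClassifier`, so a sampled dictionary could be passed to that constructor.

```python
import random

# Ranges copied from the hyperparameter table: (low, high, kind).
SEARCH_SPACE = {
    "max_depth": (3, 7, "int"),
    "learning_rate": (0.01, 0.2, "float"),
    "colsample_bytree": (0.3, 1.0, "float"),
    "subsample": (0.3, 1.0, "float"),
    "gamma": (0.0, 1.5, "float"),
    "colsample_bylevel": (0.5, 0.9, "float"),
    "reg_alpha": (1.0, 1.5, "float"),
    "reg_lambda": (0.0, 1.5, "float"),
}


def sample_params(rng):
    """Draw one candidate configuration uniformly from SEARCH_SPACE.

    Uniform sampling is an assumption; the source only lists the ranges.
    """
    params = {}
    for name, (low, high, kind) in SEARCH_SPACE.items():
        if kind == "int":
            # randint includes both endpoints, so max_depth can be 3..7.
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params


rng = random.Random(0)  # fixed seed so the draw is reproducible
candidate = sample_params(rng)
```

Each candidate would then be scored by cross-validation and the best-scoring configuration kept; that loop is omitted because the source does not describe it.
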
Species | Features | ACC | MCC | SN | SP | Size |
---|---|---|---|---|---|---|
C. elegans | original | 0.844 | 0.689 | 0.852 | 0.837 | 2628 |
C. elegans | optimal | 0.859 | 0.718 | 0.866 | 0.853 | 214 |
D. melanogaster | original | 0.846 | 0.692 | 0.864 | 0.828 | 2628 |
D. melanogaster | optimal | 0.859 | 0.719 | 0.875 | 0.844 | 228 |
A. thaliana | original | 0.787 | 0.574 | 0.784 | 0.790 | 2628 |
A. thaliana | optimal | 0.804 | 0.609 | 0.799 | 0.809 | 184 |
E. coli | original | 0.854 | 0.709 | 0.874 | 0.835 | 2628 |
E. coli | optimal | 0.903 | 0.807 | 0.922 | 0.884 | 194 |
G. subterraneus | original | 0.840 | 0.681 | 0.832 | 0.847 | 2628 |
G. subterraneus | optimal | 0.864 | 0.729 | 0.846 | 0.882 | 241 |
G. pickeringii | original | 0.867 | 0.735 | 0.858 | 0.877 | 2628 |
G. pickeringii | optimal | 0.902 | 0.805 | 0.888 | 0.917 | 232 |
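
The ACC, MCC, SN, and SP columns are accuracy, Matthews correlation coefficient, sensitivity, and specificity. The sketch below computes them from confusion-matrix counts using the standard textbook definitions; it assumes the paper uses these definitions, since its evaluation section is not reproduced here. The function name `binary_metrics` and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return ACC, MCC, SN, SP from binary confusion-matrix counts.

    tp/fn count true 4mC sites predicted positive/negative;
    tn/fp count non-4mC sites predicted negative/positive.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    # Sensitivity: fraction of true sites that are recovered.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-sites correctly rejected.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "SN": sn, "SP": sp}


# Illustrative counts only, not taken from the paper's datasets.
example = binary_metrics(tp=90, tn=80, fp=20, fn=10)
```

Unlike ACC, MCC uses all four cells of the confusion matrix, so it stays informative when positive and negative classes differ in size.
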
Species | Strategies | ACC | MCC | SN | SP | p-Value |
---|---|---|---|---|---|---|
C. elegans | parallel fusion | 0.851 | 0.704 | 0.865 | 0.838 | 0.0132 |
C. elegans | serial fusion | 0.859 | 0.718 | 0.866 | 0.853 | |
D. melanogaster | parallel fusion | 0.852 | 0.704 | 0.864 | 0.840 | 0.0207 |
D. melanogaster | serial fusion | 0.859 | 0.719 | 0.875 | 0.844 | |
A. thaliana | parallel fusion | 0.792 | 0.584 | 0.787 | 0.797 | 0.0091 |
A. thaliana | serial fusion | 0.804 | 0.609 | 0.799 | 0.809 | |
E. coli | parallel fusion | 0.870 | 0.740 | 0.884 | 0.856 | 0.0002 |
E. coli | serial fusion | 0.903 | 0.807 | 0.922 | 0.884 | |
G. subterraneus | parallel fusion | 0.857 | 0.715 | 0.849 | 0.866 | 0.0223 |
G. subterraneus | serial fusion | 0.864 | 0.729 | 0.846 | 0.882 | |
G. pickeringii | parallel fusion | 0.883 | 0.766 | 0.873 | 0.893 | 0.0025 |
G. pickeringii | serial fusion | 0.902 | 0.805 | 0.888 | 0.917 | |

Independent test datasets

Species | Predictors | ACC | MCC | SN | SP |
---|---|---|---|---|---|
C. elegans | 4mCPred | 0.865 | 0.731 | 0.883 | 0.849 |
C. elegans | 4mcPred-SVM | 0.842 | 0.684 | 0.828 | 0.856 |
C. elegans | Meta-4mCpred | 0.870 | 0.741 | 0.843 | 0.897 |
C. elegans | 4mcPred-IFL | 0.815 | 0.636 | 0.751 | 0.88 |
C. elegans | Deep4mC | 0.915 | 0.832 | 0.907 | 0.925 |
C. elegans | DeepTorrent | 0.935 | 0.871 | 0.936 | 0.935 |
C. elegans | XGB4mcPred | 0.900 | 0.800 | 0.892 | 0.908 |
D. melanogaster | 4mCPred | 0.900 | 0.803 | 0.933 | 0.868 |
D. melanogaster | 4mcPred-SVM | 0.886 | 0.771 | 0.886 | 0.885 |
D. melanogaster | Meta-4mCpred | 0.906 | 0.812 | 0.913 | 0.899 |
D. melanogaster | 4mcPred-IFL | 0.907 | 0.814 | 0.912 | 0.902 |
D. melanogaster | Deep4mC | 0.911 | 0.822 | 0.933 | 0.890 |
D. melanogaster | DeepTorrent | 0.911 | 0.829 | 0.972 | 0.863 |
D. melanogaster | XGB4mcPred | 0.918 | 0.836 | 0.926 | 0.910 |
A. thaliana | 4mCPred | 0.816 | 0.632 | 0.842 | 0.789 |
A. thaliana | 4mcPred-SVM | 0.824 | 0.649 | 0.842 | 0.806 |
A. thaliana | Meta-4mCpred | 0.855 | 0.711 | 0.876 | 0.834 |
A. thaliana | 4mcPred-IFL | 0.823 | 0.650 | 0.826 | 0.824 |
A. thaliana | Deep4mC | 0.868 | 0.739 | 0.824 | 0.912 |
A. thaliana | DeepTorrent | 0.886 | 0.773 | 0.902 | 0.872 |
A. thaliana | XGB4mcPred | 0.858 | 0.718 | 0.870 | 0.843 |
E. coli | 4mCPred | 0.817 | 0.634 | 0.851 | 0.784 |
E. coli | 4mcPred-SVM | 0.784 | 0.569 | 0.746 | 0.821 |
E. coli | Meta-4mCpred | 0.825 | 0.650 | 0.806 | 0.843 |
E. coli | 4mcPred-IFL | 0.795 | 0.594 | 0.731 | 0.858 |
E. coli | Deep4mC | 0.754 | 0.517 | 0.815 | 0.713 |
E. coli | DeepTorrent | 0.818 | 0.634 | 0.810 | 0.824 |
E. coli | XGB4mcPred | 0.862 | 0.726 | 0.821 | 0.903 |
G. subterraneus | 4mCPred | 0.789 | 0.578 | 0.757 | 0.820 |
G. subterraneus | 4mcPred-SVM | 0.811 | 0.624 | 0.783 | 0.840 |
G. subterraneus | Meta-4mCpred | 0.850 | 0.701 | 0.817 | 0.883 |
G. subterraneus | 4mcPred-IFL | 0.686 | 0.378 | 0.591 | 0.78 |
G. subterraneus | Deep4mC | 0.830 | 0.664 | 0.797 | 0.871 |
G. subterraneus | DeepTorrent | 0.823 | 0.646 | 0.832 | 0.814 |
G. subterraneus | XGB4mcPred | 0.814 | 0.629 | 0.789 | 0.840 |
G. pickeringii | 4mCPred | 0.742 | 0.503 | 0.610 | 0.875 |
G. pickeringii | 4mcPred-SVM | 0.758 | 0.515 | 0.750 | 0.765 |
G. pickeringii | Meta-4mCpred | 0.850 | 0.700 | 0.835 | 0.865 |
G. pickeringii | 4mcPred-IFL | 0.678 | 0.358 | 0.740 | 0.615 |
G. pickeringii | Deep4mC | 0.803 | 0.611 | 0.850 | 0.767 |
G. pickeringii | DeepTorrent | 0.860 | 0.735 | 0.950 | 0.800 |
G. pickeringii | XGB4mcPred | 0.820 | 0.641 | 0.790 | 0.850 |

Predictors | E. col. Dataset | C. ele. Dataset | D. mel. Dataset | A. tha. Dataset | G. sub. Dataset | G. pic. Dataset |
---|---|---|---|---|---|---|
E. coli | 0.862 | 0.680 | 0.824 | 0.762 | 0.689 | 0.725 |
C. elegans | 0.710 | 0.900 | 0.727 | 0.670 | 0.630 | 0.643 |
D. melanogaster | 0.797 | 0.683 | 0.918 | 0.822 | 0.721 | 0.785 |
A. thaliana | 0.735 | 0.671 | 0.826 | 0.858 | 0.681 | 0.725 |
G. subterraneus | 0.678 | 0.635 | 0.640 | 0.679 | 0.814 | 0.720 |
G. pickeringii | 0.690 | 0.649 | 0.636 | 0.674 | 0.749 | 0.820 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, X.; Lin, X.; Wang, R.; Fan, K.-Q.; Han, L.-J.; Ding, Z.-Y. XGB4mcPred: Identification of DNA N4-Methylcytosine Sites in Multiple Species Based on an eXtreme Gradient Boosting Algorithm and DNA Sequence Information. Algorithms 2021, 14, 283. https://doi.org/10.3390/a14100283