Machine learning for predicting antibody-antigen interaction from amino acid sequences

Background: The mammalian immune system can generate antibodies against a huge variety of antigens, including bacteria, viruses and toxins. “Ultra-deep” DNA sequencing of rearranged immunoglobulin genes has considerable potential to further our understanding of the immune response, but is limited by the lack of a high-throughput, sequence-based method for predicting the antigen(s) a given immunoglobulin will recognize.

Objective: As a step towards predicting antibody-antigen binding from sequence data alone, we aimed to compare a range of machine learning approaches on a collated dataset of antibody-antigen pairs.

Methods: Data for training and testing were extracted from the PDB and CoV-AbDab databases, and additional antibody-antigen pair data were generated using a molecular docking protocol. Several machine learning methods, including weighted nearest neighbor, nearest neighbor with BLOSUM62 matrices and random forests, were applied to the problem.

Results: The final dataset contained 1157 antibodies and 57 antigens combined in 5041 antibody-antigen (Ab-Ag) pairs. The best performance for predicting interactions was obtained using nearest neighbor with BLOSUM62 matrices, which achieved around 82% accuracy on the full dataset. These results provide a useful frame of reference, as well as protocols and considerations for machine learning and dataset creation in this area.

Conclusions: Several machine learning approaches were compared for predicting antibody-antigen interaction from protein sequences. Both the dataset (in CSV format) and the machine learning program (written in Python) are freely available for download at
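
Illustrative sketch: the nearest-neighbor approach with BLOSUM62 matrices named in the Methods could look roughly like the following. This is not the authors' program. The abstract does not say how sequences are encoded, aligned or compared, so the sketch assumes ungapped, equal-length fragments, scores two Ab-Ag pairs by summing BLOSUM62 substitution scores position by position, and uses a single nearest neighbor. The function names, toy sequences and labels are all hypothetical, and the matrix is transcribed by hand from the standard BLOSUM62 table, so it should be checked against a reference copy before any real use.

```python
# Minimal 1-nearest-neighbor sketch using BLOSUM62 as the similarity measure.
# ASSUMPTIONS (not taken from the source): equal-length, ungapped sequences;
# position-wise score sums; hypothetical toy data and labels.

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 scores, hand-transcribed; rows and columns follow AMINO_ACIDS.
_ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

# Build a lookup table: (residue_a, residue_b) -> substitution score.
BLOSUM62 = {}
for res_a, row in zip(AMINO_ACIDS, _ROWS.strip().splitlines()):
    for res_b, score in zip(AMINO_ACIDS, row.split()):
        BLOSUM62[(res_a, res_b)] = int(score)


def similarity(seq_a, seq_b):
    """Sum BLOSUM62 scores over aligned positions (no gaps, equal length)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(BLOSUM62[(x, y)] for x, y in zip(seq_a, seq_b))


def pair_similarity(pair_a, pair_b):
    """Similarity of two (antibody, antigen) pairs: antibody score + antigen score."""
    return similarity(pair_a[0], pair_b[0]) + similarity(pair_a[1], pair_b[1])


def predict(query, training):
    """Return the label of the training pair most similar to the query pair."""
    nearest = max(training, key=lambda item: pair_similarity(query, item[0]))
    return nearest[1]


# Hypothetical toy data: ((antibody fragment, antigen fragment), binds?)
training = [
    (("ARDYW", "KLMNE"), 1),
    (("GGSPT", "KLMNE"), 0),
]
query = ("ARDFW", "KLMNE")

print(predict(query, training))  # 1
```

With this toy data the query is labeled as binding, because its antibody fragment differs from the first training example only by a conservative Y/F substitution (score 3), whereas the second example scores negatively at most positions. Real antibody and antigen sequences vary in length, so a practical version would need an alignment or fixed-length encoding step that this sketch leaves out.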
PhD Doctorate
