Abstract
Background:
The mammalian immune system is able to generate antibodies against a huge variety
of antigens including bacteria, viruses and toxins. “Ultra-deep” DNA sequencing of
rearranged immunoglobulin genes has considerable potential in furthering our
understanding of the immune response, but is limited by the lack of a high-throughput,
sequence-based method for predicting the antigen(s) a given immunoglobulin will
recognize.
Objective:
As a step towards predicting antibody-antigen binding from sequence data alone,
we aimed to compare a range of machine learning approaches on a collated dataset
of antibody-antigen pairs.
Methods:
Data for training and testing were extracted from the PDB and CoV-AbDab databases,
and additional antibody-antigen pair data were generated using a molecular docking
protocol. Several machine learning methods, including weighted nearest neighbor,
nearest neighbor with BLOSUM62 matrices and random forests, were applied to the
problem.
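
As an illustration of the nearest-neighbor idea named above, the following is a minimal sketch rather than the authors' implementation. It assumes that sequences are pre-aligned to equal length, that similarity is the position-wise sum of substitution scores, and that a pair is scored by adding the antibody and antigen similarities. The function names, the toy matrix (a stand-in that does not contain real BLOSUM62 values) and the example sequences are all hypothetical.

```python
from itertools import product

# Toy substitution scores: a stand-in for BLOSUM62, NOT its real values.
ALPHABET = "ACDE"
TOY_MATRIX = {(a, b): (4 if a == b else -1) for a, b in product(ALPHABET, repeat=2)}


def similarity(seq_a, seq_b, matrix):
    """Sum substitution scores position by position.

    Assumes the two sequences are already aligned to equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


def predict_binding(query, training, matrix):
    """1-nearest-neighbor: return the label of the most similar training pair.

    query is (antibody_seq, antigen_seq); training is a list of
    ((antibody_seq, antigen_seq), label) tuples.
    """
    q_ab, q_ag = query

    def pair_score(item):
        (ab, ag), _label = item
        # Assumption: pair similarity is antibody similarity plus antigen similarity.
        return similarity(q_ab, ab, matrix) + similarity(q_ag, ag, matrix)

    _, label = max(training, key=pair_score)
    return label


# Hypothetical training pairs: 1 = binds, 0 = does not bind.
training = [
    (("ACDE", "AACC"), 1),
    (("EEDD", "CCAA"), 0),
]
prediction = predict_binding(("ACDD", "AACC"), training, TOY_MATRIX)
```

Passing the matrix as a parameter keeps the classifier independent of the scoring scheme. With Biopython installed, the real BLOSUM62 matrix loaded via `Bio.Align.substitution_matrices.load("BLOSUM62")` could plausibly be supplied in place of the toy matrix.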
Results:
The final dataset contained 1157 antibodies and 57 antigens combined in 5041 Ab-Ag
pairs. The best performance for prediction of interactions was obtained using nearest
neighbor with BLOSUM62 matrices, which achieved around 82% accuracy on the full
dataset. These results provide a useful frame of reference as well as protocols and
considerations for machine learning and dataset creation in this area.
Conclusions:
Several machine learning approaches were compared to predict antibody-antigen
interaction from protein sequences. Both the dataset (in CSV format) and the machine
learning program (coded in Python) are freely available for download at
https://github.com/jessye123/ab-ag-seq-machine-learning