Improving antibody language models with native pairing

Sarah M Burbach; Bryan Briney

doi:10.1016/j.patter.2024.100967

Patterns (N Y). 2024 Apr 4;5(5):100967. doi: 10.1016/j.patter.2024.100967. eCollection 2024 May 10.

Authors

Sarah M Burbach^{1,2,3}, Bryan Briney^{1,2,3,4,5}

Affiliations

¹ Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA.
² Center for Viral Systems Biology, The Scripps Research Institute, La Jolla, CA 92037, USA.
³ Multi-Omics Vaccine Evaluation Consortium, The Scripps Research Institute, La Jolla, CA 92037, USA.
⁴ Scripps Consortium for HIV/AIDS Vaccine Development, The Scripps Research Institute, La Jolla, CA 92037, USA.
⁵ San Diego Center for AIDS Research, The Scripps Research Institute, La Jolla, CA 92037, USA.

Abstract

Existing antibody language models are limited by their use of unpaired antibody sequence data. A recently published dataset of ∼1.6 × 10⁶ natively paired human antibody sequences offers a unique opportunity to evaluate how antibody language models are improved by training with native pairs. We trained three baseline antibody language models (BALM), using natively paired (BALM-paired), randomly paired (BALM-shuffled), or unpaired (BALM-unpaired) sequences from this dataset. To address the paucity of paired sequences, we additionally fine-tuned ESM-2 (evolutionary scale modeling) with natively paired antibody sequences (ft-ESM). We provide evidence that training with native pairs allows the model to learn immunologically relevant features that span the light and heavy chains, which cannot be simulated by training with random pairs. We additionally show that training with native pairs improves model performance on a variety of metrics, including the ability of the model to classify antibodies by pathogen specificity.
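The three training conditions named in the abstract (natively paired, randomly paired, unpaired) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `build_variants`, the `<sep>` separator string, the heavy-then-light ordering, and the toy sequence fragments are all assumptions made for illustration only.

```python
import random


def build_variants(pairs, sep="<sep>", seed=0):
    """Build three illustrative training sets from (heavy, light) pairs.

    `pairs` is a list of (heavy_chain, light_chain) strings. The separator
    and the heavy-then-light ordering are assumptions, not the paper's format.
    """
    rng = random.Random(seed)  # local RNG so the shuffle is reproducible
    heavies = [h for h, _ in pairs]
    lights = [l for _, l in pairs]

    # Natively paired: each heavy chain keeps its own light chain.
    paired = [f"{h}{sep}{l}" for h, l in pairs]

    # Randomly paired: same chains, but light chains are permuted, so the
    # heavy/light association is (in general) no longer the native one.
    shuffled_lights = lights[:]
    rng.shuffle(shuffled_lights)
    shuffled = [f"{h}{sep}{l}" for h, l in zip(heavies, shuffled_lights)]

    # Unpaired: every chain is its own training example.
    unpaired = heavies + lights

    return {"paired": paired, "shuffled": shuffled, "unpaired": unpaired}


if __name__ == "__main__":
    # Toy fragments, invented for the example (not real antibody sequences).
    toy_pairs = [("EVQL", "DIQM"), ("QVQL", "EIVL"), ("QLQL", "SYEL")]
    for name, examples in build_variants(toy_pairs).items():
        print(name, examples)
```

Under these assumptions, the paired and shuffled sets contain exactly the same chains and differ only in which light chain accompanies each heavy chain, which is the contrast the abstract relies on when it states that cross-chain features cannot be simulated with random pairs.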

Grants and funding