By adapting large language models, researchers have made great progress in predicting a protein’s structure from its sequence. However, this approach has been less successful for antibodies, in part because of the hypervariability seen in this type of protein.
To overcome that limitation, MIT researchers have developed a computational technique that allows large language models to predict antibody structures more accurately. Their work could enable researchers to sift through tens of thousands of possible antibodies to identify those that could be used to treat SARS-CoV-2 and other infectious diseases.
“Our method allows us to scale, whereas others do not, to the point where we can actually find a few needles in the haystack,” says Bonnie Berger, the Simons Professor of Mathematics, head of the Computation and Biology group in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and one of the senior authors of the new study. “If we could help to stop drug companies from going into clinical trials with the wrong thing, it would really save a lot of money.”
The technique, which focuses on modeling the hypervariable regions of antibodies, also holds potential for analyzing entire antibody repertoires from individual people. This could be useful for studying the immune response of people who are super responders to diseases such as HIV, to help figure out why their antibodies fend off the virus so effectively, and could ultimately inform strategies for treatment and possibly prevention.
Bryan Bryson, an associate professor of biological engineering at MIT and a member of the Ragon Institute of MGH, MIT, and Harvard, is also a senior author of the study. Rohit Singh, a former CSAIL researcher who is now an assistant professor of biostatistics and bioinformatics and cell biology at Duke University, and Chiho Im, class of 2022, are the lead authors of the paper. Researchers from Sanofi and ETH Zurich also contributed to the research.
Proteins consist of long chains of amino acids, which can fold into an enormous number of possible structures. In recent years, predicting these structures has become much easier, thanks to artificial intelligence programs such as AlphaFold. Many of these programs, such as ESMFold and OmegaFold, are based on large language models, which were originally developed to analyze vast amounts of text, allowing them to learn to predict the next word in a sequence. The same approach can work for protein sequences, by learning which protein structures are most likely to form from different patterns of amino acids.
However, this technique doesn’t always work on antibodies, especially on the segments known as hypervariable regions. Antibodies usually have a Y-shaped structure, and the hypervariable regions are located at the tips of the Y, where they detect and bind to foreign proteins, also known as antigens. The bottom part of the Y provides structural support and helps antibodies interact with immune cells.
Hypervariable regions vary in length but usually contain fewer than 40 amino acids. It has been estimated that the human immune system can produce up to 1 quintillion different antibodies by changing the sequence of these amino acids, helping to ensure that the body can respond to a huge variety of potential antigens. Because those sequences are not evolutionarily constrained in the same way as other protein sequences, it is difficult for large language models to learn to predict their structures accurately.
According to Singh, part of the reason language models can predict protein structure well is that evolution constrains these sequences in ways that the model can learn to decipher. “It’s similar to learning the rules of grammar by looking at the context of words in a sentence, allowing you to figure out what it means,” he says.
To model those hypervariable regions, the researchers created two modules that build on existing protein language models. One module was trained on hypervariable sequences from about 3,000 antibody structures found in the Protein Data Bank (PDB), allowing it to learn which sequences tend to generate similar structures. The other module was trained on data that correlates about 3,700 antibody sequences with how strongly they bind three different antigens.
The resulting computational model, known as AbMap, can predict antibody structures and binding strength based on their amino acid sequences. To demonstrate the usefulness of the model, the researchers used it to predict antibody structures that would strongly neutralize the spike protein of the SARS-CoV-2 virus.
The researchers started with a set of antibodies that had been predicted to bind to this target, then generated tens of thousands of variants by changing their hypervariable regions. Their model identified the antibody structures likely to be most successful, and did so much more accurately than traditional protein-structure models based on large language models.
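As a rough illustration of what generating variants of a hypervariable region can look like in code, here is a minimal sketch that enumerates every single-residue substitution within a chosen stretch of a sequence. The function name `point_mutants`, the toy sequence, and the region boundaries are all hypothetical, and single-point substitution is an assumption made for the example; the article does not describe how the researchers actually produced their variants.

```python
# Hypothetical sketch: enumerate single-residue substitutions in one region.
# This is NOT the researchers' procedure, only an illustration of the idea.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def point_mutants(sequence, region_start, region_end):
    """Yield each sequence that differs from `sequence` by exactly one
    residue located in the half-open range [region_start, region_end)."""
    for pos in range(region_start, region_end):
        for aa in AMINO_ACIDS:
            if aa != sequence[pos]:
                # Replace the residue at `pos` and keep everything else.
                yield sequence[:pos] + aa + sequence[pos + 1:]


if __name__ == "__main__":
    toy_sequence = "ACDEFG"  # made-up sequence, not a real antibody
    variants = list(point_mutants(toy_sequence, 2, 4))
    print(len(variants), "variants, for example", variants[0])
```

Each position in the region contributes 19 variants, so even a short region yields a large candidate pool once multiple substitutions are combined, which is why a fast scoring model matters.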
The researchers then took the additional step of clustering the antibodies into groups with similar structures. They chose antibodies from each of these clusters to test experimentally, working with researchers at Sanofi. Those experiments found that 82 percent of these antibodies had better binding strength than the original antibodies that went into the model.
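The cluster-then-select step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: each antibody is represented by a nonzero numeric embedding vector, similarity is measured by cosine similarity, clusters are formed with a simple greedy "leader" scheme, and the top-scoring member of each cluster is chosen. The function names, the threshold, and the toy data are hypothetical; the article does not say which clustering method or selection rule the team used.

```python
# Hypothetical sketch of "cluster by structural similarity, then pick one
# candidate per cluster". Greedy leader clustering is a stand-in technique.
import math


def cosine_similarity(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def leader_clusters(embeddings, threshold=0.9):
    """Assign each embedding to the first cluster whose leader (its first
    member) is at least `threshold` similar; otherwise start a new cluster.
    Returns a list of clusters, each a list of indices into `embeddings`."""
    clusters = []
    for i, vec in enumerate(embeddings):
        for members in clusters:
            if cosine_similarity(vec, embeddings[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def pick_per_cluster(clusters, scores):
    """Return the index of the highest-scoring member of each cluster."""
    return [max(members, key=lambda i: scores[i]) for members in clusters]


if __name__ == "__main__":
    # Toy 2-D embeddings and made-up predicted binding scores.
    toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
    toy_scores = [0.2, 0.8, 0.5, 0.4]
    groups = leader_clusters(toy_embeddings)
    print(groups, pick_per_cluster(groups, toy_scores))
```

Selecting one candidate per cluster, rather than simply the top scorers overall, keeps the tested set structurally diverse.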
Identifying a variety of good candidates early in the development process could help drug companies avoid spending a lot of money on testing candidates that end up failing later on, the researchers say.
“They don’t want to put all their eggs in one basket,” Singh says. “They don’t want to say, I’m going to take this one antibody and take it through preclinical trials, and then it turns out to be toxic. They would rather have a set of good possibilities and move all of them through, so that they have some choices if one goes wrong.”
Using this technique, researchers could also try to answer some longstanding questions about why different people respond to infection differently. For example, why are some people more vulnerable to viral infections than others?
Scientists have been trying to answer those questions by performing single-cell RNA sequencing of immune cells from individuals and comparing them, a process known as antibody repertoire analysis. Previous work has shown that the antibody repertoires of two different people may overlap by as little as 10 percent. However, sequencing does not offer as comprehensive a picture of antibody performance as structural information does, because two antibodies with different sequences may have similar structures and functions.
The new model can help solve that problem by quickly generating structures for all of the antibodies found in an individual. In this study, the researchers showed that when structure is taken into account, there is much more overlap between individuals than the 10 percent seen in sequence comparisons. They now plan to further investigate how these structures may contribute to the body’s overall immune response against a particular pathogen.
“This is where a language model fits in very beautifully, because it has the scalability of sequence-based analysis, but it approaches the accuracy of structure-based analysis,” Singh says.
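To make the sequence-versus-structure comparison concrete, the sketch below computes two overlap measures between a pair of repertoires: the fraction of sequences shared verbatim, and the fraction of antibodies that have a structurally close counterpart in the other repertoire. It assumes each antibody already has a numeric embedding that reflects its structure; the function names, the Euclidean distance cutoff, and the toy data are hypothetical and are not taken from the study.

```python
# Hypothetical sketch: exact-sequence overlap versus embedding-based overlap
# between two antibody repertoires. All data here is made up.
import math


def sequence_overlap(rep_a, rep_b):
    """Fraction of distinct sequences in rep_a that also appear in rep_b."""
    set_a, set_b = set(rep_a), set(rep_b)
    return len(set_a & set_b) / len(set_a)


def structural_overlap(emb_a, emb_b, max_distance):
    """Fraction of embeddings in emb_a that lie within `max_distance`
    (Euclidean) of at least one embedding in emb_b."""
    matched = sum(
        1 for a in emb_a
        if any(math.dist(a, b) <= max_distance for b in emb_b)
    )
    return matched / len(emb_a)


if __name__ == "__main__":
    person_a = ["AAA", "CCC", "DDD", "EEE"]
    person_b = ["AAA", "FFF", "GGG", "HHH"]
    emb_a = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
    emb_b = [(0.0, 0.1), (1.1, 1.0), (5.0, 5.2), (20.0, 20.0)]
    print(sequence_overlap(person_a, person_b))
    print(structural_overlap(emb_a, emb_b, max_distance=0.5))
```

In this toy case only one of four sequences matches exactly, yet three of four antibodies have a close neighbor in embedding space, mirroring the kind of gap between sequence-level and structure-level overlap that the article describes.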
The research was funded by Sanofi and the Abdul Latif Jameel Clinic for Machine Learning in Health.