Friday, December 13, 2024

Protein similarity search using ProtT5-XL-UniRef50 and Amazon OpenSearch Service

Proteins are complex biomolecules composed of amino acid sequences that fold into three-dimensional structures. This 3D structure allows a protein to bind to other structures in the body and initiate changes. Such binding is fundamental to how many medications work.

In drug discovery, researchers often follow a standard approach: identify proteins comparable to a given protein, on the assumption that similar proteins are likely to exhibit similar properties. Typically, they look for variants of a given protein that display enhanced binding, increased solubility, or reduced toxicity. Despite advances in protein structure prediction, predicting protein properties from sequence alone remains important, so researchers need a way to generate comparable sequences quickly and at scale from an input sequence. In this blog post, we propose a solution based on similarity search and a pre-trained model that produces embeddings as semantic representations of protein sequences; the code for the solution is available in a public repository. The model, built on the T5-3B architecture, was refined through self-supervised training on a vast repository of protein sequences.

Before exploring the solution, it is important to understand what embeddings are and why they matter here. An embedding is a concise numerical summary of an object: a fixed-dimensional vector that distills the object’s essential characteristics into a compact representation that is efficient to process and analyze. Beyond reducing dimensionality, embeddings capture and encode inherent properties. Objects with analogous characteristics, whether phrases or proteins, yield vectors that lie close together, and that proximity correlates directly with the similarity of their attributes. This proximity is what enables effective similarity searches, making embeddings a valuable tool for uncovering relationships and patterns within massive datasets.

As an analogy, consider fruit. In an embedding space, mandarins and oranges cluster together because they share characteristics: both are roughly spherical, have similar hues, and exhibit comparable nutritional profiles. Bananas and plantains likewise end up close to each other. Embeddings let us capture and uncover these kinds of relationships numerically.
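To make this concrete, here is a minimal, purely illustrative sketch: the three-dimensional "fruit" vectors below are invented for this example (real protein embeddings have hundreds of dimensions), and cosine similarity is used as the proximity measure.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for vectors pointing in similar directions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings with made-up dimensions (roundness, sweetness, citrus character).
mandarin = np.array([0.90, 0.70, 0.95])
orange   = np.array([0.95, 0.60, 0.90])
banana   = np.array([0.20, 0.80, 0.05])

print(cosine_similarity(mandarin, orange))  # high: the two fruits sit close together
print(cosine_similarity(mandarin, banana))  # lower: they sit further apart
```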

ProtT5-XL-UniRef50 is a machine learning model engineered to understand the language of proteins, converting protein sequences into rich, multi-dimensional embeddings that support downstream analysis and applications. These embeddings capture biological properties, so proteins with similar characteristics or structures are encoded close to one another in the embedding space. Embedding proteins directly into vectors in this way provides the foundation for effective similarity searches, enabling the discovery of promising drug targets and insights into protein characteristics.

Precomputed embeddings for known proteins are readily available for download, and we use them in this setup. Given a novel protein sequence, you can use a pre-trained language model such as ProtT5-XL-UniRef50 to generate its embedding, and then use that embedding to identify known proteins with analogous characteristics.
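As a sketch of how such embeddings can be generated with the publicly available checkpoint on Hugging Face (Rostlab/prot_t5_xl_uniref50), the following follows the model card's usage pattern; the example sequence is invented, and the preprocessing (space-separated residues, rare amino acids mapped to X) is taken from that documentation.

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Load the pretrained ProtT5-XL-UniRef50 encoder (checkpoint name per the public model card).
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical example sequence

# ProtT5 expects space-separated residues; rare amino acids (U, Z, O, B) are mapped to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings: one 1024-dimensional vector per residue (the trailing special token is dropped).
per_residue = outputs.last_hidden_state[0, : len(sequence)]
# Per-protein embedding: mean over the per-residue embeddings.
per_protein = per_residue.mean(dim=0)
print(per_protein.shape)  # torch.Size([1024])
```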

At a high level, the solution and its parts work as follows: we explain what embeddings are and describe the specific model we use, show how to deploy that model, show how to use OpenSearch Service as a vector database, and finally present examples of similarity searches on protein sequences.

Solution overview

Let’s walk through the solution and its components. The code for this solution is publicly available.

  1. We use the vector database capabilities of OpenSearch Service to store a repository of approximately 20,000 pre-calculated protein embeddings. These power the k-nearest neighbor (k-NN) searches used to find similar proteins; OpenSearch Service provides several industry-standard k-NN algorithms, offering both performance and flexibility (a sketch of this index setup appears after the results below).
  2. The open source machine learning model used to compute protein embeddings is publicly available. We use the SageMaker SDK to quickly customize and deploy it as an inference endpoint.
  3. Once the model is deployed, it can process any protein sequence and generate embeddings for it. With these embeddings, you can perform similarity searches against the database of protein embeddings preloaded into OpenSearch Service.
  4. We use a SageMaker Studio notebook to show how to deploy the model on SageMaker and then use the endpoint to extract protein features in the form of embeddings.
  5. After generating embeddings in real time with the SageMaker endpoint, we query OpenSearch Service to retrieve the top 5 most similar proteins currently indexed in the database (a query sketch appears after the results below).
  6. The user can then view the results directly in the SageMaker Studio notebook.
  7. To assess the effectiveness of the similarity search, we choose a protein and compute its embeddings. The model provides per-residue embeddings, one vector for each amino acid in the protein. To capture the overall structure, function, and attributes of the protein, we compute a per-protein embedding by applying a dimensionality reduction: taking the mean over all per-residue embeddings. Finally, we use the resulting embedding to run a similarity search, which returns the following top 5 proteins ranked by similarity:
    • Immunoglobulin heavy variable 3/OR15-3A
    • T cell receptor gamma (variable region)
    • T cell receptor alpha (TCRA)
    • T cell receptor alpha chain (TRA)
    • T cell receptor alpha

T cell receptors are a distinct subtype of immunoglobulin found on T cells. The similarity between the query protein and the retrieved proteins suggests that they could share analogous biological functions.
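As a rough illustration of how the pre-calculated embeddings can be stored (step 1 above), the following sketch uses the opensearch-py client to create a k-NN index and add one document; the endpoint, credentials, index name, and field names are placeholders, not the solution's actual configuration.

```python
from opensearchpy import OpenSearch

# Placeholder connection details; a real deployment would use the domain endpoint
# and appropriate authentication (for example, SigV4 signing or fine-grained access control).
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

index_name = "protein-embeddings"  # hypothetical index name
index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on this index
    "mappings": {
        "properties": {
            "protein_name": {"type": "keyword"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,  # size of a ProtT5-XL per-protein embedding
                "method": {"name": "hnsw", "space_type": "cosinesimil", "engine": "nmslib"},
            },
        }
    },
}
client.indices.create(index=index_name, body=index_body)

# Index one pre-computed embedding; in practice the ~20,000 embeddings would be bulk-indexed.
embedding_vector = [0.0] * 1024  # stand-in for a real 1024-dimensional embedding
client.index(index=index_name, body={"protein_name": "EXAMPLE_PROTEIN", "embedding": embedding_vector})
```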
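And a corresponding sketch of steps 4 through 7: invoking a SageMaker endpoint that hosts the embedding model, mean-pooling the per-residue embeddings into a per-protein vector, and running a k-NN query for the 5 nearest neighbors. The endpoint name, request/response format, and index fields are assumptions, not the exact interface used in the solution's notebook.

```python
import json
import boto3
import numpy as np

runtime = boto3.client("sagemaker-runtime")

# Invoke the deployed embedding endpoint (endpoint name and payload schema are hypothetical).
response = runtime.invoke_endpoint(
    EndpointName="prott5-embeddings-endpoint",
    ContentType="application/json",
    Body=json.dumps({"sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}),
)
per_residue = np.array(json.loads(response["Body"].read())["embeddings"])

# Dimensionality reduction: average the per-residue embeddings into one per-protein vector.
query_vector = per_residue.mean(axis=0).tolist()

# k-NN query against the pre-populated index (client as created in the previous sketch).
knn_query = {"size": 5, "query": {"knn": {"embedding": {"vector": query_vector, "k": 5}}}}
results = client.search(index="protein-embeddings", body=knn_query)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["protein_name"], hit["_score"])
```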

Pricing and cleanup

Creating an OpenSearch Service domain incurs charges based on the selected instance type and usage; see Amazon OpenSearch Service pricing for details. You are also billed for the SageMaker endpoint created by the Deploy and Similarity Search notebook, which currently uses an ml.g4dn.8xlarge instance type; see Amazon SageMaker pricing for details.

You are also billed for SageMaker Studio notebooks according to the instance type used, as outlined in Amazon SageMaker pricing.

To avoid ongoing charges, clean up the resources created by this solution when you are finished, including the OpenSearch Service domain and the SageMaker endpoint.
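A minimal cleanup sketch, assuming the endpoint, endpoint configuration, and domain names below (they are placeholders for whatever names your deployment used):

```python
import boto3

# Delete the SageMaker endpoint and its configuration (names are placeholders).
sagemaker = boto3.client("sagemaker")
sagemaker.delete_endpoint(EndpointName="prott5-embeddings-endpoint")
sagemaker.delete_endpoint_config(EndpointConfigName="prott5-embeddings-endpoint-config")

# Delete the OpenSearch Service domain that held the vector index.
opensearch = boto3.client("opensearch")
opensearch.delete_domain(DomainName="protein-similarity-search")
```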

Conclusion

In this blog post, we described a solution that generates protein embeddings and performs similarity searches to identify analogous proteins.

We deployed the open source model on Amazon SageMaker Inference and used it to calculate embeddings. We chose Amazon OpenSearch Service as our vector database and pre-populated it with approximately 20,000 human proteins from a comprehensive public source. Finally, we validated the solution through a similarity search on a chosen protein: all of the proteins retrieved from OpenSearch Service fell within the immunoglobulin family and exhibit similar biological functions. The code for this solution is publicly available.

Further refinement is possible by experimenting with alternative OpenSearch Service k-NN algorithms and by scaling the solution, incorporating additional protein embeddings into the OpenSearch Service indexes.
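For instance, a different k-NN configuration could be tried when creating the index; the mapping fragment below (a sketch, with parameter values chosen only for illustration) swaps the earlier nmslib HNSW method for the Lucene engine's HNSW implementation with larger graph parameters.

```python
# Alternative knn_vector mapping for the "embedding" field (illustrative values).
alternative_embedding_mapping = {
    "type": "knn_vector",
    "dimension": 1024,
    "method": {
        "name": "hnsw",
        "engine": "lucene",            # Lucene-based approximate k-NN instead of nmslib
        "space_type": "cosinesimil",
        "parameters": {"ef_construction": 256, "m": 48},
    },
}
```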

Sources:

  • Elnaggar A, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing”. 2020.
  • Mikolov, Tomáš; Yih, Wen-tau; Zweig, Geoffrey. “Linguistic Regularities in Continuous Space Word Representations”, pp. 746–751. 2013.

About the Authors

Serves as a Senior Solutions Architect at Amazon Web Services (AWS). He is a tech enthusiast with a passion for helping healthcare and life science startups unlock the full potential of the cloud. A specialist in cloud technologies, he supports startup success by deploying well-suited cloud solutions, and he is enthusiastic about exploring the possibilities opened up by generative AI applications.

Serves as the EMEA Technology Lead for Healthcare and Life Sciences startups at Amazon Web Services (AWS). With more than 15 years of experience in developing and deploying machine learning, high-performance computing, and scientific computing platforms, he has a proven track record in academia, healthcare, and the pharmaceutical industry.
