Viruses are everywhere. They drift through the air, seep into sewage, lakes, and oceans, and inhabit grasslands and decaying wood. Some thrive in extreme environments, such as the hot, chemical-rich waters surrounding hydrothermal vents, the freezing temperatures of Antarctic ice, and potentially even outer space.
They’re also ancient. Some may be as old as, if not older than, the earliest known cells.
Despite our long coexistence with viruses, the viral universe remains largely a mystery. Scientists have spent decades collecting samples worldwide and deciphering the genetic sequences within them. But because viruses mutate so rapidly, current methods barely scratch the surface of the vast virosphere.
Most viral genetic material remains biological “dark matter,” according to a recent study by Mang Shi and his team at Sun Yat-sen University.
With help from AI, the team is now casting a brighter light on this vast viral landscape. Called LucaProt, the AI relies on a large language model to decipher fragments of viral genomic data. A second algorithm breaks the genetic data into short, easily digestible chunks to make the search more effective.
After examining nearly 10,500 samples, drawn from both existing databases and newly collected data, the AI identified an astonishing 70,458 previously unknown RNA viruses from across the planet.
“As researchers delve deeper into their work, they may unexpectedly stumble upon issues that went unnoticed until then,” Artem Babaian from the University of Toronto noted.
Viruses have a bad reputation. The Covid-19 pandemic and the annual flu season are stark reminders of the damage they can do. But viruses are versatile and can also be put to work fighting disease, shuttling genetic material into cells, or serving as the basis for engineered vaccines.
By charting the viral universe from a bird’s-eye perspective, researchers can gain insight into the evolutionary dynamics and mutational patterns of viruses, with far-reaching implications that extend beyond biotechnology to inform strategies against future pandemics.
Going Viral
DNA contains the genetic instructions that shape an organism’s traits. First, DNA is transcribed into RNA, which carries the same information in an alphabet of four nucleotide bases. The RNA then serves as a template for protein synthesis, carrying the instructions to the ribosome, where amino acids are assembled into specific chains that form proteins.
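The DNA-to-RNA-to-protein flow described above can be illustrated with a minimal Python sketch. The codon table is deliberately partial (just the five codons used in the example), and the input sequence is made up for illustration; a real tool would use the full standard genetic code.

```python
# Partial codon table: only the codons needed for this toy example.
# A real implementation would include all 64 codons.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "UAA": "*",  # stop codon
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTTGGAAATAA"   # made-up example sequence
mrna = transcribe(dna)    # "AUGGCUUGGAAAUAA"
protein = translate(mrna)  # "MAWK"
print(mrna, protein)
```

An RNA virus skips the first step: its genome is already RNA, so only the translation half of this sketch applies.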
Viruses are different. Some dispense with DNA and encode their genetic instructions directly in RNA. That may seem unusual, but you’re likely familiar with several members of this family, including SARS-CoV-2, the RNA virus behind Covid-19. The proteins these viruses carry remain largely mysterious, but studying them often yields new insights into biology.
Scientists have spent decades trying to chart the virosphere through painstaking sample collection. Sources vary widely, ranging from the water of a nearby creek to extremes such as Antarctic ice or deep-sea water. RNA is extracted from each sample, sequenced, and deposited into databases. This technique, known as metagenomics, captures fragments of viral RNA from the surrounding environment.
Mining this genetic treasure trove takes significant effort. Older computational approaches struggle to extract meaningful patterns from such massive datasets.
Enter ESMFold. Developed by Meta, the system uses large language models, the same technology that powers OpenAI’s ChatGPT and Google’s Gemini, to predict protein structures from their amino acid sequences. Similar approaches, including DeepMind’s AlphaFold and David Baker’s RoseTTAFold, recently earned their developers prestigious awards.
ESMFold predicts the three-dimensional structure of a protein at the atomic level from its amino acid sequence. In the past year, scientists put the AI to work on one of its first tasks: predicting the structures of proteins from poorly understood microorganisms. Roughly 10 percent were unlike anything seen before.
Shi’s team asked whether the same approach could be applied to RNA viruses.
Panning for Viruses
The researchers used the AI to sift through an enormous dataset, equivalent to roughly 500 million high-definition images, and flag promising new RNA virus candidates.
The study focused on one protein in particular: RNA-dependent RNA polymerase (RdRP). RdRPs are a family of proteins encoded in the genomes of most RNA viruses, which rely on them to copy their genetic material. Earlier work identified over 132,000 new RNA viruses primarily by searching genetic data for RdRP sequences.
The problem? Viruses mutate quickly. As the genetic code for RdRPs drifts, algorithms trained on known patterns may fail to recognize new viruses whose RdRP sequences have diverged. The new study combined the established sequence-based approach with ESMFold to build a two-channel AI system.
Built on a transformer model akin to the one behind ChatGPT, the first channel extracts amino acid sequence “key phrases” from a vast database, focusing on viral RdRP sequences. After training on these patterns alongside several randomly generated sequences, the AI built a library of roughly 20,000 recurring protein sequence fragments characteristic of RdRPs.
Unlike previous approaches, the system also breaks the genetic data into manageable fragments, allowing the AI to process long sequences efficiently and pick out viral RdRPs more easily.
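One common way to break a long sequence into manageable fragments is a sliding window with overlap, sketched below in Python. The article does not say how LucaProt does its chunking, so the function name, window size, and stride here are illustrative assumptions rather than the study's actual settings.

```python
from typing import Iterator, Tuple


def sliding_windows(sequence: str, window: int, stride: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, fragment) pairs that together cover the whole sequence.

    `window` and `stride` are hypothetical parameters chosen for
    illustration; they are not taken from the study.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(sequence) <= window:
        # Short sequences fit in a single fragment.
        yield 0, sequence
        return
    start = 0
    while start + window < len(sequence):
        yield start, sequence[start:start + window]
        start += stride
    # Final fragment is aligned to the end so the tail is always covered.
    yield len(sequence) - window, sequence[-window:]


fragments = list(sliding_windows("ABCDEFGHIJ", window=4, stride=3))
print(fragments)
```

Choosing a stride smaller than the window makes neighboring fragments overlap, so a pattern that straddles a boundary in one fragment appears whole in the next.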
The second channel taps a version of ESMFold. This one is the slower, more careful reader. Rather than simply stringing together protein “phrases,” it examines each individual letter and predicts how each structural component interacts with its neighbors to form intricate 3D protein shapes. This step anchors the AI’s understanding of what RdRPs should look like in real viruses.
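The two-channel idea, one set of features from the sequence and another from structure, merged into a single prediction, can be sketched as a toy Python example. Everything here is a stand-in: the short motif vocabulary, the "structure" features, and the weights are hypothetical placeholders, whereas the real system uses a transformer and an ESMFold-derived model whose internals the article does not describe.

```python
import math
from collections import Counter
from typing import List, Sequence


def sequence_channel(protein: str, k: int = 2,
                     vocab: Sequence[str] = ("GD", "DD", "SG")) -> List[float]:
    """Toy channel 1: relative frequency of a few hypothetical k-mers."""
    kmers = Counter(protein[i:i + k] for i in range(len(protein) - k + 1))
    total = max(sum(kmers.values()), 1)
    return [kmers[v] / total for v in vocab]


def structure_channel(protein: str) -> List[float]:
    """Toy channel 2: placeholder for a structure-model embedding.

    Uses the hydrophobic fraction and a capped length feature purely
    for illustration.
    """
    hydrophobic = sum(protein.count(a) for a in "AVILMFWY")
    return [hydrophobic / max(len(protein), 1), min(len(protein) / 1000, 1.0)]


def fuse_and_score(protein: str, weights: Sequence[float], bias: float) -> float:
    """Concatenate both channels and apply a logistic layer."""
    features = sequence_channel(protein) + structure_channel(protein)
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights; a trained model would learn these from labeled data.
score = fuse_and_score("GDDSG", weights=[2.0, 2.0, 1.0, 0.5, 0.1], bias=-1.0)
print(round(score, 3))
```

The design point is that each channel can catch what the other misses: sequence motifs may drift as a virus mutates while the overall shape of the protein is conserved.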
The team trained LucaProt on roughly 6,000 sequences that encode RdRPs and 229,500 sequences known to encode other proteins. On a benchmark dataset where the correct answers were already known, the AI performed well, with a false positive rate of only 0.014 percent.
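To get a feel for what a 0.014 percent false positive rate means, here is a small Python calculation. The article does not give the size of the benchmark set, so applying the rate to a set of 229,500 non-RdRP sequences is an assumption made purely for illustration, and the confusion counts in the second call are hypothetical.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """FPR = FP / (FP + TN): the share of true negatives wrongly flagged."""
    return false_positives / (false_positives + true_negatives)


def expected_false_positives(rate_percent: float, n_negatives: int) -> float:
    """Expected number of wrongly flagged sequences at a given rate."""
    return rate_percent / 100 * n_negatives


# Assumption: a negative set the same size as the 229,500 non-RdRP sequences.
expected = expected_false_positives(0.014, 229_500)
print(f"about {expected:.0f} false positives")

# Going the other way with hypothetical counts: 32 errors among 229,500 negatives.
rate = false_positive_rate(32, 229_468)
print(f"{rate * 100:.4f} percent")
```

Under that assumption, only a few dozen ordinary proteins out of more than two hundred thousand would be mistaken for RdRPs.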
The AI went on to uncover 70,458 new, distinct RNA viruses.
One, isolated from soil, had an unexpectedly long genome, one of the longest of any RNA virus documented to date, according to the researchers. Others may thrive in scorching hot springs and extremely salty lakes.
The expanded virosphere adds a wide array of newly identified viruses to known viral groups, including, for example, the group behind hepatitis and yellow fever. LucaProt also revealed 60 viral groups that are highly distinct from all currently known viruses.
That is not to say these viruses cause disease. But it does mean they were overlooked in earlier RNA virus research.
The study uncovered “remotely located, isolated hotspots of RNA virus diversity nestled deep within the vast expanse of evolutionary history.”
A Viral Hit?
Viral survival hinges on finding a suitable host. The team is now improving the AI so it can predict which hosts the newly found viruses infect. Most RNA viruses infect eukaryotic organisms, including plants, animals, and humans. Viruses are also known to infect bacteria, an evolutionary arms race whose bacterial defenses inspired the CRISPR-Cas9 gene editing tool.
“The evolutionary history of RNA viruses stretches back far further in time than that of cellular organisms themselves,” the authors wrote.
Often overlooked is the third domain of life, archaea. Among the earliest forms of life on Earth, these organisms share striking parallels with both bacteria and eukaryotes, particularly in how they replicate their genetic material.
Archaea are a distinct branch of life, and many flourish in extreme environments such as hydrothermal vents or extremely salty waters. There are hints that RNA viruses might also infect archaea. If so, the finding could reshape our understanding of the tree of life and lead to new biotechnologies.