Thousands of previously unknown genes remain concealed within the enigmatic regions of our genome, often referred to as the “dark matter”.
Scientists have found that small DNA sequences, long considered evolutionary byproducts, can actually produce mini-proteins, potentially opening the door to a range of treatments, including vaccines and immunotherapies for deadly brain tumors.
The results, published as a preprint that has not yet been peer reviewed, come from an international consortium working to identify novel genes. Since the turn of the century, when the first draft of the human genome was completed, scientists have worked to decipher the genetic blueprint of life. Written in DNA’s four nucleotide bases, adenine (A), thymine (T), cytosine (C), and guanine (G), is a trove of information that could help explain some of our most vexing medical problems, including cancer.
The first results of the Human Genome Project came as a shock. Researchers identified fewer than 30,000 genes responsible for building and maintaining the human body, significantly fewer than previously estimated. Now, roughly 20 years later, as technology has advanced to enable increasingly sophisticated DNA sequencing and protein mapping, researchers are asking what that early count may have missed.
The new research fills a significant gap by exploring understudied regions of the genome. Labeled “non-coding,” these regions had not previously been linked to any proteins. By integrating multiple existing datasets, the researchers identified thousands of novel gene candidates that together encode approximately 3,000 short proteins.
While the function of these proteins remains unclear, initial studies suggest some may play a role in the development of a deadly childhood brain cancer. The research team is sharing its findings and methods with the global scientific community to facilitate further investigation. The approach is not limited to the human genome; it could also be applied to the genetic blueprints of other animal and plant species.
Although mysteries remain, the findings help complete the picture of the coding portion of the genome.
What’s in a Gene?
A genome is akin to an unedited manuscript without punctuation. Sequencing one is now relatively straightforward. Understanding it is the remaining challenge.
Ever since the Human Genome Project, researchers have searched for the precise sequences of DNA, known as “genes,” that dictate protein production. These DNA sequences are read in three-letter codons, each encoding a specific amino acid, the fundamental building block of proteins.
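To make the codon idea concrete, here is a minimal Python sketch that reads a DNA string three letters at a time and looks up each codon in the standard genetic code. The function name `translate` and the example sequence are illustrative assumptions, not anything taken from the study.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated with the
# bases in the order T, C, A, G, and "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Hypothetical helper for illustration: reads from the first base,
    three letters at a time, and stops at the first stop codon.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the protein ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGAAATGGTGA"))  # ATG, AAA, TGG, then the stop codon TGA
```

In this toy example, the twelve letters are read as four codons: three that each add one amino acid, and a final stop codon that ends the chain.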
When a gene is activated, it is transcribed into messenger RNA (mRNA). These molecules carry genetic information from DNA to the cell’s protein-making factory, the ribosome. Picture the ribosome as a sliced bun, with an RNA molecule running through it like a strip of crispy bacon.
When defining a gene, scientists first look for open reading frames: stretches bounded by specific DNA sequences that mark where a gene starts and stops. Like a search function, the framework scans the genome for potential genes, which are then validated in laboratory experiments against a range of criteria. One is whether the candidates can produce proteins longer than 100 amino acids. Sequences that meet the criteria are compiled into an internationally recognized database of validated genes.
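The scanning step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the annotation pipeline used for any real database: it treats ATG as the only start codon, checks only the three forward reading frames (real scans also cover the opposite strand), and the names `find_orfs` and `length_cutoff` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, length_cutoff=100):
    """Return (start, end, amino_acid_count) for open reading frames.

    Simplified sketch: an ORF runs from an ATG to the next in-frame stop
    codon, and is kept only if it encodes more than `length_cutoff`
    amino acids. `end` is the index just past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                amino_acids = (i - start) // 3  # stop codon not counted
                if amino_acids > length_cutoff:
                    orfs.append((start, i + 3, amino_acids))
                start = None  # look for the next ORF in this frame
    return orfs


tiny = "CCATGAAATTTGGGTAACC"  # made-up sequence with one 4-codon ORF
print(find_orfs(tiny))                   # conventional cutoff of 100
print(find_orfs(tiny, length_cutoff=3))  # relaxed cutoff
```

With the conventional cutoff the tiny reading frame is filtered out, and it appears only once the cutoff is relaxed. That is the sense in which a size threshold can hide very short protein-coding sequences.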
Genes that produce proteins have attracted the most attention because they help illuminate disease mechanisms and inform potential treatments. But much of our genome is “non-coding,” with vast segments that do not produce any recognizable proteins.
For years, these seemingly redundant stretches of DNA were dismissed as genetic detritus, fossilized remnants of our evolutionary history. Recent studies, however, are beginning to reveal their value. Some regulate when genes are switched on or off. Others, such as telomeres, protect DNA from degradation as it replicates during cell division and help ward off the effects of aging.
Still, the prevailing dogma held that these sequences do not make proteins.
A New Lens
Evidence is mounting that non-coding regions of DNA do contain functional protein-coding elements with significant implications for human health.
One study found that a tiny missing piece in a region previously thought to be non-coding DNA caused inherited digestive problems in newborns. In mice genetically engineered to mimic the condition, restoring the DNA segment, which is not yet classified as a gene, eased their symptoms. The findings highlight the need to look beyond traditional protein-coding genes in order to explain clinical findings, the researchers argue.
Dubbed non-canonical open reading frames (ncORFs), or “maybe-genes,” these short segments of DNA have turned up in various human cell types and diseases, hinting that they may have physiological functions.
In 2022, the consortium behind the new study began exploring this territory, hoping to expand our genetic vocabulary. Rather than sequencing genomes, the researchers examined datasets of RNA that was being translated into proteins in the ribosome.
The method captures the actual output of the genome, including extremely short amino acid chains once considered too small to count as proteins. The search yielded a catalog of more than 7,000 potential human genes, some of which produced microproteins that were later detected in cancer and heart cells.
At that stage, however, the researchers focused on the genetic sequences and did not investigate how much protein was actually produced or what those proteins did. So they expanded their collaboration, inviting protein science experts from more than 20 institutions around the world to help make sense of the “maybe-genes.”
The team also drew on several protein databases, including resources such as the Human Protein Atlas and the Proteomics Identifications (PRIDE) database, and added data from in vitro experiments that use the human immune system to detect protein fragments.
The team examined more than 7,000 “maybe-genes” across diverse cell types, including healthy cells, cancer cells, and immortalized cell lines grown in the laboratory. At least one-quarter of these “maybe-genes” were translated into more than 3,000 miniproteins. These are far smaller than typical proteins and have a remarkably consistent amino acid composition. The miniproteins also appear to be attuned to parts of the immune system, which could aid the development of vaccines, autoimmune treatments, and immunotherapies.
Some of these newly found miniproteins may not have a biological role at all. But the study gives scientists a new way to interpret their potential functions. For quality control, the team sorted each miniprotein into a tier based on the strength of the experimental evidence behind it, and made the results available for others to explore.
We’re only just beginning to explore our genome’s hidden potential. Many questions remain.
“One of the key strengths of our multi-consortium partnership is its ability to achieve a unified understanding on pressing issues that require innovative solutions.”
Some of the experiments used cancer cells, raising the possibility that certain “maybe-genes” are active only in those cells and not in normal ones. If so, should they really be called genes?
From here, AI methods may help speed up the analysis. While traditional gene annotation relies on manual review of the data, the researchers suggest that AI could analyze multiple datasets far faster to flag candidate genes.
How many more might scientists uncover? According to study author Thomas Martinez, a figure of 50,000 is within the realm of possibility.