Biology seems to be a field still shrouded in many mysteries, and the genetics of humanity is no exception. To this day, there are still issues that we cannot explain.
The C-value Paradox
The Great Chain of Being is a hierarchical ordering of all matter and life. God stands at the top, followed by nine levels of angels, then humans, and below them animals, plants, and minerals. The higher a being's position in the chain, the more attributes it possesses, including all the attributes of those positioned below it.
Fifty years ago, it was believed that the amount of DNA in the genome could rank organisms from top to bottom, much as the Great Chain of Being ranks everything.
The idea was that the more complex a species, the more genes it needed. In other words, the number of genes in the genome should increase in the order yeast, roundworms, flies, humans. Early data seemed at first to confirm this idea.
However, gradually, people began to realize that this line of thinking was not entirely accurate.
As more results emerged, the disconnection between DNA content and organism complexity was demonstrated repeatedly. The C-value (the size of an organism's haploid genome) varies greatly among species. It does not increase in step with complexity, and it shows considerable variation even within a single taxonomic group.
In animals, C-values differ by more than 3,300-fold. In terrestrial plants, they differ by about 1,000-fold. The range of genome sizes within many groups spans several orders of magnitude, and complexity from algae to mammals does not correlate positively with genome size.
In 1971, C. A. Thomas named this puzzle the C-value paradox. It is usually presented from three perspectives:
- (1) Some simple organisms have more DNA than complex organisms. Some relatively primitive organisms, such as gastropods, have a higher C-value than mammals. An amoeba has more than 200 times the amount of DNA per cell compared to humans. Some amphibians have genomes that are 25 times larger than that of humans.
- (2) The genome of any given organism seems to contain far more DNA than its predicted number of genes would require, meaning that the genome harbors a large amount of DNA beyond its genes and their regulatory sequences.
- (3) Some morphologically similar groups exhibit highly variable DNA content. This is particularly common in plants, such as rice (Oryza), millet, or onion (Allium), in which haploid genome size differs by a factor of 3 to 8 among similar species. Genes and regulatory sequences are expected to evolve slowly and be conserved, but the overall size of the genome can change rapidly over evolutionary time, as in the case of maize (Zea mays), whose genome expanded by about 50% over a span of 140,000 years.
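To make the ratios in the first point concrete, the following minimal sketch converts them into approximate genome sizes. It assumes a human genome of about 3 billion base pairs, the figure quoted for the Human Genome Project; the function and dictionary names are illustrative, and the results are rough implied sizes rather than measured values.

```python
# Assumption: the human genome is about 3 billion base pairs.
HUMAN_GENOME_BP = 3_000_000_000

# Fold differences relative to humans, as quoted in the text.
FOLD_VS_HUMAN = {
    "amoeba (lower bound)": 200,
    "large amphibian genome": 25,
}


def implied_genome_size(fold, reference_bp=HUMAN_GENOME_BP):
    """Return the genome size implied by a fold difference from the reference."""
    return fold * reference_bp


for name, fold in FOLD_VS_HUMAN.items():
    size = implied_genome_size(fold)
    print(f"{name}: about {size / 1e9:.0f} billion bp")
```

Under that assumption, the amoeba would carry at least 600 billion base pairs and the amphibian about 75 billion, which shows how far genome size can drift from any intuitive ranking of complexity.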
Years after the term C-value paradox was coined, the discovery of large amounts of non-coding DNA explained the second issue. This non-coding DNA was initially called junk DNA because it was thought to serve no purpose. In recent years, it has been discovered that much non-coding DNA has important functions. However, there will be a separate article focusing on this topic.
Non-coding DNA can explain the second issue, but it raises new questions. Which contributes more to biological complexity: coding genes or non-coding sequences? If we set aside the DNA that seems to have no function, does the number of coding genes correlate with biological complexity?
The G-value Paradox
The Human Genome Project (HGP) was officially launched in 1990. Its original goal was not only to determine all 3 billion base pairs of the human genome with the lowest possible error rate, but also to identify, within that mass of sequence data, all human genes and their sequences.
Today, the DNA sequences of humans are stored in databases that anyone can download via the Internet.
Genes are the segments of DNA that carry genetic information; they encode RNA or proteins. In the spring of 2000, molecular biologists began placing bets on how many genes would be found once the sequencing of the human genome was completed.
On April 14, 2003, the National Human Genome Research Institute (NHGRI), the U.S. Department of Energy (DOE), and their partners in the International Human Genome Sequencing Consortium announced the successful completion of the Human Genome Project. Using data from the HGP, scientists estimated that the human genome contains between 20,000 and 25,000 genes.
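The expectation had been that the number of genes in the genome would correlate with complexity, so that organisms could be ranked in the order yeast, roundworms, flies, humans. The human gene count turned out to be far lower than many had predicted and not much larger than that of the roundworm. This mismatch is an upgraded version of the C-value paradox and is referred to as the G-value paradox.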
The assumption behind this paradox is that humans are much more complex than the other fully sequenced eukaryotic organisms and must therefore have a correspondingly larger set of genes, which is difficult to justify based on the sequencing results. Interestingly, those who wanted humans to have more genes did not abandon the fight. They continued to publish rationalizing stories, attempting to prove that something was amiss.
One way out was to introduce a new concept, a measure of the information actually encoded by the genome, which led to the I-value. Several arguments have been put forward to show that a modest G-value does not necessarily mean less information:
1. The Combination of Genes
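As the number of genes in an organism increases, the number of ways in which the encoded proteins can combine and interact to perform complex functions grows at an accelerating rate. This is true for signaling and metabolic protein networks. Adding just 100 genes to our genome would create about 3.1 million additional pairwise combinations, a figure that corresponds to a baseline of roughly 31,000 genes; with 20,000 to 25,000 genes, the gain is about 2.0 to 2.5 million.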
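The pairwise-combination figure can be checked with a short calculation. The sketch below counts unordered gene pairs with the binomial coefficient and reports how many new pairs appear when 100 genes are added. The baseline of 31,000 genes is an assumption, chosen because it reproduces the quoted 3.1 million; the function name is illustrative.

```python
from math import comb


def added_pairs(baseline_genes, new_genes=100):
    """Number of new unordered gene pairs created by adding new_genes genes."""
    return comb(baseline_genes + new_genes, 2) - comb(baseline_genes, 2)


# Baselines: the 20,000 to 25,000 range quoted from the HGP,
# plus an assumed 31,000 that matches the 3.1 million figure.
for baseline in (20_000, 25_000, 31_000):
    print(f"{baseline:>6} genes + 100 -> {added_pairs(baseline):,} new pairs")
```

Algebraically the gain is 100 × N + 4,950 for a baseline of N genes, so it grows linearly with the size of the existing gene set while the total number of pairs grows quadratically.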
2. Other Functions per Gene
Our genome seems to encode a higher proportion of multifunctional proteins than the genomes of flies and worms. On average, each human protein has more unique biochemical configurations than a protein of C. elegans. Such proteins have been likened to a Swiss Army knife.
3. Alternative Splicing: From Genome to Transcriptome
According to the best available estimates, 59% of human genes are alternatively spliced during transcription. If we only consider the splice variants affecting the protein-coding region, we obtain about 69,000 different protein sequences encoded by our genome. That is roughly three times the 20,000 to 25,000 genes estimated from the HGP data. In contrast, the worm genome contains a smaller proportion of alternatively spliced genes, producing up to 25,000 proteins.
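A quick way to relate the 69,000 protein sequences to the gene count is to compute the ratio directly. The sketch below does this for both ends of the 20,000 to 25,000 range quoted from the HGP; the function name is illustrative, and the result depends entirely on which gene count is taken as the baseline.

```python
def fold_and_percent_increase(proteins, genes):
    """Return (proteins per gene, percent increase over the gene count)."""
    fold = proteins / genes
    return fold, (fold - 1) * 100


PROTEIN_SEQUENCES = 69_000  # distinct protein sequences quoted in the text

for gene_count in (20_000, 25_000):
    fold, pct = fold_and_percent_increase(PROTEIN_SEQUENCES, gene_count)
    print(f"{gene_count} genes: {fold:.2f} proteins per gene, +{pct:.0f}%")
```

With these inputs, the number of distinct proteins comes out at about 2.8 to 3.5 times the number of genes.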
4. Post-Translational Modifications: From Transcriptome to Proteome
After translation, many modifications can further increase the number of functionally distinct proteins encoded by a single gene. Common modifications include glycosylation, proteolytic cleavage, and phosphorylation. By comparing the human proteome (the full set of proteins in a cell) with the transcriptome (the full set of transcripts in a cell), we can estimate how prevalent this mechanism is.
5. Genetic Redundancy: G-value Inflation
About 40% of the loci in the nematode genome are the result of gene duplication, which may explain why it has a much higher G-value than Drosophila. In mice, removing a duplicated gene often has no effect, suggesting significant information redundancy between duplicated loci in mammalian genomes. Duplication therefore raises the G-value without adding a corresponding amount of information.
These explanations may resolve the G-value paradox for the time being. Most of them point to additional information carried by each gene, so we may underestimate the information encoded by the genome if we only count genes.
On the other hand, evolution is not a model of efficiency, and it has taken a convoluted path that has led to gene sets larger than what the organisms themselves require. It resembles a Rube Goldberg machine: “There may be a simpler way to encode our bodies and behaviors than what actually exists in our genes. Counting the number of genes may overestimate the information encoded by those genes.”
The relationship between the complexity of the instructions (genes) and the complexity of the products (organisms) is intricate. To understand the causes and correlates of genetic diversity among organisms, it is not enough to start with humans.
The Earth Biogenome Project (EBP) is a 10-year program that aims to sequence and catalog the genomes of all eukaryotic species currently described on Earth. It plans to establish an open database of DNA sequence information, and the project was officially launched on November 1, 2018.
For the first time, it will be possible to efficiently sequence the genomes of all known species and to use genomics to help discover the remaining 80% to 90% of species that are still unknown to science.