User:Ganeshmanohar/sandbox

The coding sequences of eukaryotic genes are split into short coding segments (exons) separated by long non-coding sequences (introns). Because the split gene structure is central to eukaryotic biology, the questions of why, how and when introns entered eukaryotic genes, what intron sequences are, and why eukaryotic genes are split are extremely important.

Periannan Senapathy proposed the “split gene” theory to explain the origin of introns. This theory provides comprehensive and tenable solutions to the key questions concerning the split genes, including the exons, introns, splice junctions, branch points and the entire split gene architecture, based on the origin of split genes from random genetic sequences. It also provides possible solutions to the origin of the spliceosomal machinery, the nuclear boundary and the eukaryotic cell.

The details of how the split gene theory was formulated, and how this theory is corroborated in every aspect of the genetic elements of the eukaryotic gene by the published literature, are provided below.

Background
Genes of all organisms, except bacteria, consist of short protein-coding regions (exons) interrupted by long intervening sequences (introns). When a gene is expressed, its DNA sequence is copied into a “primary RNA” sequence by the enzyme RNA polymerase. The “spliceosome” machinery then physically removes the introns from the RNA copy of the gene by the process of splicing, leaving only a contiguous series of exons, which becomes the “messenger” RNA (mRNA). This mRNA is then “read” by another cellular machine, the “ribosome,” to produce the encoded protein. Thus, although introns are not physically removed from the gene itself, the gene's sequence is read as if introns never existed.

Exons are usually very short, with an average length of about 120 bases (e.g. in human genes). Intron lengths vary widely, from 10 bases to 500,000 bases within a genome (for example, the human genome), but exon length has an upper limit of about 600 bases in most eukaryotic genes. Because exons code for protein sequences, they are very important for the cell, yet they constitute only ~2% of the genes’ sequences. Introns, in contrast, constitute 98% of the genes’ sequences but seem to have few crucial functions in genes, apart from containing enhancer sequences and developmental regulators in rare instances.

Until Philip Sharp of MIT and Richard Roberts, then at Cold Spring Harbor Laboratory (currently at New England Biolabs), discovered introns within eukaryotic genes in 1977, it was believed that the coding sequence of every gene lay in one single stretch, bounded by a single long open reading frame (ORF). The discovery of introns was a profound surprise to scientists and instantly raised the questions of how, why and when introns came into eukaryotic genes.

It soon became apparent that a typical eukaryotic gene was interrupted at many locations by introns, dividing the coding sequence into many short exons. Also surprising was that the introns were very long, some as long as hundreds of thousands of bases (see table below). These findings prompted the questions of why many introns occur within a gene (for example, ~312 introns occur in the human gene TTN), why they are very long, and why exons are very short. It was also discovered that the spliceosome machinery was very large and complex, with ~300 proteins and several snRNA molecules, so the questions extended to the origin of the spliceosome as well. Soon after the discovery of introns, it became apparent that the junctions between exons and introns on either side exhibited specific sequences that directed the spliceosome machinery to the exact base position for splicing. How and why these splice junction signals came into being was another important question to be answered.

Early speculations
The discovery of introns and the split gene architecture of eukaryotic genes was startling and started a new era of eukaryotic biology. The question of why eukaryotic genes had a genes-in-pieces architecture prompted speculation and discussion in the literature almost immediately.

Ford Doolittle of Dalhousie University published a paper in 1978 in which he expressed his views. He stated that most molecular biologists assumed that the eukaryotic genome arose from a ‘simpler’ and more ‘primitive’ prokaryotic genome rather like that of Escherichia coli. However, this type of evolution would require that introns be introduced into the contiguous coding sequences of bacterial genes. Regarding this requirement, Doolittle said, “It is extraordinarily difficult to imagine how informationally irrelevant sequences could be introduced into pre-existing structural genes without deleterious effects.” He stated, “I would like to argue that the eukaryotic genome, at least in that aspect of its structure manifested as ‘genes in pieces’ is in fact the primitive original form.”

James Darnell from the Rockefeller University also expressed similar views in 1978. He stated, “The differences in the biochemistry of messenger RNA formation in eukaryotes compared to prokaryotes are so profound as to suggest that sequential prokaryotic to eukaryotic cell evolution seems unlikely. The recently discovered non-contiguous sequences in eukaryotic DNA that encode messenger RNA may reflect an ancient, rather than a new, distribution of information in DNA and that eukaryotes evolved independently of prokaryotes.”

However, in an apparent attempt to reconcile their views with the idea that RNA preceded DNA in evolution, and with the concept of the three evolutionary lineages of archaea, bacteria and eukarya, both Doolittle and Darnell deviated from their original speculation in a paper they published together in 1985. They suggested that the ancestor of all three groups of organisms, the ‘progenote,’ had a genes-in-pieces structure, from which all three lineages evolved. They speculated that the precellular stage had primitive RNA genes with introns, which were reverse transcribed into DNA and formed the progenote. Bacteria and archaea evolved from the progenote by losing introns, and the ‘urkaryote’ evolved from it by retaining introns. Later, the eukaryote evolved from the urkaryote by evolving a nucleus and gaining mitochondria from bacteria. Multicellular organisms then evolved from the eukaryote.

These authors were able to predict that the distinctions between the prokaryote and the eukaryote were so profound that the prokaryote to eukaryote evolution was not tenable, and that both had different origins. However, other than the speculations that the precellular RNA genes must have had introns, they did not address the key questions of where from, how or why the introns could have originated in these genes or what their material basis was. There were no explanations of why exons were short and introns were long, how the splice junctions originated, what the structure and sequence of the splice junctions meant, and why eukaryotic genomes were large.

Around the same time that Doolittle and Darnell suggested that introns in eukaryotic genes could be ancient, Colin Blake of the University of Oxford and Walter Gilbert of Harvard University (who won the Nobel Prize for inventing a DNA sequencing method along with Fred Sanger) independently published their views on intron origins. In their view, introns originated as spacer sequences that enabled the recombination and shuffling of exons that encoded distinct functional domains in order to evolve new genes. Thus, new genes were assembled from exon modules that coded for functional domains, folding regions, or structural elements from preexisting genes in the genome of an ancestral organism, thereby evolving genes with new functions. They did not specify how the exons representing protein structural motifs originated, or how the introns, which do not code for proteins, originated. In addition, even after many years, extensive analysis of several thousand proteins and genes showed that genes only extremely rarely exhibit the supposed exon shuffling phenomenon. Furthermore, several molecular biologists questioned the exon shuffling proposal from a purely evolutionary view, for both methodological and conceptual reasons, and, in the long run, this theory did not hold up.

Hypothesis
Around the same time introns were discovered, Senapathy was asking how genes themselves could have originated. He surmised that for any gene to come into being, there must have been genetic sequences (RNA or DNA) present in the prebiotic chemistry environment. A basic question he asked was how protein-coding sequences could have originated from primordial DNA sequences at the initial development of the very first cells.

To answer this, he made two basic assumptions: (i) before a self-replicating cell could come into existence, DNA molecules were synthesized in the primordial soup by random addition of the 4 nucleotides without the help of templates and (ii) the nucleotide sequences that code for proteins were selected from these preexisting random DNA sequences in the primordial soup, and not by construction from shorter coding sequences. He also surmised that codons must have been established prior to the origin of the first genes. If primordial DNA did contain random nucleotide sequences, he asked: Was there an upper limit in the coding-sequence lengths, and, if so, did this limit play a crucial role in the formation of the structural features of genes at the very beginning of the origin of genes?

His reasoning was as follows. The average protein in living organisms, both eukaryotic and bacterial, is ~400 amino acids long. Much longer proteins also exist, from more than 10,000 up to ~30,000 amino acids, in both eukaryotes and bacteria. In bacterial genes, coding sequences of thousands of bases therefore exist in a single stretch. In eukaryotes, in contrast, the coding sequence exists only in short exon segments of about 120 bases, regardless of the length of the protein. If ORFs in random DNA sequences were as long as those in bacterial genes, then long contiguous coding genes could have occurred in random DNA. Whether this was so was not known, because the distribution of ORF lengths in random DNA sequence had never been studied.

Because random DNA sequences can be generated by computer, Senapathy reasoned that he could pose these questions and conduct his experiments computationally. Furthermore, when he began this work in the early 1980s, the National Biomedical Research Foundation (NBRF) database held just enough DNA and protein sequence information for the purpose.

Origin of introns and the split gene structure
Senapathy first analyzed the distribution of ORF lengths in computer-generated random DNA sequences. Surprisingly, this study revealed an upper limit of about 200 codons (600 bases) in the lengths of ORFs. The shortest ORF (zero bases in length) was the most frequent. As ORF length increased, frequency decreased exponentially, reaching almost zero at about 600 bases. A plot of the probability of ORF lengths in a random sequence showed the same pattern, tailing off at a maximum of about 600 bases. This “negative exponential” distribution also showed that most ORFs were far shorter than even the maximum of 600 bases.

This finding was surprising because the coding sequence for a protein of average length, 400 amino acids (~1,200 bases of coding sequence), and for longer proteins of thousands of amino acids (requiring >10,000 bases of coding sequence), would not occur in a single stretch in a random sequence. If this was true, a typical gene with a contiguous coding sequence could not originate in a random sequence. Thus, the only possible way that any gene could originate from a random sequence was to split the coding sequence into shorter segments and select these segments from the short ORFs available in the random sequence, rather than to increase the length of an ORF by eliminating numerous consecutively occurring stop codons. This process of choosing short segments of coding sequence from the available ORFs to make a long ORF would lead to a split structure of the gene.
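The kind of computer experiment described above can be sketched as follows. This is a minimal illustration, not Senapathy's original program: it assumes equal frequencies of the four bases, defines an ORF as the stop-free run of codons between two consecutive stop codons in one reading frame, and uses hypothetical function names (`random_dna`, `orf_lengths`, `tail_probability`). Under the same assumptions, 61 of the 64 codons are not stop codons, so the chance that a run reaches at least n codons is (61/64)^n, which the sketch compares with the simulated frequency.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def random_dna(length, rng):
    """Random DNA with equal base frequencies (a simplifying assumption)."""
    return "".join(rng.choices("ACGT", k=length))


def orf_lengths(seq, frame=0):
    """Lengths, in codons, of stop-free runs bounded by stop codons on both sides."""
    lengths = []
    run = 0
    seen_stop = False  # ignore the partial run before the first stop codon
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if seen_stop:
                lengths.append(run)
            seen_stop = True
            run = 0
        else:
            run += 1
    return lengths


def tail_probability(n_codons):
    """Theoretical chance that a run has at least n_codons non-stop codons."""
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    seq = random_dna(300_000, rng)
    lengths = [n for frame in range(3) for n in orf_lengths(seq, frame)]
    long_runs = sum(1 for n in lengths if n >= 200)
    print("ORFs counted:", len(lengths))
    print("longest ORF (bases):", 3 * max(lengths))
    print("simulated fraction >= 200 codons:", long_runs / len(lengths))
    print("theoretical fraction >= 200 codons:", tail_probability(200))
```

With equal base frequencies the theoretical fraction of runs reaching 200 codons (600 bases) is roughly 7 in 100,000, which is consistent with the distribution tailing off near 600 bases. Real genomes have skewed base composition, so the numbers from this sketch are only indicative.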

This split gene theory led to the Shapiro-Senapathy algorithm, which provides the methodology for detecting the splice sites, exons and split genes in eukaryotic DNA, and which is the main method for detecting splice site mutations in genes that cause hundreds of diseases in thousands of patients worldwide.

If this hypothesis was true, eukaryotic DNA sequences should show evidence for it. When Senapathy plotted the distribution of ORF lengths in eukaryotic DNA sequences, the plot was remarkably similar to that from random DNA sequence. This plot was also a negative exponential distribution that tailed off at a maximum of about 600 bases. This finding was amazing because the exons from eukaryotic genes also exhibited a maximum length of about 600 bases, which coincided exactly with the maximum length of ORFs observed in both random DNA sequence and in eukaryotic DNA sequence.

The split genes thus originated from random DNA sequences by choosing the best of the short coding segments (exons) and joining them by a process of splicing. The intervening intron sequences were left-over vestiges of the random sequences, and thus were earmarked to be removed by the spliceosome. These findings indicated that split genes could have originated from random DNA sequences with exons and introns as they are found in today's eukaryotic organisms. The Nobel Laureate Marshall Nirenberg, who deciphered the codons, stated that these findings strongly showed that the split gene theory for the origin of introns and the split structure of genes must be valid. New Scientist covered this publication in “A long explanation for introns”.

Noted molecular biologist Dr. Colin Blake of the University of Oxford, who proposed the Gilbert-Blake hypothesis in 1979 for the origin of introns (see above), stated in his 1987 publication entitled “Proteins, exons and molecular evolution” that Senapathy's split gene theory comprehensively explained the origin of the split gene structure. In addition, he stated that it answered several key questions, including the origin of the splicing mechanism:

“Recent work by Senapathy, when applied to RNA, comprehensively explains the origin of the segregated form of RNA into coding and non-coding regions. It also suggests why a splicing mechanism was developed at the start of primordial evolution. He found that the distribution of reading frame lengths in a random nucleotide sequence corresponded exactly to that for the observed distribution of eukaryotic exon sizes. These were delimited by regions containing stop signals, the messages to terminate construction of the polypeptide chain, and were thus non-coding regions or introns. The presence of a random sequence was therefore sufficient to create in the primordial ancestor the segregated form of RNA observed in the eukaryotic gene structure. Moreover, the random distribution also displays a cutoff at 600 nucleotides, which suggests that the maximum size for an early polypeptide was 200 residues, again as observed in the maximum size of the eukaryotic exon. Thus, in response to evolutionary pressures to create larger and more complex genes, the RNA fragments were joined together by a splicing mechanism that removed the introns. Hence, the early existence of both introns and RNA splicing in eukaryotes appears to be very likely from a simple statistical basis. These results also agree with the linear relationship found between the number of exons in the gene for a particular protein and the length of the polypeptide chain.”

Origin of splice junctions
Under the split gene theory, an exon would be defined by an ORF. It would require that a mechanism to recognize an ORF should have originated. As an ORF is defined by a contiguously coding sequence bounded by stop codons, these stop codon ends had to be recognized by this exon-intron gene recognition system. This system could have defined the exons by the presence of a stop codon at the ends of ORFs, which should be included within the ends of the introns and eliminated by the splicing process. Thus, the introns should contain a stop codon at their ends, which would be part of the splice junction sequences.

If this hypothesis was true, the split genes of today's living organisms should contain stop codons exactly at the ends of introns. When Senapathy tested this hypothesis in the splice junctions of eukaryotic genes, it was astonishing that the vast majority of splice junctions did contain a stop codon at the ends of introns, right outside of the exons. In fact, these stop codons were found to form the “canonical” GT:AG splicing sequence, with the three stop codons occurring as part of the strong consensus signals. Thus, the basic split gene theory for the origin of introns and the split gene structure led to the understanding that the splice junctions originated from the stop codons.

Surprisingly, all three stop codons (TGA, TAA and TAG) were found after one base (G) at the start of introns. These stop codons are shown in the consensus canonical donor splice junction as AG:GT(A/G)AGT, wherein TAA and TGA are the stop codons, and the additional TAG is also present at this position. At the ends of introns, besides the codon CAG, only TAG, which is a stop codon, was found. The canonical acceptor splice junction is shown as (C/T)AG:GT, in which TAG is the stop codon. These consensus sequences clearly show the presence of stop codons at the ends of introns bordering the exons in all eukaryotic genes, thus providing a strong corroboration for the split gene theory. Marshall Nirenberg, who was the referee for this paper, again stated that these observations fully supported the split gene theory for the origin of splice junction sequences from stop codons. New Scientist covered this publication in “Exons, Introns and Evolution”.
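The position of the stop codons within the consensus splice signals can be checked with a short sketch. It assumes the widely cited intron-side donor consensus GT(A/G)AGT and the intron-end acceptor consensus (C/T)AG, written here with the degenerate letters R (A or G) and Y (C or T). The helper names `expand` and `stop_at` are hypothetical and chosen only for this illustration.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Only the degenerate letters needed for these two consensus strings.
DEGENERATE = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT"}


def expand(consensus):
    """List every concrete sequence matched by a degenerate consensus string."""
    return ["".join(p) for p in product(*(DEGENERATE[c] for c in consensus))]


def stop_at(seq, offset):
    """Return the codon starting at offset if it is a stop codon, else None."""
    codon = seq[offset:offset + 3]
    return codon if codon in STOP_CODONS else None


# Donor: first six intron bases; the stop codon starts after the first base (G).
for seq in expand("GTRAGT"):
    print("donor   ", seq, stop_at(seq, 1))

# Acceptor: last three intron bases.
for seq in expand("YAG"):
    print("acceptor", seq, stop_at(seq, 0))
```

The donor expansions GTAAGT and GTGAGT carry TAA and TGA one base into the intron, and the acceptor expansions are CAG (not a stop codon) and TAG, matching the description above.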

Soon after the discovery of introns by Drs. Philip Sharp and Richard Roberts, it became known that mutations within splice junctions could lead to diseases. Senapathy showed that mutations in the stop codon bases (canonical bases) caused more diseases than the mutations in non-canonical bases.

Branch point (lariat) sequence
An intermediate stage in the process of eukaryotic RNA splicing is the formation of a lariat structure. It is anchored at an adenosine residue in the intron between 10 and 50 nucleotides upstream of the 3' splice site. A short conserved sequence (the branch point sequence) functions as the recognition signal for the site of lariat formation. During the splicing process, this conserved sequence towards the end of the intron forms a lariat structure with the beginning of the intron. The final step of the splicing process occurs when the two exons are joined and the intron is released as a lariat RNA.

Several investigators have identified branch point sequences in different organisms, including yeast, human, fruit fly, rat, and plants. Senapathy found that, in all of these branch point sequences, the codon ending at the branch point adenosine is consistently a stop codon. Notably, two of the three stop codons (TAA and TGA) occur at this position almost all of the time.

These findings led Senapathy to propose that the branch point signal originated from stop codons. The finding that two different stop codons (TAA and TGA) occur within the lariat signal with the branching point as the third base of the stop codons corroborates this proposal. As the branching point of the lariat occurs at the last adenine of the stop codon, it is possible that the spliceosome machinery that originated for the elimination of the numerously occurring stop codons from the primary RNA sequence created an auxiliary stop-codon sequence signal as the lariat sequence to aid its splicing function.

The small nuclear U2 RNA found in splicing complexes is thought to aid splicing by interacting with the lariat sequence. Complementary sequences for both the lariat sequence and the acceptor signal are present in a segment of only 15 nucleotides in U2 RNA. Further, the U1 RNA has been proposed to function as a guide in splicing to identify the precise donor splice junction by complementary base-pairing. The conserved regions of the U1 RNA thus include sequences complementary to the stop codons. These observations enabled Senapathy to predict that stop codons had operated in the origin of not only the splice-junction signals and the lariat signal, but also some of the small nuclear RNAs.

Gene regulatory sequences
Senapathy also proposed that the gene-expression regulatory sequences (promoter and poly-A addition site sequences) could have originated from stop codons. A conserved sequence, AATAAA, exists in almost every gene a short distance downstream from the end of the protein-coding message and serves as a signal for the addition of poly(A) in the mRNA copy of the gene. This poly(A) sequence signal contains a stop codon, TAA. A sequence shortly downstream from this signal, thought to be part of the complete poly(A) signal, also contains the TAG and TGA stop codons.

Eukaryotic RNA-polymerase-II-dependent promoters can contain a TATA box (consensus sequence TATAAA), which contains the stop codon TAA. Bacterial promoters exhibit, at -10 bases, a TATA box with a consensus of TATAAT (which contains the stop codon TAA) and, at -35 bases, a consensus of TTGACA (containing the stop codon TGA). Thus, the evolution of the whole RNA processing mechanism seems to have been influenced by the too-frequent occurrence of stop codons in DNA sequence, making the stop codons the focal points for RNA processing.
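The stop-codon triplets inside the regulatory consensus sequences named above can be located with a simple scan. This sketch only reports where each triplet sits in the four consensus strings quoted in this section; the dictionary labels and the function name `find_stop_codons` are illustrative choices, not established terminology.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")

# Consensus sequences quoted in the text above.
MOTIFS = {
    "poly(A) addition signal": "AATAAA",
    "eukaryotic TATA box": "TATAAA",
    "bacterial -10 element": "TATAAT",
    "bacterial -35 element": "TTGACA",
}


def find_stop_codons(seq):
    """Return (0-based position, triplet) for each stop-codon triplet in seq."""
    return [
        (i, seq[i:i + 3])
        for i in range(len(seq) - 2)
        if seq[i:i + 3] in STOP_CODONS
    ]


for name, motif in MOTIFS.items():
    print(f"{name:26s} {motif}  {find_stop_codons(motif)}")
```

The scan checks all three reading frames of each motif, because a regulatory signal has no defined frame.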

Stop codons are key parts of every genetic element in the eukaryotic gene


Senapathy's work based on his split gene theory has revealed that stop codons occur as key parts of every genetic element in eukaryotic genes. The table and figure above show that the key parts of the core promoter elements, the lariat (branch point) signal, the donor and acceptor splice signals, and the poly-A addition signal consist of one or more stop codons. This finding provides a strong corroboration for the split gene theory, which holds that the underlying reason for the complete split gene paradigm is the origin of split genes from random DNA sequences, wherein nature used the randomly distributed, extremely frequent stop codons to define these genetic elements.

Why are exons short and introns long?
Research based on the split gene theory sheds light on other basic questions about exons and introns. The exons of eukaryotes are generally short (human exons average ~120 bases, and can be as short as 10 bases), and introns are usually very long (averaging ~3,000 bases, and reaching several hundred thousand bases, for example in the genes RBFOX1, CNTNAP2, PTPRD and DLG2). Senapathy has provided a plausible answer to these questions, which has remained the only explanation so far. Under the split gene theory, exons of eukaryotic genes, if they originated from random DNA sequences, have to match the lengths of ORFs from random sequence, and should possibly be around 100 bases (close to the median length of ORFs in random sequence). The genome sequences of living organisms, for example the human, exhibit the same average exon length of about 120 bases and a longest exon length of about 600 bases (with few exceptions), which matches the length of the longest random ORFs.

If split genes originated in random DNA sequences, then introns would be long for several reasons. The stop codons occur in clusters leading to numerous consecutive very short ORFs, and longer ORFs that could be defined as exons would be rarer. Furthermore, the best of the coding sequence parameters for functional proteins would be chosen from the long ORFs in random sequence, which may occur rarely. In addition, the combination of the donor and acceptor splice junction sequences within short lengths of coding sequence segments that would define exon boundaries would occur rarely in a random sequence. These combined reasons would make introns very long compared to the lengths of exons.

Why are eukaryotic genomes large?
This work also explains why genomes are very large, for example the human genome with three billion bases, and why only a very small fraction of the human genome (~2%) codes for proteins and other regulatory elements. If split genes originated from random primordial DNA sequences, the genome would contain a significant amount of DNA represented by introns. Furthermore, a genome assembled from random DNA containing split genes would also include intergenic random DNA. Thus, the nascent genomes that originated from random DNA sequences had to be large, regardless of the complexity of the organism.

The observation that the genomes of several organisms, such as the onion (~16 billion bases) and the salamander (~32 billion bases), are much larger than that of the human (~3 billion bases), although these organisms are no more complex than humans, lends credence to the split gene theory. Furthermore, the finding that the genomes of several organisms are much smaller although they contain essentially the same number of genes as the human genome, such as those of C. elegans (genome size ~100 million bases, ~19,000 genes) and Arabidopsis thaliana (genome size ~125 million bases, ~25,000 genes), adds support to this theory. The split gene theory predicts that the introns in the split genes of these genomes could be “reduced” (or deleted) forms of the long introns found in larger genes, thus leading to reduced genomes. In fact, researchers have recently proposed that these smaller genomes are actually reduced genomes, which adds support to the split gene theory.

Origin of the spliceosomal machinery and eukaryotic nucleus
Senapathy's research also addresses the origin of the spliceosomal machinery that edits out the introns from the RNA transcripts of genes. If the split genes had originated from random DNA, then the introns would have become an unnecessary but integral part of the eukaryotic genes along with the splice junctions at their ends. The spliceosomal machinery would be required to remove them and to enable the short exons to be linearly spliced together as a contiguously coding mRNA that can be translated into a complete protein. Thus, the split gene theory shows that the whole spliceosomal machinery originated due to the origin of split genes from random DNA sequences, and to remove the unnecessary introns.

As noted above, Colin Blake, the author of the Gilbert-Blake theory for the origin of introns and exons, states, “Recent work by Senapathy, when applied to RNA, comprehensively explains the origin of the segregated form of RNA into coding and noncoding regions. It also suggests why a splicing mechanism was developed at the start of primordial evolution.”

Senapathy also proposed a plausible mechanistic and functional rationale for why the eukaryotic nucleus originated, a major question in biology. If the transcripts of the split genes and the spliced mRNAs were present together in a cell without a nucleus, the ribosomes would try to bind to both the unspliced primary RNA transcript and the spliced mRNA, which would result in molecular chaos. A boundary separating the RNA splicing process from mRNA translation would avoid this problem. This is exactly what is found in eukaryotic cells, where the splicing of the primary RNA transcript occurs within the nucleus, and the spliced mRNA is transported to the cytoplasm, where the ribosomes translate it into protein. The nuclear boundary provides a clear separation of primary RNA splicing and mRNA translation.

Origin of the eukaryotic cell
These investigations thus led to the possibility that primordial DNA with essentially random sequence gave rise to the complex structure of the split genes with exons, introns and splice junctions. They also predict that the cells that harbored these split genes had to be complex, with a nuclear-cytoplasmic boundary, and must have had a spliceosomal machinery. Thus, it was possible that the earliest cell was complex and eukaryotic. Surprisingly, findings from extensive comparative genomics research on several organisms over the past 15 years overwhelmingly indicate that the earliest organisms could have been highly complex and eukaryotic, and could have contained complex proteins, exactly as predicted by Senapathy's theory.

The spliceosome is a highly complex machine within the eukaryotic cell, containing ~200 proteins and several snRNPs. In their paper [34] “Complex spliceosomal organization ancestral to extant eukaryotes,” molecular biologists Lesley Collins and David Penny state “We begin with the hypothesis that ... the spliceosome has increased in complexity throughout eukaryotic evolution. However, examination of the distribution of spliceosomal components indicates that not only was a spliceosome present in the eukaryotic ancestor but it also contained most of the key components found in today's eukaryotes. ... the last common ancestor of extant eukaryotes appears to show much of the molecular complexity seen today.” This suggests that the earliest eukaryotic organisms were highly complex and contained sophisticated genes and proteins, as the split gene theory predicts.

Origin of bacterial genes
Based on the split gene theory, only genes split into short exons and long introns, with a maximum exon length of ~600 bases, could have occurred in random DNA sequences. Genes with long uninterrupted coding sequences, thousands of bases long and in some cases from more than 10,000 up to 90,000 bases, as occur in many bacterial organisms, were practically impossible in random DNA. However, bacterial genes could have originated from split genes by losing introns, which seems to be the only way to arrive at long coding sequences. It is also a more plausible route than lengthening very short random ORFs into very long ORFs by specifically removing stop codons through mutation.

According to the split gene theory, this process of intron loss could have begun with split genes in prebiotic random DNA. The resulting contiguously coding genes could be tightly organized in bacterial genomes without any introns, making the genomes more streamlined. According to Senapathy, the nuclear boundary that was required for a cell containing split genes in its genome (see the section Origin of the spliceosomal machinery and eukaryotic nucleus, above) would not be required for a cell containing only contiguously coding genes. Thus, bacterial cells did not develop a nucleus. Based on the split gene theory, eukaryotic genomes and bacterial genomes could have originated independently from the split genes in primordial random DNA sequences.

The Shapiro-Senapathy algorithm
Based on the split gene theory, Senapathy developed computational algorithms to detect the donor and acceptor splice sites, exons and a complete split gene in a genomic sequence. To identify the splice sites in a given sequence, he developed the position weight matrix (PWM) method, based on the frequency of the four bases at each position of the donor and acceptor consensus sequences in different organisms. Furthermore, he formulated the first algorithm to find exons, based on the requirement that an exon be bounded by an acceptor sequence (at its 5’ end) and a donor sequence (at its 3’ end) and lie within an ORF, and another algorithm to find a complete split gene. These algorithms are collectively known as the Shapiro-Senapathy algorithm (S&S).
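The general idea of a position weight matrix can be illustrated with a minimal sketch. This is not the published S&S scoring scheme: it uses a log-odds score against a uniform background with a pseudocount, which is one common PWM variant, and the training sites in `TOY_DONOR_SITES` are invented for illustration rather than taken from any database. All function names are hypothetical.

```python
import math

BASES = "ACGT"

# Hypothetical aligned donor sites (3 exon bases + 6 intron bases), illustration only.
TOY_DONOR_SITES = [
    "CAGGTAAGT",
    "AAGGTGAGT",
    "CAGGTAAGA",
    "AAGGTAAGT",
    "CAGGTGAGG",
    "TAGGTAAGT",
]


def build_pwm(aligned_sites, pseudocount=1.0):
    """Log2-odds weight for each base at each position, uniform background assumed."""
    pwm = []
    for pos in range(len(aligned_sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in aligned_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in BASES})
    return pwm


def score(pwm, window):
    """Sum of the position-specific weights for one window of the PWM's width."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_sites(pwm, seq, top=3):
    """Score every window in seq and return the top (score, position, window) hits."""
    width = len(pwm)
    hits = [
        (score(pwm, seq[i:i + width]), i, seq[i:i + width])
        for i in range(len(seq) - width + 1)
    ]
    return sorted(hits, reverse=True)[:top]


if __name__ == "__main__":
    pwm = build_pwm(TOY_DONOR_SITES)
    query = "TTTCCCCAGGTAAGTCCCTTT"  # made-up sequence with one planted site
    for s, pos, window in best_sites(pwm, query):
        print(f"{s:7.2f}  pos={pos:2d}  {window}")
```

A real application would estimate the matrix from many verified splice sites of the organism in question and would choose a score threshold empirically; both are outside the scope of this sketch.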

The Shapiro-Senapathy algorithm aids in the identification of splicing mutations that cause numerous diseases and adverse drug reactions. Using the S&S algorithm, scientists have identified mutations and genes that cause numerous cancers, inherited disorders, immune deficiency diseases and neurological disorders. It is increasingly used in clinical practice and research, not only to find mutations in known disease-causing genes in patients but also to discover novel genes that cause different diseases. It is also used to define cryptic splice sites and to deduce the mechanisms by which mutations in them can disrupt normal splicing and lead to disease. In addition, it is employed to address various questions in basic research in humans, animals and plants.

Because this algorithm emanated from the split gene theory, its widespread use in biological research and clinical applications worldwide adds credence to the theory. Findings based on S&S have had an impact on major questions in eukaryotic biology and on their applications to human medicine. These applications may expand as clinical genomics and pharmacogenomics scale up their research through large sequencing projects such as the All of Us project, which aims to sequence a million individuals, and through the sequencing of millions of patients in clinical practice and research in the future.

Corroborating evidence
If the split gene theory is correct, the structural features of split genes predicted from computer-simulated random sequences can be expected to occur in actual eukaryotic split genes. This is what is found in most known split genes of eukaryotes living today. Eukaryotic sequences exhibit a nearly perfect negative exponential distribution of ORF lengths, with an upper limit of about 600 bases (with rare exceptions). Likewise, with rare exceptions, the exons of eukaryotic genes fall within this 600-base upper limit.
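The kind of random-sequence simulation referred to above can be illustrated with a short Python sketch. It assumes equal frequencies of the four bases and defines an ORF as the stop-free run between two consecutive stop codons in a single reading frame; both are simplifying assumptions, and the function names are hypothetical. Under these assumptions 3 of the 64 codons are stops, so the probability that a run contains at least n stop-free codons is (61/64)^n, which is the negative exponential (geometric) decay described in the text.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

def random_dna(length, rng):
    """Generate a random sequence with equal base frequencies (assumed)."""
    return "".join(rng.choices("ACGT", k=length))

def orf_lengths(sequence, frame=0):
    """Lengths in bases of stop-free runs between consecutive stops in one frame."""
    lengths = []
    run = 0
    seen_stop = False
    for i in range(frame, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOP_CODONS:
            if seen_stop:  # only count runs bounded by a stop on both sides
                lengths.append(run)
            seen_stop = True
            run = 0
        else:
            run += 3
    return lengths

def fraction_at_least(lengths, threshold):
    """Observed fraction of ORFs whose length is at least `threshold` bases."""
    return sum(1 for n in lengths if n >= threshold) / len(lengths)

def expected_fraction_at_least(threshold):
    """Theoretical fraction for uniform random DNA: (61/64) ** codons."""
    return (61 / 64) ** (threshold // 3)

if __name__ == "__main__":
    rng = random.Random(0)
    lengths = orf_lengths(random_dna(300_000, rng))
    for threshold in (60, 120, 300, 600):
        print(threshold,
              fraction_at_least(lengths, threshold),
              expected_fraction_at_least(threshold))
```

Comparing the observed and theoretical columns shows how quickly long stop-free runs become rare in random sequence, which is the basis for the upper limit on ORF length discussed here.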

Moreover, if this theory is correct, exons should be delimited by stop codons, especially at their 3’ ends (that is, the 5’ ends of introns). In most known genes, exons are indeed delimited in this way, more strongly at their 3’ ends and less strongly at their 5’ ends, as predicted. These stop codons are the most important functional parts of both splice junctions (the canonical bases GT:AG). The theory thus provides an explanation for the “conserved” splice junctions at the ends of exons and for the loss of these stop codons along with the introns when they are spliced out. If this theory is correct, splice junctions should be randomly distributed in eukaryotic DNA sequences, and they are. The splice junctions present in transfer RNA genes and ribosomal RNA genes, which do not code for proteins and in which stop codons have no functional meaning, should not contain stop codons, and this too is observed. The lariat signal, another sequence involved in the splicing process, also contains stop codons.
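The presence of stop codons within splice signals can be checked directly on short motifs. The Python sketch below reports every stop-codon triplet in a motif regardless of reading frame. The motifs are illustrative assumptions: two commonly cited forms of the donor consensus, a hypothetical acceptor instance ending in TAG, and TACTAAC, the yeast branch point consensus. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_codons_in(motif):
    """Return (offset, codon) for every stop-codon triplet in the motif, any frame."""
    motif = motif.upper()
    return [
        (i, motif[i:i + 3])
        for i in range(len(motif) - 2)
        if motif[i:i + 3] in STOP_CODONS
    ]

# Illustrative motifs (assumptions, not taken from the source text).
EXAMPLE_MOTIFS = {
    "donor CAG|GTAAGT": "CAGGTAAGT",
    "donor CAG|GTGAGT": "CAGGTGAGT",
    "acceptor (hypothetical instance)": "TTTCTTTTAGG",
    "yeast branch point": "TACTAAC",
}

if __name__ == "__main__":
    for name, motif in EXAMPLE_MOTIFS.items():
        print(name, stop_codons_in(motif))
```

A check like this shows only that a stop triplet occurs within a signal. Whether it lies in the reading frame of the adjoining exon depends on the phase of each individual exon.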

If the split gene theory is true, then introns should be non-coding. This is exactly what is found in eukaryotic organisms living today, even when introns are hundreds of thousands of bases long. Introns should also be mostly non-functional, apart from the sequences that aid in their removal: the donor and acceptor splice signals, the branch point sequences, and possibly intronic splice enhancers, all of which must occur within introns, especially near their ends. This is what is found to be true. In rare cases, introns may also contain sequences that fortuitously act as functional elements and are used by the genome and the cell. Again this is found to be true [REF]. These findings fulfill the requirement of the split gene theory that introns be non-coding and largely non-functional. Furthermore, they support the theory's concept that introns are random sequences that came from the primordial random DNA only to be eliminated. Taken together, these findings show that the predictions of the split gene theory concerning the structure and function of split genes in random DNA sequences are corroborated by the structural and functional characteristics of the major genetic elements of split genes in modern eukaryotic organisms.

If split genes originated from random primordial DNA sequences, as proposed in the split gene theory, there should be evidence that they were present in the earliest organisms. Using comparative analysis of modern genome data from several living organisms, scientists have found that the characteristics of split genes present in modern eukaryotes trace back to the earliest organisms that appeared on Earth. These studies show that the earliest organisms could have contained the intron-rich split genes and complex proteins that occur in today's living organisms.

In addition, using another computational method known as “maximum likelihood analysis,” scientists have found that the earliest eukaryotic organisms must have contained the same genes as today's living organisms, with an even higher density of introns. Furthermore, comparative genomics of many organisms, including basal eukaryotes (organisms considered to be primitive eukaryotes, such as Amoeboflagellata, Diplomonadida, and Parabasalia), has shown that intron-rich split genes, accompanied by a fully formed spliceosome like that of today's complex organisms, were present in the earliest organisms, and that the earliest organisms were extremely complex, with all of the eukaryotic cellular components.

These findings from the literature are as predicted by the split gene theory and provide strong support for it. The theory is corroborated by comparative analysis of actual eukaryotic gene sequences against computer-generated random DNA sequences. Furthermore, comparative analysis of genome data from many living organisms by several groups of scientists shows that the earliest organisms that appeared on Earth had intron-rich split genes coding for complex proteins and cellular components, such as those found in modern eukaryotic organisms. Thus, the split gene theory provides comprehensive solutions to the structural and functional features of the split gene architecture, with strong corroborating evidence from the published literature.