Structural variation in the human genome



Structural variation in the human genome is operationally defined as genomic alterations that vary between individuals and involve DNA segments larger than 1 kilobase (kb). These alterations may be either microscopic or submicroscopic. This definition distinguishes structural variants from smaller variants under 1 kb in size, such as short deletions, short insertions, and single nucleotide variants.

The human genome is complex and has been shaped and modified over time by evolution. About 99.9% of the DNA sequence in the human genome is conserved between individuals from all over the world, but some variation does exist. Single nucleotide polymorphisms (SNPs) are considered to be the largest contributor to genetic variation in humans, since they are abundant and easily detectable. It is estimated that there are at least 10 million SNPs within the human population, but there are also many other types of genetic variants, and they occur at dramatically different scales. Variation between genomes in the human population ranges from single nucleotide polymorphisms to dramatic alterations in the human karyotype.

Human genetic variation contributes to the phenotypic differences between individuals in the human population. The different types of genetic variation are studied extensively in order to better understand their significance. These studies have led to discoveries associating genetic variants with certain phenotypes and with disease. Before DNA sequencing technologies were available, variation was studied and observed exclusively at a microscopic scale. At this scale, the only observations that could be made were differences in chromosome number and chromosome structure. Variants that are about 3 Mb or larger in size are considered microscopic structural variants. They are large enough to be visualized using a microscope and include aneuploidies, heteromorphisms, and chromosomal rearrangements. The introduction of DNA sequencing opened the door to finding smaller and far more numerous sequence variations, including SNPs and minisatellites. These also include small inversions, duplications, insertions, and deletions that are under 1 kb in size. The Human Genome Project successfully sequenced the human genome, which provided a reference human genome for the comparison of genetic variation. With improving sequencing technologies and the reference genome, more and more variants were found that were larger than 1 kb but smaller than microscopic variants. These variants, ranging from about 1 kb to 3 Mb in size, are considered submicroscopic structural variants. These more recently discovered structural variants are thought to play a very significant role in phenotypic diversity and disease susceptibility.
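The size classes described above can be summarised in a short Python sketch. The function name and the handling of the exact boundary values are assumptions made for illustration; the text gives the thresholds only approximately ("about 1 kb" and "about 3 Mb").

```python
KB = 1_000        # base pairs in one kilobase
MB = 1_000_000    # base pairs in one megabase


def classify_by_size(length_bp: int) -> str:
    """Assign a variant to a size class using the thresholds given in the text.

    Assumption: a variant of exactly 1 kb is counted as structural and a
    variant of exactly 3 Mb as microscopic; the text leaves both edges open.
    """
    if length_bp < 1:
        raise ValueError("variant length must be at least 1 bp")
    if length_bp < 1 * KB:
        return "small variant"
    if length_bp < 3 * MB:
        return "submicroscopic structural variant"
    return "microscopic structural variant"


if __name__ == "__main__":
    for length in (1, 400 * KB, 5 * MB):
        print(length, classify_by_size(length))
```

A single nucleotide variant (1 bp) falls in the first class, a 400 kb inversion in the second, and a 5 Mb rearrangement in the third.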

Types of structural variants
Structural variation is an important type of human genetic variation that contributes to phenotypic diversity. Microscopic and submicroscopic structural variants include deletions, duplications, and large copy number variants, as well as insertions, inversions, and translocations. These types of structural variants are quite distinct from each other. A translocation is a chromosomal rearrangement, at the inter- or intra-chromosomal level, in which a section of a chromosome changes position with no change in the total DNA content. A section of DNA that is larger than 1 kb and occurs in two or more copies per haploid genome, in which the different copies share greater than 90% sequence identity, is considered a segmental duplication or low-copy repeat. These are only a few of the types of structural variants known to exist in the human genome. A table visualizing these and other forms of structural variants is shown in Figure 1.
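The three criteria for a segmental duplication given above (length, copy number, and sequence identity) can be expressed as a simple check. This is a minimal Python sketch; the function and parameter names are hypothetical, and how sequence identity is measured in practice is outside its scope.

```python
def is_segmental_duplication(length_bp: int,
                             copies_per_haploid_genome: int,
                             sequence_identity: float) -> bool:
    """Apply the three criteria from the text.

    sequence_identity is a fraction between 0 and 1 (0.95 means 95%).
    """
    if not 0.0 <= sequence_identity <= 1.0:
        raise ValueError("sequence_identity must be between 0 and 1")
    return (
        length_bp > 1_000                     # larger than 1 kb
        and copies_per_haploid_genome >= 2    # two or more copies
        and sequence_identity > 0.90          # copies share more than 90%
    )


if __name__ == "__main__":
    print(is_segmental_duplication(25_000, 2, 0.97))
    print(is_segmental_duplication(25_000, 2, 0.85))
```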

An inversion is a section of DNA on a chromosome that is reversed in its orientation in comparison to the reference genome. Many studies have sought to identify inversions because they have been found to play a large role in many diseases. One study found that 40% of haemophilia A patients had an inversion in the factor VIII gene affecting a region 400 kb in size. The inversion breakpoint was found to lie near a segmental duplication, which is observed in many other inversion events.
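To make the idea of reversed orientation concrete, the following Python sketch applies an inversion to a toy reference sequence. It assumes the usual single-strand representation, in which an inverted segment reads as the reverse complement of the original segment. The function names and the 0-based, half-open coordinates are choices made for this illustration.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_inversion(reference: str, start: int, end: int) -> str:
    """Invert reference[start:end] (0-based, half-open) in place.

    The flanking sequence is unchanged; only the segment between the two
    breakpoints is replaced by its reverse complement.
    """
    if not 0 <= start < end <= len(reference):
        raise ValueError("breakpoints must satisfy 0 <= start < end <= len(reference)")
    inverted = reverse_complement(reference[start:end])
    return reference[:start] + inverted + reference[end:]


if __name__ == "__main__":
    ref = "AAACGTTT"
    print(apply_inversion(ref, 2, 5))
```

Applying the same inversion twice restores the reference, which is a quick way to check the breakpoint handling.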

It is difficult to completely understand how each structural variant is created. Repeated sequences on a chromosome have long been known to increase the probability of non-allelic homologous recombination. These repeated sequences can give rise to deletions, duplications, inversions, and inverted duplication chromosomes. The products of this mechanism are depicted in Figure 2. A study of the olfactory receptor gene clusters asked whether rearrangements of chromosome 8p were associated with the inverted repeated sequences. The researchers observed that the chromosomal rearrangements were caused by homologous recombination between the repeats on 8p (the 8p-reps). They therefore concluded that the olfactory receptor genes serve as the substrate for rearrangements at the intrachromosomal level. This discovery revealed the role that inverted duplicates play in the development of structural variants. Understanding the mechanisms by which structural variants are produced is important for understanding how these types of genetic variants develop.



Copy-number variation
Copy-number variants are defined as sections of DNA larger than 1 kb that are present in a variable copy number in comparison to the reference genome. This definition is broad and includes deletions, duplications, and large copy number variants. If a copy number variant is present in 1% or more of the population, it is also considered a copy-number polymorphism. One study of global variation in copy number examined the characteristics of copy number variants in the human genome. Copy number variation was known to be important, but at that time it had not yet been fully characterized. The researchers surveyed the genomes of 270 individuals from a variety of populations for copy number variants, using technologies such as SNP arrays. Their results showed that many copy number variants had distinct patterns of linkage disequilibrium and that copy number variation was present in all of the populations studied. The study concluded that 12% of the genome contained copy number variable regions (CNVRs). CNVRs were found to cover more of the DNA in each genome than single nucleotide polymorphisms. This was a remarkable discovery, since single nucleotide polymorphisms are known to be the most numerous variants in the human genome. In terms of the amount of sequence affected, however, this type of structural variant was found to have a larger presence in the human genome.
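The 1% criterion for a copy-number polymorphism amounts to a simple frequency test. The Python sketch below is illustrative only. The sample size of 270 matches the survey mentioned above, but the carrier counts are made up, and it assumes that "present in 1% or more of the population" is measured as the fraction of sampled individuals carrying the variant.

```python
def is_copy_number_polymorphism(carriers: int, sampled: int,
                                threshold: float = 0.01) -> bool:
    """Return True if the variant is seen in at least `threshold` of the sample.

    Assumption: frequency is measured per individual (carrier frequency),
    not per chromosome.
    """
    if sampled <= 0 or not 0 <= carriers <= sampled:
        raise ValueError("need sampled > 0 and 0 <= carriers <= sampled")
    return carriers / sampled >= threshold


if __name__ == "__main__":
    # Hypothetical carrier counts in a sample of 270 individuals.
    for carriers in (2, 3, 40):
        print(carriers, is_copy_number_polymorphism(carriers, 270))
```

With 270 individuals, a variant must be seen in at least three of them to reach the 1% threshold.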

Copy number variants continued to be studied, and further work revealed the extent of their presence and their significance. One study examined how duplicated copy number variants are organized and which types of duplication they represent. Copy number variation was known to play a large role in many human diseases, but at the time large-scale studies of these duplications had not been done. The researchers sequenced 130 breakpoints from 112 individuals carrying 119 known CNVs, using whole genome sequencing as well as next generation sequencing. They found that tandem duplications comprised 83% of the CNVs, while 8.4% were triplications, 4.2% were adjacent duplications, 2.5% were insertional translocations, and 1.7% were other complex rearrangements. Tandem duplications were therefore the most common type of duplication among the copy number variants in the population studied.

More work was needed on the mechanisms by which structural variants form. Another study focused on the mechanisms behind rare pathogenic copy number variants. Copy number variation was known to be important in genome structural variation and to contribute to human genetic disease, but the mechanisms producing most rare pathogenic copy number variants were not known. The researchers sequenced the breakpoint regions of many rare pathogenic copy number variants, in what was the largest and most in-depth analysis of its kind. They found that genomic architectural features were associated with about 81% of breakpoints. They concluded that rare, pathogenic tandem duplications and microdeletions do not arise in the human genome by chance. Instead, they arise from a variety of genomic architectural features. In other words, certain architectural features of the genome make the formation of particular rare and pathogenic structural variants physically possible and more probable.

Structural variation can be seen as an avenue of genome modification for adaptation by evolution. One study examined ancestral diet and the evolution of human amylase gene copy number. The consumption of starch became a major component of the human diet with the development of agricultural societies. Amylase is the enzyme that breaks down starch, and the copy number of its gene varies between individuals. These observations led to the question of whether differences in starch consumption between populations created natural selection pressure on the amylase gene. The researchers measured amylase protein expression in saliva from different populations and compared it to the amylase gene copy number in the corresponding genomes. They then compared the starch consumption of different populations to their amylase gene copy number. They found that amylase protein expression in saliva was higher in people with a higher amylase copy number, and that populations with high-starch diets tended to have a larger amylase gene copy number. These results suggest that structural variation has been involved in the evolution of the human population, through an increase in amylase copy number over time.

The 1000 Genomes Project sequenced the genomes of individuals from many populations, providing a large body of sequencing data for comparison and future studies. One study took advantage of this resource to examine differences in structural variation between genomes using whole genome sequence data. It was known that human diseases are affected by duplications and deletions and that copy number analysis is common, but multiallelic copy number variants (mCNVs) were not as well studied. The researchers analyzed 849 sequenced genomes from a variety of populations in the 1000 Genomes Project in order to find large mCNVs. They found that mCNVs account for most of the genetic variation in gene dosage compared to other structural variants, and that this diversity in gene dosage gives rise to variation in gene expression. The study underlined the great significance that structural variants, especially mCNVs, have for gene dosage, which leads to variable gene expression and phenotypic diversity in the human population.

Charcot-Marie-Tooth (CMT) disease
Several structural variants have been observed in the human genome that do not lead to any obvious phenotypic effects. Some, however, alter gene dosage, which can lead to genetic diseases or distinct phenotypes. Structural variants can affect gene expression directly, as with copy-number variants, or indirectly through position effects. These effects can have significant implications for susceptibility to disease. The first disease in which a gene dosage effect was observed, and which was recognized as an autosomal dominant disease arising from an inherited DNA rearrangement, was Charcot-Marie-Tooth (CMT) disease. Most of the associations found with CMT involved a 1.5 Mb tandem duplication in 17p11.2-p12 at the PMP22 gene. The proposed mechanism for the structural variation is shown in Figure 2. When an individual has three copies of the normal gene, the result is the disease phenotype. If the individual has only one copy of the PMP22 gene, on the other hand, the result is a clinically different condition, hereditary neuropathy with liability to pressure palsies. The differences in gene dosage create vastly different disease phenotypes, which reveals the significant role that structural variation has in phenotype and susceptibility to disease.
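The dosage relationship described above can be written as a small lookup. This Python sketch encodes only what the text states: three copies of PMP22 are associated with CMT and one copy with hereditary neuropathy with liability to pressure palsies. Treating two copies as the typical diploid state is an assumption of the sketch, and the function name is hypothetical.

```python
def pmp22_dosage_phenotype(copies: int) -> str:
    """Map PMP22 copy number to the phenotype described in the text."""
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    phenotypes = {
        1: "hereditary neuropathy with liability to pressure palsies",
        2: "typical dosage",  # assumption: two copies is the usual diploid state
        3: "Charcot-Marie-Tooth disease",
    }
    # Copy numbers other than 1, 2, or 3 are not discussed in the text.
    return phenotypes.get(copies, "not described in the text")


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, pmp22_dosage_phenotype(n))
```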

HIV susceptibility
Structural variation studies became increasingly popular as the possible roles and effects of structural variants in the human genome were discovered. Copy number variation is a very important type of structural variation and has been studied extensively. A study on the influence of the CCL3L1 gene on HIV-1/AIDS susceptibility tested whether the copy number of the CCL3L1 gene had any effect on an individual’s susceptibility to HIV-1/AIDS. The researchers measured CCL3L1 copy number in individuals from several different populations and compared it to their risk of acquiring HIV. They found an association between lower CCL3L1 copy number and increased susceptibility to HIV and AIDS: individuals who were more prone to HIV had a low copy number of CCL3L1. This association suggests that differences in copy number may play a significant role in HIV susceptibility. Another study, focused on the pathogenesis of human obesity, tested whether structural variation of the NPY4R gene was significant in obesity. Studies had previously shown that a CNV at 10q11.22 was associated with obesity, as were several other copy number variants. The CNV analysis revealed that copy number loss at 10q11.22, affecting the NPY4R gene, was much more frequent in the patient population. The control population, on the other hand, had more copy number gain in the same region. This led the researchers to conclude that the NPY4R gene plays an important role in the pathogenesis of obesity through its copy number variation. Studies of copy number variation and other structural variants have brought new insights into the significant roles that structural variants play in the human genome.

Schizophrenia
The factors that contribute to the development of schizophrenia have been studied extensively. A recent study examined the mechanism and genes responsible for schizophrenia development. It had previously been shown that variation at an MHC locus was associated with the development of schizophrenia. This study found that the association is caused partly by the complement component 4 (C4) genes, implying that allelic variants of the C4 genes contribute to the development of schizophrenia. The SNP haplotypes and the C4 alleles were in linkage disequilibrium, meaning that they segregated together. A single C4 structural variant was associated with many different SNP haplotypes, but each SNP haplotype was associated with only one C4 structural variant. This allowed the researchers to determine which C4 structural variant an individual carried by looking at the SNP haplotype. The results showed that the structural variants of C4 express the C4A protein at different levels, and that higher C4A expression was associated with higher rates of schizophrenia development. Different structural variant alleles of the same gene were thus shown to produce different phenotypes and different susceptibility to disease. These studies show the breadth of the involvement and significance of structural variation in the human genome. Its importance is demonstrated by its contribution to phenotypic diversity and disease susceptibility.

Future directions
Many studies have been conducted to better understand structural variation in the human genome. There have been great advances in the research, but its significance is still not fully understood. Several questions remain unanswered and call for further study. Current studies usually target “unique” areas of the genome and are not able to detect the phenotypic effect of structural variants in highly repetitive, duplicated, and complex genomic areas. These regions are very difficult to study with current genomic technology, but this may change with the future development of sequencing technologies. In order to better understand the phenotypic effect of structural variants, large databases of genotypes and phenotypes of individuals must be created so that accurate associations can be made. Large projects such as Deciphering Developmental Disorders, UK10K, and the International Standards for Cytogenomic Arrays Consortium have already paved the way by creating databases that make it easier for researchers to pursue these studies.

In addition, technology for creating induced pluripotent stem cells that carry specific diseases has continued to develop. This provides appropriate model systems in which to recreate disease-causing structural variants such as translocations, duplications, and inversions. Future advances in technology and large database efforts will help lead the way to higher-quality studies and a much better understanding of structural variation in the human genome.