User:The Database of Genomic Variants

The Database of Genomic Variants
 * The Database of Genomic Variants is an online resource that contains a curated catalog of structural variation data from peer-reviewed scientific studies. Structural variation, consisting of microscopic and submicroscopic changes in the DNA of individuals, has recently been identified as an important class of genetic variation. This type of genetic alteration has implications in normal and disease biology. Common types of structural variation include small insertion/deletion polymorphisms (InDels), larger copy number variations (CNVs) and inversions.
 * The Database of Genomic Variants provides a comprehensive summary of structural variation in the genomes of healthy, clinically unaffected individuals.


 * The Database of Genomic Variants provides a useful catalog of control data for studies aiming to correlate genomic variation with phenotypic data. The database is continuously updated with new data from peer-reviewed research studies. The database is hosted by The Centre for Applied Genomics, which is located in the MaRS Discovery District in downtown Toronto, Canada.

History
Following the discovery of structural variation in 2004 as a common and pervasive source of genomic variation in control subjects, the database was launched to capture, display and integrate the data emerging in this field. As new technologies became available to scan the genomes of individuals at higher resolution, the number of samples and researchers working on this type of data grew quickly.

The application of array scanning technologies in clinical diagnostics helped to drive the standardization of ???, and the Database of Genomic Variants became an important resource for this application.

With increased usage and exponential growth in the number of studies, a partnership was established with groups at the European Bioinformatics Institute in Hinxton, UK. To establish standardized terminology, fully stable systems were put into place that allowed ...

Research
Studies included in the database:

Data Submission
The Database of Genomic Variants (DGV) no longer accepts direct submissions of data. We are part of a collaboration with two archival CNV databases, DGVa at EBI and dbVar at NCBI. As part of this collaborative effort, DGV obtains datasets from DGVa (short for DGV archive) rather than from submitters directly. This ensures that the three databases are synchronized and allows for official accessioning of variants.

To submit data, we therefore recommend that you contact either DGVa or dbVar and let them handle the archiving and accessioning of your data. Once the data is deposited in their system and the manuscript is published, we will import the variants into DGV.

For more information about this collaboration, please see the Correspondence published in Nature Genetics (PubMed ID: 20877315).

Frequently Asked Questions

 * How are the boundaries of CNVs identified and reported?


 * We report the start and end coordinates of the variant as reported in the original study. Depending on the method used for detection of the CNV the boundaries reported may be quite different from the actual underlying variant. This is obvious when looking at regions where a large number of different studies have reported the same variant. The data must therefore be interpreted with this in mind. An overlap between a reported CNV and a gene may therefore not be accurate, as the CNV may be much smaller than reported. Some studies also merge nearby variants into larger regions and this merging process may merge separate CNVs into one large variant.

Data from BAC clone CGH arrays:


 * Coordinates from studies using BAC arrays tend to overestimate the boundaries of CNVs. BACs are vectors containing large inserts of DNA, generally in the range of 150-250 kb in size. Studies detecting CNVs using this approach always report the start and end of the BAC clones that give a result indicative of a variant. However, BAC arrays are highly sensitive, and variants as small as 20-30 kb may be detected. A CNV of this size may therefore reside anywhere between the start and end coordinates of the clone, even though the actual variant is significantly smaller than the reported region.

Data from SNP arrays and oligonucleotide CGH arrays:


 * The probes on oligo and SNP arrays are very short and therefore do not suffer from the same bias as arrays with BAC clones. Overall, high-resolution SNP arrays tend to give more accurate boundary information and are more likely to underestimate than overestimate the size of CNVs.


 * How is a gain or loss defined, and how can I get the variant frequency?

There are several things to consider when interpreting CNV data and CNV genotypes. It is important to keep in mind that CNV data is always relative. A CNV call can be relative to a specific reference sample, a pool of reference samples or relative to the reference assembly. Since different reference samples may have been used in different studies, what is called as a gain in one study may actually be called a loss in another.
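As a minimal sketch of why a call depends on the reference, the following Python snippet classifies the same sample against two different references. The function name `relative_call` and the copy numbers are hypothetical and chosen only for illustration; they do not come from any study in the database.

```python
def relative_call(sample_copies: int, reference_copies: int) -> str:
    """Classify a sample's copy number relative to a chosen reference."""
    if sample_copies > reference_copies:
        return "gain"
    if sample_copies < reference_copies:
        return "loss"
    return "no change"


# The same sample (2 copies at a locus) compared against two
# hypothetical reference samples with 1 and 3 copies respectively.
call_vs_ref_a = relative_call(sample_copies=2, reference_copies=1)
call_vs_ref_b = relative_call(sample_copies=2, reference_copies=3)
print(call_vs_ref_a, call_vs_ref_b)
```

The sample itself is unchanged between the two calls; only the reference differs, which is why calls from different studies cannot be compared without knowing what each study used as its reference.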

Insertions and duplications:


 * Some gains in the database are annotated as only one base pair in size. This means that there is an insertion into the reference sequence at that coordinate. The estimated size of the insertion is described in the detailed information page for the variant. When gains are not annotated as an insertion into the reference, the highlighted region represents the sequence that is duplicated. Importantly, most current technologies provide no information about the location of the duplicated sequence, and it could theoretically be located anywhere in the genome. However, for most duplications that have been characterized in detail, the additional copy has been found in tandem with, or at least near, the original sequence.

CNV genotypes:


 * Another limitation of many studies to date is that they have not been able to correctly identify CNV genotypes. Calls are simply made as gains or losses relative to a given reference. The actual number of copies present, or whether gains or losses are homozygous or heterozygous, often cannot be accurately determined with existing tools. Therefore, the frequencies we report in the database are not allele frequencies, but simply counts of gains and losses for each variant (which have to be interpreted in relation to the total sample size of the study).

Frequencies:


 * The frequency of a variation is defined by the authors and can be a relative measure compared to the number of samples tested, or if there is genotype data available, this could be represented as an allele frequency.
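To illustrate how gain and loss counts can be read in relation to the sample size, here is a small Python sketch. The function `observed_frequency` and the numbers used are hypothetical; the result is the fraction of samples carrying a call, not an allele frequency.

```python
def observed_frequency(count: int, sample_size: int) -> float:
    """Fraction of samples in a study in which a gain or loss was called."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= count <= sample_size:
        raise ValueError("count must be between 0 and sample_size")
    return count / sample_size


# Hypothetical variant: 3 gains and 12 losses observed in 270 samples.
gain_frequency = observed_frequency(3, 270)
loss_frequency = observed_frequency(12, 270)
print(f"gains: {gain_frequency:.3f}, losses: {loss_frequency:.3f}")
```

Because the number of copies per sample is usually unknown, these fractions should not be converted to allele frequencies without genotype data.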


 * How do I compare the data in DGV to my patient cohort?

The database contains only data originally described in healthy controls. However, this does not mean the database should be used as a substitute for running a control set with your patient samples. The database is meant to serve as a guide. It will give information about whether there is a common variant in your region of interest. Just because a variant is annotated in the database does not mean that a similar variant cannot be disease-causing in your patient sample. Similarly, a lack of variants in a specific region of the database does not necessarily mean there are no common variants at that locus. Factors such as probe coverage and resolution may differ significantly between platforms. Since the boundaries of variants reported in DGV are often inaccurate, it is also often difficult to know for sure whether a variant found using a different experimental approach is exactly the same as one annotated in DGV. Some of the older studies are also less reliable and did not include an estimate of the false discovery rate. DGV therefore does contain some data that represent false positives. As a rule of thumb, regions identified in many studies or by independent methods are most likely real. Large variants identified in a single sample by a single study are either extremely rare variants or false positives.


 * What types of filters are applied to the data before they are added to DGV?

The data undergoes a systematic review prior to inclusion in the database. We run a number of quality assurance steps to ensure that high-quality data is presented to users. Many of the processing steps depend on the study or method applied; some of the more common steps are outlined here. Study-specific filters include removing specific variants at the author's request and removing variants detected in patient samples. If a study includes both cases and controls, we filter out all case-related data.

Chromosome Mapping:


 * Only variants mapped to one of the autosomes (1-22) or sex chromosomes (X,Y) are kept. Variants mapped to chrM, chrR, chr6_hap or chrUN are removed.

Variants mapped to chromosome Y in female samples are removed.

Merging:


 * For studies which have analysed multiple samples, DGV will merge sample level calls together that share a 70% reciprocal overlap measured by length and position.
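A 70% reciprocal overlap means that the shared region covers at least 70% of each of the two calls. The Python sketch below shows one way to compute this. It assumes 1-based inclusive coordinates and two calls on the same chromosome; the function names are hypothetical, and the sketch does not describe how DGV itself groups more than two calls.

```python
def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Return the smaller of the two overlap fractions for calls a and b.

    Coordinates are assumed to be 1-based and inclusive, and both calls
    are assumed to lie on the same chromosome.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    len_a = a_end - a_start + 1
    len_b = b_end - b_start + 1
    return min(overlap / len_a, overlap / len_b)


def should_merge(a_start: int, a_end: int, b_start: int, b_end: int,
                 threshold: float = 0.7) -> bool:
    """True if the two calls meet the reciprocal overlap threshold."""
    return reciprocal_overlap(a_start, a_end, b_start, b_end) >= threshold


# Two hypothetical sample-level calls that share most of their length.
print(should_merge(1000, 2000, 1200, 2100))
# Two hypothetical calls that only share a short region.
print(should_merge(1000, 2000, 1800, 3000))
```

Taking the minimum of the two fractions is what makes the measure reciprocal: a small call that lies entirely inside a much larger one overlaps 100% of its own length but only a small part of the larger call, so the pair is not merged.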

Size/Location:


 * Copy number variants of at least 50 bp and smaller than 3 Mb are kept. Inversions larger than 10 Mb are removed.

Variants which span gaps in the reference assembly are removed. Variants which correspond to Decipher Genomic Disorders are removed (> 70% shared length).
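The chromosome and size rules above can be sketched as a single predicate in Python. The `Variant` class, its field names, and the `kind` and `sex` labels are hypothetical and used only for illustration. The gap and Decipher filters are omitted because they require external annotation data, and lengths are computed from 1-based inclusive coordinates as an assumption.

```python
from dataclasses import dataclass

# Autosomes 1-22 plus the sex chromosomes, using "chr"-prefixed names.
ALLOWED_CHROMS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}


@dataclass
class Variant:
    chrom: str
    start: int             # 1-based, inclusive (assumed)
    end: int               # 1-based, inclusive (assumed)
    kind: str              # "cnv" or "inversion" (hypothetical labels)
    sex: str = "unknown"   # sex of the sample: "female", "male" or "unknown"

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def passes_filters(v: Variant) -> bool:
    """Apply the chromosome, chromosome Y and size rules described above."""
    if v.chrom not in ALLOWED_CHROMS:
        return False
    if v.chrom == "chrY" and v.sex == "female":
        return False
    if v.kind == "cnv":
        return 50 <= v.length < 3_000_000
    if v.kind == "inversion":
        return v.length <= 10_000_000
    return True


variants = [
    Variant("chr1", 100, 1099, "cnv"),            # 1,000 bp CNV on an autosome
    Variant("chrM", 100, 1099, "cnv"),            # mitochondrial, removed
    Variant("chrY", 100, 1099, "cnv", "female"),  # chrY in a female sample
    Variant("chr2", 1, 49, "cnv"),                # 49 bp, below the size limit
]
kept = [v for v in variants if passes_filters(v)]
print(len(kept))
```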


 * What if I want to look at the entries that have been filtered/removed from DGV?

You can obtain a GFF3 file of the filtered variants on the Downloads page, under the Filtered Variants heading.


 * Can I just look at variants found in HapMap samples?

Using the Query tool, go to the samples tab and filter by cohort. By selecting the HapMap cohort, and the filter all option, only data derived from the HapMap samples will be presented.


 * Why are some variants mapped to hg18 but not hg19?

When the variation data was mapped to hg19, we did our best to come up with a process that would result in a low error rate while maximizing the number of variants kept in hg19. Due to changes in the underlying assembly, some regions are rearranged while others contain novel sequence, which changes the structure of the region. In most cases the assembly has not changed enough to cause difficulty in remapping, but there are some regions where we could no longer map the variant accurately.


 * How do I cite the database?

The database was first described in Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. Detection of large-scale variation in the human genome. Nat Genet. 2004 Sep;36(9):949-51.

Funding Sources
DGV is funded by several agencies, including Genome Canada through the Ontario Genomics Institute, the Canadian Institutes of Health Research, and The Centre for Applied Genomics. Additionally, philanthropic donations are administered by The Hospital for Sick Children Foundation and The McLaughlin Centre.

Scientific Director
The Scientific Director of The Database of Genomic Variants is Dr. Stephen W. Scherer, Senior Staff Scientist in The Hospital for Sick Children's Research Institute, Director of the McLaughlin Centre, and a professor at the University of Toronto.

Scientific Advisory Board
High-level scientific oversight of DGV's scientific mandate and operations is provided through an external Scientific Advisory Board (SAB). The SAB members are:


 * Dr. Nigel Carter, PhD (Chair), Wellcome Trust Sanger Institute
 * Dr. Deanna Church, National Center for Biotechnology Information
 * Dr. Lars Feuk, Uppsala University
 * Dr. Paul Flicek, European Bioinformatics Institute
 * Dr. David Ledbetter, Emory University
 * Dr. Stephen Scherer, The Hospital for Sick Children