Vertebrate Genome Annotation Project

The Vertebrate Genome Annotation (VEGA) database is a biological database dedicated to assisting researchers in locating specific areas of the genome and annotating genes or regions of vertebrate genomes. The VEGA browser is based on Ensembl web code and infrastructure and provides a public curation of known vertebrate genes for the scientific community. The VEGA website is updated frequently to maintain the most current information about vertebrate genomes and attempts to present consistently high-quality annotation of all its published vertebrate genomes or genome regions. VEGA was developed by the Wellcome Trust Sanger Institute and is in close association with other annotation databases, such as ZFIN (The Zebrafish Information Network), the Havana Group and GenBank. Manual annotation is currently more accurate at identifying splice variants, pseudogenes, polyadenylation features, non-coding regions and complex gene arrangements than automated methods.

History
The Vertebrate Genome Annotation (VEGA) database was first made public in 2004 by the Wellcome Trust Sanger Institute. It was designed to view manual annotations of human, mouse and zebrafish genomic sequences, and it is the central cache for genome sequencing centers to deposit their annotation of human chromosomes. Manual annotation of genomic data is extremely valuable to produce an accurate reference gene set but is expensive compared with automatic methods and so has been limited to model organisms. Annotation tools that have been developed at the Wellcome Trust Sanger Institute (WTSI) are now being used to fill that gap, as they can be used remotely and so open up viable community annotation collaborations. The HAVANA and VEGA Projects were run by Dr. Jennifer Harrow of the Wellcome Sanger Institute. VEGA has been archived since February 2017 and the HAVANA team moved to EMBL-EBI in June 2017.

Human genome
The VEGA database is the central repository for the majority of genome sequencing centers to deposit their annotation of human chromosomes. Since the original VEGA publication, the number of human gene loci annotated has more than doubled to over 49,000 (September 2012 release), over 20,000 of which are predicted to be protein coding. The Havana Group, as part of the consensus coding sequence (CCDS) collaboration and the whole-genome extension of the ENCODE project, has fully manually annotated the human genome, which is available for reference, comparative analysis and sequence searches on the VEGA database. The final VEGA release was in February 2017 (release 68), and VEGA is now an archived site that will no longer be updated.

Other vertebrates
The VEGA database combines the information from individual vertebrate genome databases and brings it together to allow easier access and comparative analysis for researchers. The human and vertebrate analysis and annotation (Havana) team at the Wellcome Trust Sanger Institute (WTSI) manually annotates the human, mouse and zebrafish genomes using the Otterlace/ZMap genome annotation tool. The Otterlace manual annotation system comprises a relational database that stores manual annotation data and supports the graphical interface, ZMap, and is based on the Ensembl schema.

Zebrafish
The zebrafish genome is being fully sequenced and manually annotated. As of the September 2012 release, it has 18,454 annotated VEGA genes, of which 16,588 are projected protein-coding genes.

Mouse
As of the June 2012 release, the mouse genome has 23,322 annotated VEGA genes, of which 14,805 are projected protein-coding genes. The loci chosen for manual annotation are spread throughout the genome, but some regions have received more focus than others: chromosomes 2, 4, 11 and X have been fully annotated. The annotation shown in this release of VEGA is from a datafreeze taken on 19 March 2012, and the gene structures are presented in the merged mouse geneset shown in Ensembl release 67. VEGA also shows artificial loci generated by the mouse knockout programs.

Pig
As of the September 2012 release, the pig genome has 2,842 annotated VEGA genes, of which 2,264 are projected protein-coding genes. The pig major histocompatibility complex (MHC), also known as the swine leukocyte antigen complex (SLA), spans a 2.4 Mb region of submetacentric chromosome 7 (SSC7p1.1-q1.1). Implicated in the control of immune response and susceptibility to a range of diseases, the pig MHC plays a unique role in histocompatibility. Chromosomes X-WTSI and Y-WTSI are currently being annotated by Havana.

Dog, chimpanzee, wallaby, and gorilla
The dog genome has 45 annotated VEGA genes, of which 29 are projected protein-coding genes (February 2005 release). The chimpanzee genome has 124 annotated VEGA genes, of which 52 are projected protein-coding genes (January 2012 release). The wallaby genome has 193 annotated VEGA genes, of which 76 are projected protein-coding genes (March 2009 release). The gorilla genome has 324 annotated VEGA genes, of which 176 are projected protein-coding genes (March 2009 release).
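The gene counts quoted in the species sections above can be compared more easily as protein-coding percentages. The following is a minimal Python sketch of that arithmetic, using only the figures given in this article; the names `VEGA_COUNTS`, `coding_fraction` and `summarize` are hypothetical and are not part of any VEGA or Ensembl software.

```python
# Gene counts as quoted in this article: (annotated genes, protein-coding genes).
# The release each figure comes from is noted in the trailing comment.
VEGA_COUNTS = {
    "zebrafish": (18454, 16588),   # September 2012 release
    "mouse": (23322, 14805),       # June 2012 release
    "pig": (2842, 2264),           # September 2012 release
    "dog": (45, 29),               # February 2005 release
    "chimpanzee": (124, 52),       # January 2012 release
    "wallaby": (193, 76),          # March 2009 release
    "gorilla": (324, 176),         # March 2009 release
}


def coding_fraction(total, coding):
    """Return the share of annotated genes that are protein coding."""
    if total <= 0 or not 0 <= coding <= total:
        raise ValueError("counts must satisfy 0 <= coding <= total and total > 0")
    return coding / total


def summarize(counts):
    """Map each species to its protein-coding percentage, rounded to 0.1."""
    return {
        species: round(100 * coding_fraction(total, coding), 1)
        for species, (total, coding) in counts.items()
    }


if __name__ == "__main__":
    # Print species from the highest to the lowest protein-coding share.
    ranked = sorted(summarize(VEGA_COUNTS).items(), key=lambda kv: -kv[1])
    for species, pct in ranked:
        print(f"{species:<11}{pct:5.1f}% protein coding")
```

Because the figures come from different releases and, for several species, cover only selected regions rather than whole genomes, the percentages describe what was annotated in VEGA rather than the gene content of each genome.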

Comparative analysis
In addition to full genomes, and unlike other browsers, VEGA also displays small finished regions of interest from genomes of other vertebrates, human haplotypes and mouse strains. Currently this comprises the finished sequence and annotation of the major histocompatibility complex (MHC) from different human haplotypes, and from dog and pig (the latter of which is currently otherwise only available in very limited form in Ensembl Pre!). Additionally there is mouse NOD (non-obese diabetic) strain annotation of IDD (insulin-dependent diabetes) candidate regions and two more pig regions.

VEGA contains comparative pairwise analysis between specific genomic regions from either different species or different haplotypes or strains. This is in contrast to Ensembl, where many all-against-all whole-genome comparisons are performed. The analysis in VEGA involves:

1. The identification of genomic alignments using LastZ.
2. Prediction of the orthologue pairs using the Ensembl gene tree pipeline. Note that although the pipeline generates phylogenetic gene trees, the limited scope of the VEGA comparative analysis means that these will necessarily be incomplete, and consequently only orthologues are shown on the website.
3. The manual identification of alleles in either different human haplotypes or mouse strains.

There are five sets of analyses:

1. The MHC region has been compared between dog, pig (two assemblies), gorilla, chimpanzee, wallaby, mouse and eight human haplotypes:
 * dog chromosome 12-MHC
 * gorilla chromosome 6-MHC
 * chimpanzee chromosome 6-MHC
 * wallaby chromosome 2-MHC
 * pig chromosome 7 on Sscrofa10.2 (24.7Mbp to 29.8Mbp)
 * pig chromosome 7-MHC
 * mouse chromosome 17 (33.3Mbp to 38.9Mbp)
 * chromosome 6 on the human reference assembly (28Mbp to 34Mbp)
 * chromosome 6 MHC region in the human COX, QBL, APD, DBB, MANN, MCF and SSTO haplotypes (full length chromosome fragments)
2. Comparisons between the LRC regions of pig, gorilla and human (nine haplotypes):
 * pig chromosome 6 (53.6Mbp to 54.0Mbp)
 * gorilla chromosome 19-LRC
 * human chromosome 19q13.4 (54.6Mbp to 55.6Mbp) on the reference assembly
 * chromosome 19 LRC region in the COX_1, COX_2, PGF_1, PGF_2, DM1A, DM1B, MC1A and MC1B haplotypes (full length chromosome fragments)
3. Insulin dependent diabetes (Idd) regions on six mouse chromosomes (1, 3, 4, 6, 11 and 17) have been compared between the C57BL/6 reference and one or more of the DIL Non-Obese Diabetic (NOD), CHORI-29 NOD, and the 129 strains. The regions of the C57BL/6 reference assembly used in these comparisons are:
 * Idd3.1: chromosome 3, clones AC117584.11 to AC115749.12
 * Idd4.1: chromosome 11, clones AL596185.12 to AL663042.5
 * Idd4.2: chromosome 11, clones AL663082.5 to AL604065.7
 * Idd4.2Q: chromosome 11, clones AL596111.7 to AL645695.18
 * Idd5.1: chromosome 1, clones AL683804.15 to AL645534.20
 * Idd5.3: chromosome 1, clones AC100180.12 to AC101699.9
 * Idd5.4: chromosome 1, clones AC123760.9 to AC109283.8
 * Idd6.1 + Idd6.2: chromosome 6, clones AC164704.4 to AC164090.3
 * Idd6.3: chromosome 6, clones AC171002.2 to AC163356.2
 * Idd9.1: chromosome 4, clones AL627093.17 to AL670959.8
 * Idd9.1M: chromosome 4, clones AL611963.24 to AL669936.12
 * Idd9.2: chromosome 4, clones CR788296.8 to AL626808.28
 * Idd9.3: chromosome 4, clones AL607078.26 to AL606967.14
 * Idd10.1: chromosome 3, clones AC167172.3 to AC131184.4
 * Idd16.1: chromosome 17, clones AC125141.4 to AC167363.3
 * Idd18.1: chromosome 3, clones AL845310.4 to AL683824.8
 * Idd18.2: chromosome 3, clones AC123057.4 to AC129293.9
4. Comparisons between three specific regions:
 * pig chromosome 17 (58.2Mbp to 67.4Mbp)
 * human chromosome 20q13.13-q13.33 (45.8Mbp to 62.4Mbp)
 * mouse chromosome 2 (168.3Mbp to 179.0Mbp)
5. Pairwise comparisons between three pairs of full length mouse and human chromosomes:
 * human chromosome 1 and mouse chromosome 4
 * human chromosome 17 and mouse chromosome 11
 * human chromosome X and mouse chromosome X
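
Several of the regions listed above are given as megabase-pair coordinate ranges, such as "mouse chromosome 17 (33.3Mbp to 38.9Mbp)". As an illustration only, the following Python sketch parses entries written in that form and reports the length of each region. The `Region` class, the `parse_region` function and the regular expression are hypothetical helpers written for this article's notation; they are not part of VEGA, Ensembl or any annotation tool.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches entries such as "pig chromosome 17 (58.2Mbp to 67.4Mbp)".
# Assumption: the species is a single lowercase word and the unit is Mb or Mbp.
REGION_PATTERN = re.compile(
    r"(?P<species>[a-z]+) chromosome (?P<chrom>\d+|X|Y).*?"
    r"\((?P<start>\d+(?:\.\d+)?)\s*Mbp? to (?P<end>\d+(?:\.\d+)?)\s*Mbp?\)"
)


@dataclass(frozen=True)
class Region:
    species: str
    chromosome: str
    start_mbp: float
    end_mbp: float

    @property
    def length_mbp(self) -> float:
        """Length of the region in Mbp, rounded to one decimal place."""
        return round(self.end_mbp - self.start_mbp, 1)


def parse_region(text: str) -> Optional[Region]:
    """Parse one list entry; return None if it has no coordinate range."""
    match = REGION_PATTERN.match(text.strip())
    if match is None:
        return None
    start, end = float(match["start"]), float(match["end"])
    if start >= end:
        raise ValueError(f"start must be below end in: {text!r}")
    return Region(match["species"], match["chrom"], start, end)


if __name__ == "__main__":
    entries = [
        "pig chromosome 17 (58.2Mbp to 67.4Mbp)",
        "human chromosome 20q13.13-q13.33 (45.8Mbp to 62.4Mbp)",
        "mouse chromosome 2 (168.3Mbp to 179.0Mbp)",
        "dog chromosome 12-MHC",  # no coordinates, so it is skipped
    ]
    for entry in entries:
        region = parse_region(entry)
        if region is not None:
            print(region.species, region.chromosome, region.length_mbp, "Mbp")
```

Entries without explicit coordinates, such as the named MHC and LRC regions or the clone-delimited Idd regions, are deliberately left unparsed, because their extents cannot be derived from the text of the list alone.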