UniGene

UniGene was an NCBI database of the transcriptome and thus, despite the name, not primarily a database of genes. Each entry was a set of transcripts that appeared to stem from the same transcription locus (i.e. a gene or expressed pseudogene). Information on protein similarities, gene expression, cDNA clones, and genomic location was included with each entry. Descriptions of the UniGene transcript-based and genome-based build procedures are available.

A detailed description of the UniGene database
The UniGene resource, developed at NCBI, clusters ESTs and other mRNA sequences, along with coding sequences (CDSs) annotated on genomic DNA, into subsets of related sequences. In most cases, each cluster is made up of sequences produced by a single gene, including alternatively spliced transcripts. However, some genes may be represented by more than one cluster. The clusters are organism specific and are currently available for human, mouse, rat, zebrafish, and cattle.

The clusters are built in several stages, using an automatic process based on special sequence comparison algorithms. First, the nucleotide sequences are searched for contaminants, such as mitochondrial, ribosomal, and vector sequences, repetitive elements, and low-complexity sequences. After a sequence is screened, it must contain at least 100 bases to be a candidate for entry into UniGene. mRNA and genomic DNA are clustered first into gene links. A second sequence comparison links ESTs to each other and to the gene links. At this stage, all clusters are "anchored" and contain either a sequence with a polyadenylation site or two ESTs labeled as coming from the 3′ end of a clone. Clone-based edges are added by linking the 5′ and 3′ ESTs that derive from the same clone. In some cases, this linking may merge clusters identified at a previous stage. Finally, unanchored ESTs and gene clusters of size 1 (which may represent rare transcripts) are compared with other UniGene clusters at lower stringency. The UniGene build is updated weekly, and the sequences that make up a cluster may change. Thus, it is not safe to refer to a UniGene cluster by its cluster identifier; instead, one should use the GenBank accession numbers of the sequences in the cluster.
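The staged build described above can be illustrated with a minimal sketch. It is not NCBI's implementation: the function name, the input structures, and the accession labels are hypothetical, and the sequence comparisons are assumed to have been run already, so that their outcome arrives as a list of similar pairs. Only the 100-base minimum and the idea of clone-based edges are taken from the text; anchoring and the lower-stringency final pass are omitted.

```python
from collections import defaultdict

MIN_LENGTH = 100  # minimum screened length stated in the text


class UnionFind:
    """Disjoint-set structure used to merge linked sequences into clusters."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def build_clusters(lengths, similarity_edges, clone_of):
    """Group sequences into clusters (hypothetical helper).

    lengths:          accession -> length after contaminant screening
    similarity_edges: pairs of accessions already judged similar
    clone_of:         EST accession -> identifier of the clone it came from
    """
    # Sequences shorter than 100 bases after screening are not candidates.
    kept = {acc for acc, n in lengths.items() if n >= MIN_LENGTH}

    uf = UnionFind()
    for acc in kept:
        uf.find(acc)  # register every candidate, including singletons

    # Sequence-comparison edges between surviving sequences.
    for a, b in similarity_edges:
        if a in kept and b in kept:
            uf.union(a, b)

    # Clone-based edges: ESTs from the same clone end up in one cluster,
    # which may merge clusters formed in the previous step.
    by_clone = defaultdict(list)
    for acc, clone in clone_of.items():
        if acc in kept:
            by_clone[clone].append(acc)
    for members in by_clone.values():
        for other in members[1:]:
            uf.union(members[0], other)

    clusters = defaultdict(set)
    for acc in kept:
        clusters[uf.find(acc)].add(acc)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical input: one mRNA, two ESTs from the same clone, one
# unrelated EST, and one sequence that is too short after screening.
lengths = {"mRNA1": 2400, "EST_A5": 450, "EST_A3": 380, "EST_B": 300, "short": 80}
similarity_edges = [("mRNA1", "EST_A5"), ("EST_B", "short")]
clone_of = {"EST_A5": "cloneA", "EST_A3": "cloneA"}

print(build_clusters(lengths, similarity_edges, clone_of))
# [['EST_A3', 'EST_A5', 'mRNA1'], ['EST_B']]
```

In this toy input the 3′ EST joins the mRNA cluster only through the clone-based edge, and the 80-base sequence is dropped before clustering.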

As of July 2000, the human subset of UniGene contained 1.7 million sequences in 82,000 clusters; 98% of these clustered sequences were ESTs, and the remaining 2% were from mRNAs or CDSs annotated on genomic DNA. These human clusters could represent fragments of up to 82,000 unique human genes, implying that many human genes are now represented in a UniGene cluster. (This number is undoubtedly an overestimate of the number of genes in the human genome, as some genes may be represented by more than one cluster.) Only 1.4% of clusters totally lack ESTs, implying that most human genes are represented by at least one EST. Conversely, it appears that the majority of human genes have been identified only by ESTs; only 16% of clusters contain either an mRNA or a CDS annotated on genomic DNA. Because fewer ESTs are available for mouse, rat, and zebrafish, the UniGene clusters are not as representative of the unique genes in the genome. Mouse UniGene contains 895,000 sequences in 88,000 clusters, and rat UniGene contains 170,000 sequences in 37,000 clusters.
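The figures quoted above can be put on a common scale with a short calculation. The sketch below uses only the rounded counts and percentages given in the text; the helper name is hypothetical, and the results are correspondingly approximate.

```python
# Sequence and cluster counts as quoted in the text (rounded figures).
builds = {
    "human": (1_700_000, 82_000),
    "mouse": (895_000, 88_000),
    "rat": (170_000, 37_000),
}


def sequences_per_cluster(n_sequences, n_clusters):
    """Mean number of sequences per cluster (hypothetical helper)."""
    return n_sequences / n_clusters


for organism, (n_seq, n_clu) in builds.items():
    mean_size = sequences_per_cluster(n_seq, n_clu)
    print(f"{organism}: {mean_size:.1f} sequences per cluster")

# Human cluster breakdown implied by the quoted percentages.
human_clusters = builds["human"][1]
with_mrna_or_cds = round(0.16 * human_clusters)   # clusters with an mRNA or CDS
without_ests = round(0.014 * human_clusters)      # clusters lacking any EST
est_only = human_clusters - with_mrna_or_cds      # clusters known from ESTs alone
print(with_mrna_or_cds, without_ests, est_only)
```

On these rounded figures the human build averages about 20.7 sequences per cluster, compared with roughly 10.2 for mouse and 4.6 for rat, and about 68,900 of the 82,000 human clusters contain ESTs only.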

A new UniGene resource, HomoloGene, includes curated and calculated orthologs and homologs for genes from human, mouse, rat, and zebrafish. Calculated orthologs and homologs are the result of nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. Homologs are identified as the best match between a UniGene cluster in one organism and a cluster in a second organism. When two sequences in different organisms are best matches to one another (a reciprocal best match), the UniGene clusters corresponding to the pair of sequences are considered putative orthologs. A special symbol indicates that UniGene clusters in three or more organisms share a mutually consistent ortholog relationship. The calculated orthologs and homologs are considered putative, since they are based only on sequence comparisons. Curated orthologs are provided by the Mouse Genome Database (MGD) at the Jackson Laboratory and the Zebrafish Information Network (ZFIN) at the University of Oregon and can also be obtained from the scientific literature.

Queries to UniGene are entered into a text box on any of the UniGene pages. Query terms can be, for example, the UniGene identifier, a gene name, a text term that is found somewhere in the UniGene record, or the accession number of an EST or gene sequence in the cluster. For example, the cluster entitled "A disintegrin and metalloprotease domain 10" that contains the sequence for human ADAM10 can be retrieved by entering ADAM10, disintegrin, AF009615 (the GenBank accession number of ADAM10), or H69859 (the GenBank accession number of an EST in the cluster). To query a specific part of the UniGene record, use the @ symbol. For example, @gene(symbol) looks for genes with the name of the symbol enclosed in the parentheses, @chr(num) searches for entries that map to chromosome num, @lib(id) returns entries in a cDNA library identified by id, and @pid(id) selects entries associated with a GenBank protein identifier id.
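The reciprocal best match criterion that HomoloGene uses for putative orthologs, described above, can be sketched as follows. This is a simplified illustration rather than the HomoloGene procedure itself: it assumes that the comparisons have already been reduced to one similarity score per pair of clusters, whereas the text describes best matches between individual sequences that are then mapped to their clusters. The function names, cluster identifiers, and scores are hypothetical.

```python
def best_matches(scores):
    """Return the best-scoring target for each query.

    scores maps (query_cluster, target_cluster) -> similarity score.
    If two targets tie, the one encountered first is kept.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best match and a is b's best match."""
    best_ab = best_matches(a_vs_b)
    best_ba = best_matches(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores between human (Hs) and mouse (Mm) clusters.
human_vs_mouse = {
    ("Hs.1", "Mm.10"): 950,
    ("Hs.1", "Mm.20"): 400,
    ("Hs.2", "Mm.10"): 700,
}
mouse_vs_human = {
    ("Mm.10", "Hs.1"): 940,
    ("Mm.20", "Hs.1"): 410,
    ("Mm.10", "Hs.2"): 690,
}

print(best_matches(human_vs_mouse))
# {'Hs.1': 'Mm.10', 'Hs.2': 'Mm.10'}
print(reciprocal_best_matches(human_vs_mouse, mouse_vs_human))
# [('Hs.1', 'Mm.10')]
```

In this toy data both human clusters have Mm.10 as their best match, so both pairs would count as homologs, but only Hs.1 and Mm.10 are best matches to one another and would be reported as putative orthologs.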

The query results page contains a list of all UniGene clusters that match the query. Each cluster is identified by an identifier, a description, and a gene symbol, if available. Cluster identifiers are prefixed with Hs for Homo sapiens, Rn for Rattus norvegicus, Mm for Mus musculus, or Dr for Danio rerio. The descriptions of UniGene clusters are taken from LocusLink, if available, or from the title of a sequence in the cluster. The UniGene report page for each cluster links to data from other NCBI resources (Fig. 12.5). At the top of the page are links to LocusLink, which provides descriptive information about genetic loci (Pruitt et al., 2000), OMIM, a catalog of human genes and genetic disorders, and HomoloGene. Next are listed similarities between the translations of DNA sequences in the cluster and protein sequences from model organisms, including human, mouse, rat, fruit fly, and worm. The subsequent section describes relevant mapping information. It is followed by "expression information," which lists the tissues from which the ESTs in the cluster have been created, along with links to the SAGE database. Sequences making up the cluster are listed next, along with a link to download these sequences.

It is important to note that clusters that contain ESTs only (i.e., no mRNAs or annotated CDSs) will be missing some of these fields, such as LocusLink, OMIM, and mRNA/Gene links. UniGene titles for such clusters, such as "EST, weakly similar to ORF2 contains a reverse transcriptase domain [H. sapiens]," are derived from the title of a characterized protein with which the translated EST sequence aligns. The cluster title might be as simple as "EST" if the ESTs share no significant similarity with characterized proteins.

Retirement of UniGene
On February 1, 2019, the NCBI announced that it was retiring the UniGene database because "reference genomes are available for most organisms with a sizable research community. Consequently, the usage of and need for UniGene has dropped significantly." Access to the UniGene builds will remain available through FTP.

Related databases

 * NCBI Gene database: NCBI database cataloging individual genes
 * HomoloGene: NCBI database that stores groups of homologous genes from different organisms