User:Kkppz/sandbox

Gene
Coiled-coil domain containing 97 (CCDC97), also known as FLJ40267 and MGC20255, is a protein-coding gene with 6 exons, located at 19q13.2 on the plus strand. Orthologs of this gene are found in mammals, reptiles, amphibians, birds, fish, and invertebrates. Transcript variant 1, which is 3329 base pairs long, encodes the longer protein isoform of 343 amino acids. CCDC97 protein isoform 1 has a molecular mass of approximately 39,000 Da.
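As a rough sanity check on the stated mass, a common rule of thumb takes the average amino acid residue to weigh about 110 Da. The sketch below applies that approximation to the 343-residue isoform. The 110 Da figure and the function name are assumptions made for illustration, not values taken from the gene record.

```python
# Rule-of-thumb average mass of one amino acid residue, in daltons.
# This is an approximation (an assumption), not a measured value for CCDC97.
AVG_RESIDUE_MASS_DA = 110


def estimate_mass_da(n_residues, avg_residue_mass=AVG_RESIDUE_MASS_DA):
    """Estimate protein mass as residue count times average residue mass."""
    return n_residues * avg_residue_mass


print(estimate_mass_da(343))  # 37730
```

The estimate of roughly 37.7 kDa is in the same range as the reported value of about 39,000 Da. The exact mass depends on the actual amino acid composition.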

Transcription and Protein
The CCDC97 gene is expressed at high levels, 2.4 times the level of the average gene. Transcription produces 5 different mRNAs: 3 alternatively spliced variants and 2 unspliced forms. The gene contains 3 non-overlapping alternative last exons and 5 alternative polyadenylation sites. Two spliced mRNAs, together with the unspliced mRNAs, are able to encode 4 good proteins, giving 4 isoforms (1 complete and 3 COOH-complete), some of which contain the coiled-coil domain-containing protein domain (DUF2052).

Evolution
Paralogs

No paralogs of CCDC97 were found on NCBI.

Orthologs

Orthologs of CCDC97 are found in most vertebrates as well as in invertebrates. 20 orthologs were collected from NCBI and compared with human CCDC97, using EMBOSS Needle for pairwise sequence alignment and TimeTree for estimated dates of divergence. As the date of divergence increases, sequence identity generally decreases, as expected: mammals (89.8%-59.0%), reptiles (51.5%-51.1%), birds (41.1%-37.7%), amphibians (48.1%-46.2%), fish (47.5%-40.7%), and invertebrates (28.2%).

Birds were the only group that did not follow this trend, which suggests that the gene has diverged substantially in birds. Amphibians and fish had similar sequence identities, with some fish showing higher identity to the human sequence than amphibians. Invertebrates are the most distantly related to humans, so the low sequence identity of 28.2% for Caenorhabditis elegans was anticipated. CCDC97 is highly conserved and is found in both vertebrates and invertebrates. It most likely arose in invertebrates around 700 million years ago, because they are the most distantly related organisms in which the protein is known to be present.
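The percent identity values discussed above come from pairwise global alignment. The following is a minimal sketch of that idea, not a reimplementation of EMBOSS Needle. It assumes a simple match/mismatch score and a linear gap penalty, whereas Needle uses a substitution matrix and affine gap penalties, so its numbers will differ. The function names and scoring values are hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Simplified scoring (an assumption); returns the two aligned strings,
    with '-' marking gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for aligning residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical aligned positions divided by alignment length, as a percentage."""
    aln_a, aln_b = global_align(a, b)
    if not aln_a:
        return 0.0
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * same / len(aln_a)


# Toy sequences, not real CCDC97 data.
print(percent_identity("ACGT", "ACT"))  # 75.0
```

Dividing by the full alignment length, gaps included, is one common convention for reporting identity. Other denominators, such as the length of the shorter sequence, give different percentages for the same alignment.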

Promoter
The promoter sequence of CCDC97 is located at chr19:41,309,673-41,310,813.
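The length of the promoter region follows from its coordinates. The sketch below assumes the interval is 1-based and inclusive at both ends, which is how genome browsers typically display positions. Under a 0-based, half-open convention the length would be one base shorter. The function names are hypothetical.

```python
def parse_region(region):
    """Split a 'chrom:start-end' string (commas allowed) into its parts."""
    chrom, span = region.split(":")
    start_text, end_text = span.split("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    return chrom, start, end


def region_length(start, end):
    """Length of a 1-based, inclusive interval (assumed convention)."""
    return end - start + 1


chrom, start, end = parse_region("chr19:41,309,673-41,310,813")
print(chrom, region_length(start, end))  # chr19 1141
```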

RNA Sequencing
CCDC97 is ubiquitously expressed across human tissue types, with a 6-fold variation between the lowest and highest expression levels. Expression is highest in the testis (RPKM 8.9). Excluding the testis, the highest levels are found in the ovary (RPKM 6.9), lymph node (RPKM 6.7), spleen (RPKM 6.2), appendix (RPKM 5.9), and endometrium (RPKM 5.3).