User:West0856/sandbox

Uncharacterized protein C1orf106, sometimes referred to as hypothetical protein LOC55765, is a protein of unknown function that in humans is encoded by the C1orf106 gene. Less common gene aliases include FLJ10901 and MGC125608.

Location
In humans, C1orf106 is located on the long arm of chromosome 1 at locus 1q32.1. It spans from 200,891,499 to 200,915,736 (24.238 kb) on the plus strand.
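The reported length follows directly from the two coordinates. A minimal Python sketch of the arithmetic, assuming the coordinates are 1-based and inclusive (a convention the text does not state; the function name is illustrative):

```python
def span_length_bp(start: int, end: int) -> int:
    """Length in base pairs of a genomic interval.

    Assumes 1-based, inclusive coordinates (an assumption, not stated in the text).
    """
    if end < start:
        raise ValueError("end must not precede start")
    return end - start + 1


length_bp = span_length_bp(200_891_499, 200_915_736)
print(f"{length_bp} bp = {length_bp / 1000:.3f} kb")
```

Under a 0-based, half-open convention the `+ 1` would be dropped and the result would be one base pair shorter.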

Gene Neighborhood
C1orf106 is flanked by G protein-coupled receptor 25 (upstream) and maestro heat-like repeat family member 3 (MROH3P), a predicted pseudogene (downstream). Ribosomal protein L34 pseudogene 6 (RPL34P6) is further upstream and kinesin family member 21B is further downstream.

Promoter
There are seven predicted promoters for C1orf106, and experimental evidence suggests that isoforms 1 and 2, the most common isoforms, are transcribed from different promoters. MatInspector, a tool available through Genomatix, was used to predict transcription factor binding sites within potential promoter regions. The transcription factors that are predicted to target the anticipated promoter for isoform 1 are expressed in a range of tissues, most commonly the urogenital system, nervous system and bone marrow. This is consistent with expression data for the C1orf106 protein, which is highly expressed in the kidney and bone marrow. A diagram of the predicted promoter region, with highlighted transcription factor binding sites, is shown to the right. The factors that are predicted to bind to the promoter region of isoform 2 differ: 12 of the top 20 predicted factors are expressed in blood cells and/or tissues of the cardiovascular system.

Expression
C1orf106 is expressed in a wide range of tissues. Expression data from GEO Profiles are shown below. The sites of highest expression are listed in the table. Expression is moderate in the placenta, prostate, testis, lung, salivary glands and dendritic cells. It is low in the brain, most immune cells, the adrenal gland, uterus, heart and adipocytes. Data from various experiments in GEO Profiles suggest that C1orf106 expression is up-regulated in several cancers, including lung, ovarian, colorectal and breast cancer.

Isoforms
Nine putative isoforms are produced from the C1orf106 gene, seven of which are predicted to encode proteins. Isoforms 1 and 2, shown below, are the most common. Isoform 1, which is the longest, is accepted as the canonical isoform. It contains 10 exons, which encode a protein that is 677 or 663 amino acids long, depending on the source. Sources that report 663 amino acids use a start codon that is 42 nucleotides (14 codons) downstream. According to NCBI, this shorter form has only been predicted computationally. The prediction may arise because the Kozak sequence surrounding the downstream start codon is more similar to the consensus Kozak sequence, as shown in the table below. Softberry was used to obtain the sequence of the predicted isoform. Isoform 2 is shorter due to a truncated N-terminus. Both isoforms have an alternative polyadenylation site.
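The relationship between the two reported protein lengths, and the idea of comparing a start-codon context with the Kozak consensus, can be sketched in Python. This is an illustration only: the consensus is written as GCCRCCATGG (R = purine), the scoring is a simple count of matching flanking positions rather than the method used for the table, and the example contexts are hypothetical, not the actual C1orf106 sequences.

```python
KOZAK_CONSENSUS = "GCCRCCATGG"  # R = A or G; the ATG occupies indices 6-8


def codons_skipped(nt_offset: int) -> int:
    """Number of codons lost when translation starts nt_offset nucleotides downstream."""
    if nt_offset % 3 != 0:
        raise ValueError("offset is not in frame")
    return nt_offset // 3


def kozak_matches(context: str) -> int:
    """Count flanking positions (0-7) that agree with the consensus.

    `context` must be 10 nucleotides with the start codon at indices 6-8.
    """
    context = context.upper().replace("U", "T")
    if len(context) != len(KOZAK_CONSENSUS) or context[6:9] != "ATG":
        raise ValueError("expected 10 nt with ATG at positions 7-9")
    score = 0
    for i, (base, cons) in enumerate(zip(context, KOZAK_CONSENSUS)):
        if 6 <= i <= 8:
            continue  # skip the start codon itself
        score += (base in "AG") if cons == "R" else (base == cons)
    return score


# A start codon 42 nt downstream removes 14 residues: 677 - 14 = 663.
print(677 - codons_skipped(42))

# Hypothetical contexts, for illustration only.
print(kozak_matches("GCCACCATGG"), kozak_matches("TTTTTTATGT"))
```

A higher count for the downstream context than for the upstream one would be in line with the explanation given above, although real start-site prediction tools weigh positions unequally.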

miRNA regulation
miRNA-24 was identified as a microRNA that could potentially target C1orf106 mRNA. The binding site, which is located in the 5' untranslated region, is shown.

General Properties
Isoform 1, diagrammed below, contains a DUF3338 domain, two low-complexity regions and a proline-rich region. The protein is rich in arginine and proline, and has a lower-than-average content of asparagine and hydrophobic amino acids, specifically phenylalanine and isoleucine. The isoelectric point is 9.58, and the molecular weight of the unmodified protein is 72.9 kDa. The protein is not predicted to have an N-terminal signal peptide, but there are predicted nuclear localization signals (NLS) and a leucine-rich nuclear export signal.
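Statements about a protein being rich or poor in particular residues come from its amino acid composition. A minimal Python sketch of that calculation follows; the sequence used is a short made-up fragment, not the C1orf106 sequence, and the function name is illustrative.

```python
from collections import Counter


def composition(seq: str) -> dict[str, float]:
    """Fraction of each amino acid (one-letter code) in a protein sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


# Hypothetical fragment, for illustration only.
toy = "MPRRPPLARR"
for aa, frac in composition(toy).items():
    print(f"{aa}: {frac:.0%}")
```

Comparing such fractions with average values for the human proteome is what allows a residue to be called over- or under-represented.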

Modifications
C1orf106 is predicted to be highly phosphorylated. Phosphorylation sites predicted by PROSITE are shown in the table below. NetPhos predictions are illustrated in the diagram. Each line points to a predicted phosphorylation site and connects to a letter that represents serine (S), threonine (T) or tyrosine (Y).
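Pattern-based predictions of this kind amount to scanning the sequence for short motifs. The Python sketch below uses two PROSITE-style patterns, [ST]-x-[RK] (protein kinase C site) and [ST]-x(2)-[DE] (casein kinase II site), translated into regular expressions. Whether these particular patterns produced the sites in the table is an assumption, and the test sequence is hypothetical.

```python
import re

# PROSITE-style motifs written as regular expressions. The lookahead lets
# overlapping matches be reported. Pattern choice is an assumption.
PATTERNS = {
    "PKC": r"(?=([ST].[RK]))",
    "CK2": r"(?=([ST]..[DE]))",
}


def scan_phospho_motifs(seq: str) -> list[tuple[str, int, str]]:
    """Return (pattern name, 1-based position of the S/T, matched motif)."""
    seq = seq.upper()
    hits = []
    for name, pattern in PATTERNS.items():
        for match in re.finditer(pattern, seq):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical peptide, for illustration only.
for hit in scan_phospho_motifs("ASPRSTEEK"):
    print(hit)
```

Short motifs like these match frequently by chance, so hits are candidates rather than confirmed modification sites.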



Structure
Coiled coils are predicted to span residues 130-160 and 200-260. The secondary structure composition was predicted to be about 60% random coil, 30% alpha helix and 10% beta sheet.

Interactions
The proteins with which the C1orf106 protein interacts are not well characterized. Text-mining evidence suggests C1orf106 may interact with the following proteins: DNAJC5G, SLC7A13, PIEZO2 and MUC19. Experimental evidence from a yeast two-hybrid screen suggests the C1orf106 protein interacts with 14-3-3 protein sigma, which is an adaptor protein.

Homology
C1orf106 is well conserved in vertebrates as shown in the table below. Sequences were retrieved from BLAST and BLAT.

A graph of the sequence identity versus the time since divergence for the asterisked entries is shown below. The colors correspond to degree of relatedness (green = closely related, purple = distantly related).
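Sequence identity for a pair of orthologs is usually computed from a pairwise alignment. A minimal Python sketch follows, assuming the two sequences are already aligned to equal length with "-" marking gaps and that identity is taken over columns without gaps; other denominators are also in use, and the text does not state which was applied. The aligned fragments are made up.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != "-" and b != "-"
    ]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Hypothetical aligned fragments, for illustration only.
print(percent_identity("MKT-LA", "MRTAL-"))
```

Plotting this value against estimated divergence time for each species gives the kind of graph described above.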

Paralogs
The proteins considered to be C1orf106 paralogs are not consistent between databases. A multiple sequence alignment (MSA) of potentially paralogous proteins was made to assess the likelihood of a truly paralogous relationship. The sequences were retrieved from a BLAST search of human proteins with the C1orf106 protein as the query. The MSA suggests the proteins share a homologous domain, DUF3338, which is found in eukaryotes. A portion of the multiple sequence alignment is shown below. Apart from the DUF3338 domain (boxed in green), there was little conservation. The DUF3338 domain does not have any extraordinary physical properties. One notable finding, however, is that each of the proteins in the MSA is predicted to have two nuclear localization signals, and all are predicted to localize to the nucleus. A comparison of the physical properties of the proteins was also conducted using SAPS and is shown in the table.



Clinical Significance
A total of 556 single nucleotide polymorphisms (SNPs) have been identified in the gene region of C1orf106, 96 of which are associated with a clinical source. Rivas et al. identified four SNPs, shown in the table below, that may be associated with inflammatory bowel disease and Crohn's disease. According to GeneCards, other disease associations may include multiple sclerosis and ulcerative colitis.