Micropeptide

Micropeptides (also referred to as microproteins) are polypeptides with a length of less than 100-150 amino acids that are encoded by short open reading frames (sORFs). In this respect, they differ from many other active small polypeptides, which are produced through the posttranslational cleavage of larger polypeptides. In terms of size, micropeptides are considerably shorter than "canonical" proteins, which have an average length of 330 and 449 amino acids in prokaryotes and eukaryotes, respectively. Micropeptides are sometimes named according to their genomic location. For example, the translated product of an upstream open reading frame (uORF) might be called a uORF-encoded peptide (uPEP). Micropeptides generally lack N-terminal signaling sequences, suggesting that they are likely to be localized to the cytoplasm. However, some micropeptides have been found in other cell compartments, as indicated by the existence of transmembrane micropeptides. Micropeptides are found in both prokaryotes and eukaryotes. The sORFs from which they are translated can be encoded in 5' UTRs, small genes, or polycistronic mRNAs. Some micropeptide-coding genes were originally mis-annotated as long non-coding RNAs (lncRNAs).

Given their small size, sORFs were originally overlooked. However, hundreds of thousands of putative micropeptides have been identified through various techniques in a multitude of organisms. Only a small fraction of these with coding potential have had their expression and function confirmed. Those that have been functionally characterized, in general, have roles in cell signaling, organogenesis, and cellular physiology. As more micropeptides are discovered so are more of their functions. One regulatory function is that of peptoswitches, which inhibit expression of downstream coding sequences by stalling ribosomes, through their direct or indirect activation by small molecules.

Identification
Various experimental techniques exist for identifying potential sORFs and their translational products. These techniques are useful only for identifying sORFs that may produce micropeptides, not for direct functional characterization.

RNA sequencing
One method for finding potential sORFs, and therefore micropeptides, is through RNA sequencing (RNA-Seq). RNA-Seq uses next-generation sequencing (NGS) to determine which RNAs are expressed in a given cell, tissue, or organism at a specific point in time. This collection of data, known as a transcriptome, can then be used as a resource for finding potential sORFs. Because sORFs shorter than 100 codons are very likely to occur by chance, further study is necessary to determine the validity of data obtained using this method.
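The scanning step described above can be illustrated with a minimal sketch. It assumes ATG-initiated ORFs on the forward strand only and a uniform random codon model; the function names, the size window, and the parameter defaults are hypothetical choices for illustration and are not taken from any published pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(transcript, min_codons=10, max_codons=100):
    """Return (start, end, frame) for each ATG-initiated ORF in the size window.

    Length is counted in coding codons, excluding the stop codon.
    Only the three forward reading frames are scanned (an assumption).
    """
    seq = transcript.upper().replace("U", "T")
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons <= max_codons:
                    hits.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return hits


def prob_no_stop(n_codons):
    """Chance that n random codons contain no stop codon.

    Assumes all 64 codons are equally likely, which real genomes violate.
    """
    return (61 / 64) ** n_codons
```

Under this simplified model the chance of an uninterrupted reading frame falls as the frame gets longer, which is one way to see why short ORFs turn up frequently in any transcriptome and need independent confirmation.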

Ribosome profiling (Ribo-Seq)
Ribosome profiling has been used to identify potential micropeptides in a growing number of organisms, including fruit flies, zebrafish, mice and humans. One method uses compounds such as harringtonine, puromycin or lactimidomycin to stop ribosomes at translation initiation sites. This indicates where active translation is taking place. Translation elongation inhibitors, such as emetine or cycloheximide, may also be used to obtain ribosome footprints, which are more likely to indicate a translated ORF. A sORF with a ribosome bound at or near it putatively encodes a micropeptide.

Mass spectrometry
Mass spectrometry (MS) is the gold standard for identifying and sequencing proteins. Using this technique, investigators are able to determine if polypeptides are, in fact, translated from a sORF.

Proteogenomic applications
Proteogenomics combines proteomics, genomics, and transcriptomics. This is important when looking for potential micropeptides. One proteogenomic method uses RNA-Seq data to create a custom database of all possible polypeptides. Liquid chromatography followed by tandem MS (LC-MS/MS) is then performed to provide sequence information for translation products. Comparing the transcriptomic and proteomic data can confirm the presence of micropeptides.
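As a rough illustration of the custom-database idea, the sketch below translates each transcript in its three forward frames using the standard genetic code and keeps methionine-initiated, stop-terminated peptides within a size window. The function names, the length limits, and the substring-based matching are hypothetical simplifications: real search engines score MS/MS spectra against the database rather than comparing already-sequenced fragments.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def build_database(transcripts, min_len=8, max_len=100):
    """Map each candidate peptide to the transcript ids that could encode it."""
    db = {}
    for tid, rna in transcripts.items():
        seq = rna.upper().replace("U", "T")
        for frame in range(3):
            protein = translate(seq[frame:])
            # The last piece after splitting has no stop codon, so drop it.
            for segment in protein.split("*")[:-1]:
                m = segment.find("M")
                if m == -1:
                    continue
                peptide = segment[m:]
                if min_len <= len(peptide) <= max_len:
                    db.setdefault(peptide, set()).add(tid)
    return db


def match_fragments(db, observed_fragments):
    """Keep candidates containing at least one observed peptide fragment."""
    return {
        pep: tids
        for pep, tids in db.items()
        if any(frag in pep for frag in observed_fragments)
    }
```

A candidate that survives the matching step is supported by both the transcriptomic and the proteomic data, which is the comparison the paragraph above describes.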

Phylogenetic conservation
Phylogenetic conservation can be a useful tool, particularly when sifting through a large database of sORFs. A sORF is more likely to encode a functional micropeptide if it is conserved across numerous species. However, this will not work for all sORFs. For example, those that are encoded by lncRNAs are less likely to be conserved, given that lncRNAs themselves do not have high sequence conservation. Further experimentation is necessary to determine whether a functional micropeptide is in fact produced.
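One simple way to rank sORFs by conservation is sketched below. It assumes the peptide sequences have already been aligned to equal length by an upstream tool, with "-" marking gaps; the scoring scheme (mean pairwise percent identity against a reference) and the function names are illustrative assumptions, not an established metric for micropeptides.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def conservation_score(reference, orthologs):
    """Mean percent identity between the reference and each aligned ortholog.

    `orthologs` maps a species label to its aligned peptide sequence.
    """
    if not orthologs:
        return 0.0
    total = sum(percent_identity(reference, seq) for seq in orthologs.values())
    return total / len(orthologs)
```

A low score does not rule out function, as the lncRNA-encoded cases mentioned above show, so a ranking like this is best treated as a way to prioritize candidates for experiments.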

Antibodies
Custom antibodies targeted to the micropeptide of interest can be useful for quantifying expression or determining intracellular localization. As is the case with most proteins, low expression may make detection difficult. The small size of the micropeptide can also make it difficult to find a suitable epitope for the antibody to target.

Tagging with CRISPR-Cas9
Genome editing can be used to add FLAG/MYC or other small peptide tags to an endogenous sORF, thus creating fusion proteins. In most cases, this method is beneficial in that it can be performed more quickly than developing a custom antibody. It is also useful for micropeptides for which no epitope can be targeted.

In vitro translation
This process entails cloning the full-length micropeptide cDNA into a plasmid containing a T7 or SP6 promoter. This method utilizes a cell-free protein-synthesizing system in the presence of 35S-methionine to produce the peptide of interest. The products can then be analyzed by gel electrophoresis and the 35S-labeled peptide is visualized using autoradiography.

Databases and repositories
Several repositories and databases have been created for both sORFs and micropeptides. A repository of small ORFs discovered by ribosome profiling can be found at sORFs.org. A repository of putative sORF-encoded peptides in Arabidopsis thaliana can be found at ARA-PEPs. A database of small proteins, especially those encoded by non-coding RNAs, can be found at SmProt.

Prokaryotic examples
To date, most micropeptides have been identified in prokaryotic organisms. While most have yet to be fully characterized, of those that have been studied, many appear to be critical to the survival of these organisms. Because of their small size, prokaryotes are particularly susceptible to changes in their environment, and as such have developed methods to ensure their existence.

Escherichia coli (E. coli)
Micropeptides expressed in E. coli exemplify bacterial environmental adaptations. Most of these have been classified into three groups: leader peptides, ribosomal proteins, and toxic proteins. Leader peptides regulate transcription and/or translation of proteins involved in amino acid metabolism when amino acids are scarce. Ribosomal proteins include L36 (rpmJ) and L34 (rpmH), two components of the 50S ribosomal subunit. Toxic proteins, such as ldrD, kill cells or inhibit growth when expressed at high levels, reducing the host cell's viability.

Salmonella enterica (S. enterica)
In S. enterica, the MgtC virulence factor is involved in adaptation to low-magnesium environments. The hydrophobic peptide MgtR binds to MgtC, causing its degradation by the FtsH protease.

Bacillus subtilis (B. subtilis)
The 46 aa Sda micropeptide, expressed by B. subtilis, represses sporulation when replication initiation is impaired. By inhibiting the histidine kinase KinA, Sda prevents the activation of the transcription factor Spo0A, which is required for sporulation.

Staphylococcus aureus (S. aureus)
In S. aureus, a group of micropeptides, 20-22 aa in length, is excreted during host infection to disrupt neutrophil membranes, causing cell lysis. These micropeptides allow the bacterium to avoid degradation by the human immune system's main defenses.

Eukaryotic examples
Micropeptides have been discovered in eukaryotic organisms from Arabidopsis thaliana to humans. They play diverse roles in tissue and organ development, as well as maintenance and function once fully developed. While many are yet to be functionally characterized, and likely more remain to be discovered, below is a summary of recently identified eukaryotic micropeptide functions.

Arabidopsis thaliana (A. thaliana)
The POLARIS (PLS) gene encodes a 36 aa micropeptide. It is necessary for proper vascular leaf patterning and cell expansion in the root. This micropeptide interacts with developmental PIN proteins to form a critical network for hormonal crosstalk between auxin, ethylene, and cytokinin.

ROTUNDIFOLIA (ROT4) in A. thaliana encodes a 53 aa peptide, which localizes to the plasma membrane of leaf cells. The mechanism of ROT4 function is not well understood, but mutants have short rounded leaves, indicating that this peptide may be important in leaf morphogenesis.

Zea mays (Z. mays)
Brick1 (Brk1) encodes a 76 aa micropeptide, which is highly conserved in both plants and animals. In Z. mays, it was found to be involved in morphogenesis of leaf epithelia by promoting multiple actin-dependent cell polarization events in the developing leaf epidermis. Zm401p10 is an 89 aa micropeptide that plays a role in normal pollen development in the tapetum. After mitosis, it is also essential for the degradation of the tapetum. Zm908p11 is a 97 aa micropeptide encoded by the Zm908 gene, which is expressed in mature pollen grains. It localizes to the cytoplasm of pollen tubes, where it aids in their growth and development.

Drosophila melanogaster (D. melanogaster)
The evolutionarily conserved polished rice (pri) gene, known as tarsal-less (tal) in D. melanogaster, is involved in epidermal differentiation. This polycistronic transcript encodes four similar peptides, which range from 11 to 32 aa in length. They function to truncate the transcription factor Shavenbaby (Svb). This converts Svb into an activator that directly regulates the expression of target effectors, including miniature (m) and shavenoid (sha), which are together responsible for trichome formation.

Danio rerio (D. rerio)
The Elabela gene (Ela) (a.k.a. Apela, Toddler) is important for embryogenesis. It is specifically expressed during late blastula and gastrula stages. During gastrulation, it is critical in promoting the internalization and animal-pole-directed movement of mesendodermal cells. After gastrulation, Ela is expressed in the lateral mesoderm, the endoderm, and the anterior and posterior notochord. Although it was annotated as a lncRNA in zebrafish, mouse, and human, the 58-aa ORF was found to be highly conserved among vertebrate species. Ela is processed by removal of its N-terminal signal peptide and then secreted into the extracellular space. Its 34-aa mature peptide serves as the first endogenous ligand to a GPCR known as the Apelin Receptor. The genetic inactivation of Ela or Aplnr in zebrafish results in heartless phenotypes.

Mus musculus (M. musculus)
Myoregulin (Mln) is encoded by a gene originally annotated as a lncRNA. Mln is expressed in all three types of skeletal muscle, and works similarly to the micropeptides phospholamban (Pln) in cardiac muscle and sarcolipin (Sln) in slow (Type I) skeletal muscle. These micropeptides interact with sarcoplasmic reticulum Ca2+-ATPase (SERCA), a membrane pump responsible for regulating Ca2+ uptake into the sarcoplasmic reticulum (SR). By inhibiting Ca2+ uptake into the SR, they slow muscle relaxation. Similarly, the endoregulin (ELN) and another-regulin (ALN) genes code for transmembrane micropeptides that contain the SERCA binding motif and are conserved in mammals.

Myomixer (Mymx), a muscle-specific peptide 84 aa in length, is encoded by the gene Gm7325. It plays a role during embryogenesis in myoblast fusion and skeletal muscle formation. It localizes to the plasma membrane, associating with a fusogenic membrane protein, Myomaker (Mymk). In humans, the gene encoding Mymx is annotated as uncharacterized LOC101929726. Orthologs are found in the turtle, frog and fish genomes as well.

Homo sapiens (H. sapiens)
In humans, NoBody (non-annotated P-body dissociating polypeptide), a 68 aa micropeptide, was discovered in the long intervening noncoding RNA (lincRNA) LINC01420. It has high sequence conservation among mammals, and localizes to P-bodies. It enriches proteins associated with 5’ mRNA decapping. It is thought to interact directly with Enhancer of mRNA Decapping 4 (EDC4).

ELABELA (ELA) (a.k.a. APELA) is an endogenous hormone that is secreted as a 32 amino acid micropeptide by human embryonic stem cells. It is essential for maintaining the self-renewal and pluripotency of human embryonic stem cells. It signals in an autocrine fashion through the PI3K/AKT pathway via an as yet unidentified cell surface receptor. In differentiating mesendodermal cells, ELA binds to, and signals via, APLNR, a GPCR which can also respond to the hormonal peptide APLN.

The CYREN gene, which is conserved in mammals, is predicted to produce three micropeptides through alternative splicing. MRI-1 was previously found to be a modulator of retrovirus infection. The second predicted micropeptide, MRI-2, may be important in non-homologous end joining (NHEJ) of DNA double-strand breaks. In co-immunoprecipitation experiments, MRI-2 bound to Ku70 and Ku80, two subunits of Ku, which plays a major role in the NHEJ pathway.

The 24 amino acid micropeptide, Humanin (HN), interacts with the apoptosis-inducing protein Bcl2-associated X protein (Bax). In its active state, Bax undergoes a conformational change which exposes membrane-targeting domains. This causes it to move from the cytosol to the mitochondrial membrane, where it inserts and releases apoptogenic proteins such as cytochrome c. By interacting with Bax, HN prevents Bax targeting of the mitochondria, thereby blocking apoptosis.

A micropeptide of 90 aa, ‘Small Regulatory Polypeptide of Amino Acid Response’ or SPAAR, was found to be encoded in the lncRNA LINC00961. It is conserved between human and mouse, and localizes to the late endosome/lysosome. SPAAR interacts with four subunits of the v-ATPase complex, inhibiting mTORC1 translocation to the lysosomal surface, where it is activated. Down-regulation of this micropeptide enables mTORC1 activation by amino acid stimulation, promoting muscle regeneration.