User:2601:14A:C101:1C88:6979:F44D:2445:3B85/sandbox

MMseqs2 (Many-against-Many sequence searching) is an open-source software suite (GPLv3 licensed) for fast similarity searches and clustering of protein sequences. MMseqs2 can compare a set of query protein sequences with a large database of protein sequences and finds, for each query protein, those database proteins with similar sequences. Such sequence similarity searches are widely used in life science research to infer the functions and structures of the query proteins from those of similar proteins in the database. Often, the sensitivity of search tools is insufficient to find a similar sequence with annotated function or known structure, so sensitivity, and not only speed, is important in practice. In October 2017, MMseqs2 was compared to state-of-the-art fast sequence search methods and achieved the best combination of search sensitivity and search speed (see figure 1). In its iterative sequence profile search mode, MMseqs2 has increased sensitivity (50%-80%) at similar speed. Its sensitivity was up to 15% higher than that of PSI-BLAST at 400 times higher search speed.

MMseqs2 is the main utility used to create the bioinformatics resource Uniclust, which is the input to the homology and structure prediction method HH-suite.

Background
Owing to the rapidly falling cost of next-generation sequencing, public databases are filling up with protein sequences of unknown function, in particular from metagenomic sequencing projects. Determining the function or structure of a protein experimentally is very time-consuming and costly (see Protein methods). Therefore, the functions and structures of most proteins are predicted from their amino acid sequence alone. For that purpose, the protein sequence is compared to the sequences of other proteins in public databases. If the sequence similarity search turns up a protein sequence so similar that the similarity is unlikely to have arisen by chance, then the two proteins are likely to be evolutionarily related ("homologous"). In that case, they are likely to still share similar structures and functions. Because they are fast and cost virtually nothing, sequence searches have become a central tool for research in biology and molecular medicine.

Overview of MMseqs2
MMseqs2 is an open-source software suite to search and cluster terabyte-sized protein sequence sets. In its iterative profile search mode, MMseqs2 detects similar sequences with a sensitivity beyond that of the popular BLAST and PSI-BLAST search tools at 400 times their speed.



Compared to its predecessor MMseqs, MMseqs2 is more sensitive, supports iterative profile-to-sequence searches and sequence-to-profile searches, and offers enhanced functionality through new utilities.

MMseqs2 can run on multiple cores and servers, scaling almost linearly (Supplementary Figure 2). It can also split and distribute large query or target databases automatically across several compute servers. This allows users to analyse databases with billions of sequences using relatively modest computing resources.

The MMseqs2 suite contains four main tools (workflows) for common searching and clustering tasks. These tools are bash workflows composed of some of the 90 utility tools in MMseqs2 and its three core tools for sequence prefiltering (mmseqs prefilter), local sequence alignment (mmseqs align), and clustering (mmseqs clust). This design gives expert users the flexibility to write their own customised workflows as simple bash scripts.
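As a rough illustration of how a workflow can be chained from the three core tools, the sketch below builds the command lines for a prefilter, align, clust sequence. It is an assumption-laden outline, not one of the shipped workflows: the database names (seqDB and the work prefix) are hypothetical, the sequence database is assumed to exist already, and the positional argument order reflects a reading of the MMseqs2 user guide that should be checked against the installed version before use.

```python
import subprocess


def clustering_workflow(seq_db, work_prefix):
    """Build the argv lists for a prefilter -> align -> clust chain.

    seq_db and work_prefix are hypothetical names chosen for illustration.
    The argument order is an assumption to verify against the user guide.
    """
    pref = f"{work_prefix}_pref"  # prefilter hits
    aln = f"{work_prefix}_aln"    # alignments of the prefilter hits
    clu = f"{work_prefix}_clu"    # final clustering
    return [
        # compare the sequence set with itself
        ["mmseqs", "prefilter", seq_db, seq_db, pref],
        # align only the pairs that passed the prefilter
        ["mmseqs", "align", seq_db, seq_db, pref, aln],
        # cluster using the alignment results as the similarity graph
        ["mmseqs", "clust", seq_db, aln, clu],
    ]


def run_workflow(commands, dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    run_workflow(clustering_workflow("seqDB", "tmp/run"))
```

Each step consumes the output of the previous one, which is what makes it practical to swap a stage or insert additional utility tools between stages.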

The prefilter core tool computes the similarities between all sequences in the query database and all sequences in a target database using a k-mer matching stage followed by an ungapped alignment. The align core tool implements a vectorized Smith-Waterman alignment of all sequences that pass a cut-off for the ungapped alignment score in the prefilter tool.
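The two-stage idea (a cheap k-mer filter, then a full local alignment only for the survivors) can be sketched in a few lines of Python. This is a simplified illustration and not the MMseqs2 implementation: it counts exact shared k-mers, skips the ungapped alignment stage, and uses a plain (non-vectorized) Smith-Waterman with a linear gap penalty and a match/mismatch score instead of a substitution matrix. All function names, parameters and scores are assumptions made for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(targets, k):
    """Map each k-mer to the set of target ids that contain it."""
    index = defaultdict(set)
    for tid, seq in targets.items():
        for km in kmers(seq, k):
            index[km].add(tid)
    return index


def prefilter(query, index, k, min_shared):
    """Return ids of targets sharing at least min_shared distinct k-mers."""
    counts = defaultdict(int)
    for km in kmers(query, k):
        for tid in index.get(km, ()):
            counts[tid] += 1
    return sorted(tid for tid, c in counts.items() if c >= min_shared)


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # a local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, targets, k=3, min_shared=2):
    """Align the query only against targets that pass the k-mer filter."""
    index = build_index(targets, k)
    hits = prefilter(query, index, k, min_shared)
    return {tid: smith_waterman(query, targets[tid]) for tid in hits}
```

The point of the design is that the quadratic-time alignment runs only on the small fraction of target sequences that share enough k-mers with the query.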

The clustering core tool can cluster protein sequence sets into groups of similar sequences. It takes as input the similarity graph obtained from the comparison of the sequence set with itself in the prefilter and align modules. Linclust is an independent workflow to cluster protein sequences in linear time. It is less sensitive but orders of magnitude faster than the mmseqs cluster workflow. The mmseqs cluster update workflow can efficiently update an existing sequence clustering by adding new sequences and removing deprecated ones without the need to compare all sequences with all others.
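To make the step from a similarity graph to clusters concrete, here is a minimal sketch of one possible strategy: a greedy heuristic that repeatedly picks the sequence with the most still-unassigned neighbours as a cluster representative. The text above does not say which algorithm the clust tool uses, so this choice, the function name and the input format (a list of similar pairs) are all assumptions made for illustration.

```python
def greedy_cluster(nodes, edges):
    """Greedy clustering of a similarity graph.

    nodes: iterable of sequence ids.
    edges: iterable of (a, b) pairs meaning a and b are similar.
    Returns a dict mapping each representative to its sorted members.
    """
    # every sequence is considered similar to itself
    neighbours = {n: {n} for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    unassigned = set(neighbours)
    clusters = {}
    while unassigned:
        # choose the sequence covering the most unassigned sequences;
        # sorting first makes tie-breaking deterministic
        rep = max(sorted(unassigned),
                  key=lambda n: len(neighbours[n] & unassigned))
        members = neighbours[rep] & unassigned
        clusters[rep] = sorted(members)
        unassigned -= members
    return clusters


if __name__ == "__main__":
    graph_edges = [("A", "B"), ("A", "C"), ("D", "E")]
    print(greedy_cluster(["A", "B", "C", "D", "E", "F"], graph_edges))
```

In this toy graph, A becomes the representative of A, B and C, D represents D and E, and the unconnected sequence F forms a cluster of its own.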

MMseqs2 as webserver and desktop application
MMseqs2 searches are also available via web servers or a desktop application. This enables non-experts to perform fast searches. The web server is hosted at the Max Planck Institute for Biophysical Chemistry and supports searches against UniProt, Swiss-Prot, Pfam and the Protein Data Bank. The server can perform BLASTX-like translated, profile and protein sequence searches.

Another MMseqs2 web service is available as part of the toolkit of the Max Planck Institute for Developmental Biology, providing an interface to cluster protein sequences.

Application in metagenomics
Due to its high speed and its ability to process billions of sequences in one go, MMseqs2 is employed to improve sequence annotation and analysis in the fast-growing field of environmental genomics, or metagenomics. In metagenomics, genetic sequences from microbes and viruses are sampled from the environment (human intestine, skin, soil, oceans, sewage, etc.) and sequenced directly, without the need for prior cultivation of the microbes. As the cost of next-generation sequencing drops quickly, metagenomics is becoming ever more powerful, while sequence sets grow larger and more costly to analyse computationally.