User:GcentaPD/sandbox

The human microbiota is the aggregate of microorganisms that resides on or within any of a number of human tissues and biofluids, including the skin, mammary glands, placenta, seminal fluid, uterus, ovarian follicles, lung, saliva, oral mucosa, conjunctiva, and the biliary and gastrointestinal tracts. These microorganisms include bacteria, archaea, fungi, protists and viruses. Though micro-animals can also live on the human body, they are typically excluded from this definition. The human microbiome refers specifically to the collective genomes of resident microorganisms.[1]

Humans are colonized by many microorganisms; the traditional estimate is that the average human body is inhabited by ten times as many non-human cells as human cells, but more recent estimates have lowered that ratio to 3:1 or even to approximately the same number.[2][3][4][5] Some microorganisms that colonize humans are commensal, meaning they co-exist without harming humans; others have a mutualistic relationship with their human hosts.[1]:700[6] Conversely, some non-pathogenic microorganisms can harm human hosts via the metabolites they produce, like trimethylamine, which the human body converts to trimethylamine N-oxide via FMO3-mediated oxidation.[7][8] Certain microorganisms perform tasks that are known to be useful to the human host but the role of most of them is not well understood. Those that are expected to be present, and that under normal circumstances do not cause disease, are sometimes deemed normal flora or normal microbiota.[1]

The Human Microbiome Project set out to sequence the genome of the human microbiota, focusing particularly on the microbiota that normally inhabit the skin, mouth, nose, digestive tract, and vagina.[1] It reached a milestone in 2012 when it published its initial results.[9]

Terminology
Though these microorganisms are widely known as flora or microflora, that name is technically a misnomer: the word root flora pertains to plants, whereas biota refers to the total collection of organisms in a particular ecosystem. The more appropriate term microbiota has recently come into use, though it has not eclipsed the entrenched use and recognition of flora with regard to bacteria and other microorganisms. Both terms are used in the literature.[6]

Relative numbers
As of 2014, it was often reported in popular media and in the scientific literature that there are about 10 times as many microbial cells in the human body as there are human cells; this figure was based on estimates that the human microbiome includes around 100 trillion bacterial cells and that an adult human typically has around 10 trillion human cells.[2] In 2014, the American Academy of Microbiology published a FAQ that emphasized that the number of microbial cells and the number of human cells are both estimates, and noted that recent research had arrived at a new estimate of the number of human cells – approximately 37.2 trillion, meaning that the ratio of microbial-to-human cells, if the original estimate of 100 trillion bacterial cells is correct, is closer to 3:1.[2][3] In 2016, another group published a new estimate of the ratio being roughly 1:1 (1.3:1, with "an uncertainty of 25% and a variation of 53% over the population of standard 70-kg males").[4][5]
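The arithmetic behind the first two ratios can be checked directly. The following is a minimal sketch that uses only the cell-count estimates quoted above; the constant and function names are illustrative, not taken from any cited source.

```python
# Cell-count estimates quoted in the text (all of them are rough estimates).
MICROBIAL_CELLS = 100e12    # about 100 trillion bacterial cells
HUMAN_CELLS_OLD = 10e12     # older estimate: about 10 trillion human cells
HUMAN_CELLS_2014 = 37.2e12  # revised estimate: about 37.2 trillion human cells


def microbe_to_human_ratio(microbial_cells: float, human_cells: float) -> float:
    """Return the number of microbial cells per human cell."""
    return microbial_cells / human_cells


old_ratio = microbe_to_human_ratio(MICROBIAL_CELLS, HUMAN_CELLS_OLD)
new_ratio = microbe_to_human_ratio(MICROBIAL_CELLS, HUMAN_CELLS_2014)

print(f"traditional estimate: {old_ratio:.1f}:1")
print(f"revised estimate:     {new_ratio:.1f}:1")
```

With the revised human cell count the ratio comes out at roughly 2.7:1, which is why it is described as "closer to 3:1" rather than 10:1.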

Study
Main article: Human Microbiome Project

Figure: Flowchart illustrating how the human microbiome is studied on the DNA level.

The problem of elucidating the human microbiome is essentially one of identifying the members of a microbial community, which includes bacteria, eukaryotes, and viruses.[10] This is done primarily using DNA-based studies, though RNA-, protein- and metabolite-based studies are also performed.[10][11] DNA-based microbiome studies can typically be categorized as either targeted amplicon studies or, more recently, shotgun metagenomic studies. The former focus on specific known marker genes and are primarily informative taxonomically, while the latter take a whole-metagenome approach that can also be used to study the functional potential of the community.[10] One challenge that is present in human microbiome studies, but not in other metagenomic studies, is to avoid including the host DNA in the study.[12]

Aside from simply elucidating the composition of the human microbiome, one of the major questions involving the human microbiome is whether there is a "core", that is, whether there is a subset of the community that is shared among most humans.[13][14] If there is a core, then it would be possible to associate certain community compositions with disease states, which is one of the goals of the Human Microbiome Project. It is known that the human microbiome (such as the gut microbiota) is highly variable both within a single subject and among different individuals, a phenomenon which is also observed in mice.[6]

On 13 June 2012, a major milestone of the Human Microbiome Project (HMP) was announced by the NIH director Francis Collins.[9] The announcement was accompanied by a series of coordinated articles published on the same day in Nature[15][16] and several journals of the Public Library of Science (PLoS). By mapping the normal microbial make-up of healthy humans using genome sequencing techniques, the researchers of the HMP have created a reference database and established the boundaries of normal microbial variation in humans. From 242 healthy U.S. volunteers, more than 5,000 samples were collected from 15 (men) to 18 (women) body sites such as the mouth, nose, skin, lower intestine (stool), and vagina. All the DNA, human and microbial, was analyzed with DNA sequencing machines. The microbial genome data were extracted by identifying the bacteria-specific ribosomal RNA, 16S rRNA. The researchers calculated that more than 10,000 microbial species occupy the human ecosystem, and they have identified 81–99% of the genera.

Shotgun sequencing
Communities of bacteria, archaea and viruses are frequently difficult to culture in the laboratory, so sequencing technologies can also be applied in metagenomics. Complete knowledge of the functions and characteristics of specific microbial strains offers great potential for therapeutic discovery and human health.

Collection of samples and DNA extraction
The main aims are to collect an amount of microbial biomass that is sufficient for sequencing and to minimize sample contamination; for this reason, enrichment techniques can be used. In particular, the DNA extraction method must work well for every bacterial strain, so that the result is not biased toward the genomes of strains that are easy to lyse. Mechanical lysis is usually preferred to chemical lysis, although bead beating may result in DNA loss when preparing the library.

Preparation of the library and sequencing
The most widely used platforms are Illumina, Ion Torrent, Oxford Nanopore MinION and Pacific Biosciences Sequel; the Illumina platform is considered the most appealing option because of its wide availability, high output and accuracy. There is no established guidance on the correct amount of sample to use.

Metagenome assembly
Assembly is performed de novo, which presents some difficulties. The coverage of each genome depends on its abundance in the community; low-abundance genomes may be assembled in fragments if the sequencing depth is not sufficient to avoid gaps. Metagenome-specific assemblers help to address this, since the sequencing depth must be increased as far as possible when hundreds of strains are present.
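The dependence of coverage on abundance can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that reads are drawn in proportion to each genome's share of the DNA in the sample, so that expected coverage is the total number of sequenced bases multiplied by that share and divided by the genome size. The function names and the example numbers are hypothetical and are not taken from any particular study or assembler.

```python
def expected_coverage(total_bases: float, read_fraction: float, genome_size: float) -> float:
    """Expected mean coverage of one genome in a metagenome.

    total_bases   -- total number of bases sequenced for the whole sample
    read_fraction -- assumed share of the reads that come from this genome
    genome_size   -- length of the genome in bases
    """
    return total_bases * read_fraction / genome_size


def bases_needed(target_coverage: float, read_fraction: float, genome_size: float) -> float:
    """Total sequencing output required to reach a target coverage for one genome."""
    return target_coverage * genome_size / read_fraction


# Hypothetical sample: 10 Gb of sequence, two 5 Mb genomes with different abundances.
TOTAL_BASES = 10e9
GENOME_SIZE = 5e6

abundant = expected_coverage(TOTAL_BASES, read_fraction=0.2, genome_size=GENOME_SIZE)
rare = expected_coverage(TOTAL_BASES, read_fraction=0.0005, genome_size=GENOME_SIZE)
print(f"abundant genome: about {abundant:.0f}x coverage")
print(f"rare genome:     about {rare:.0f}x coverage")

# Output needed to bring the rare genome up to 20x coverage.
required = bases_needed(20, read_fraction=0.0005, genome_size=GENOME_SIZE)
print(f"bases needed for 20x on the rare genome: {required:.2e}")
```

Under these assumptions the rare genome receives about 1x coverage from the same run that covers the abundant genome about 400 times, which is the situation in which gaps and fragmented assemblies arise.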

Contig binning
Neither the genome from which each contig derives nor the number of genomes present in the sample is known a priori; the aim of this step is to divide the contigs by species. The methods used for this analysis can be either supervised (relying on a database of known sequences) or unsupervised (searching directly for groups of contigs in the collected data). Both kinds of method require a metric that scores the similarity between a given contig and the group to which it may be assigned, and an algorithm that converts these similarities into group allocations.
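The two ingredients named above, a similarity metric and an allocation rule, can be sketched for the supervised case. The example below is a toy illustration only: it assumes k-mer composition as the feature, cosine similarity as the metric, and "assign to the most similar reference" as the allocation rule. The function names, the reference labels and the sequences are all hypothetical, and real binning tools use longer k-mers, coverage information and more robust statistics.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq: str, k: int = 2) -> list[float]:
    """Return the relative frequency of every possible k-mer over A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Similarity metric: cosine of the angle between two profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_bins(contigs: dict[str, str], references: dict[str, str], k: int = 2) -> dict[str, str]:
    """Allocation rule: put each contig in the bin of its most similar reference."""
    ref_profiles = {name: kmer_profile(seq, k) for name, seq in references.items()}
    bins = {}
    for contig_id, seq in contigs.items():
        profile = kmer_profile(seq, k)
        bins[contig_id] = max(
            ref_profiles,
            key=lambda name: cosine_similarity(profile, ref_profiles[name]),
        )
    return bins


# Toy "database" of known sequences and two contigs to be binned.
references = {
    "AT_rich": "ATATTTAAATATTATAAATT",
    "GC_rich": "GCGCCGGGCCGCGGCCGCGC",
}
contigs = {"c1": "TTATAAATATTA", "c2": "CGGCGCCGCGGG"}

bins = assign_bins(contigs, references)
print(bins)
```

An unsupervised method would replace the reference profiles with groups discovered in the data, for example by clustering the contig profiles, while keeping the same metric-plus-allocation structure.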

Analysis after the processing
Statistical analysis is essential to validate the results (ANOVA can be used to assess the size of the differences between groups); when it is paired with graphical tools, the outcome is easy to visualize and understand.
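As an illustration of how ANOVA compares groups, the sketch below computes the one-way ANOVA F statistic, the ratio of between-group to within-group variance, in plain Python. The function name and the measurements are hypothetical; in practice a statistics package would be used, and it would also report a p-value.

```python
def one_way_anova_f(groups: list[list[float]]) -> float:
    """Return the one-way ANOVA F statistic for two or more groups of values."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    n_groups = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within


# Hypothetical measurements (for example, a diversity score) in three sample groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

A large F value indicates that the differences between the group means are large relative to the spread within each group.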

Once a metagenome is assembled, it is possible to infer the functional potential of the microbiome. The computational challenges of this type of analysis are greater than for single genomes, because metagenome assemblies are usually of poorer quality and many of the recovered genes are incomplete or fragmented. After the gene identification step, the data can be used for functional annotation by means of multiple alignment of the target genes against ortholog databases.

Marker gene analysis
Marker gene analysis is a technique that uses primers to target a specific genetic region, making it possible to determine microbial phylogenies. The targeted region contains a highly variable segment, which allows detailed identification, flanked by conserved regions, which serve as binding sites for the primers used in PCR. The main gene used to characterize bacteria and archaea is the 16S rRNA gene, while fungal identification is based on the internal transcribed spacer (ITS). The technique is fast and relatively inexpensive and yields a low-resolution classification of a microbial sample; it is well suited to samples that may be contaminated by host DNA. Primer affinity varies among DNA sequences, which may bias the amplification reaction; low-abundance samples are particularly susceptible to overamplification errors, because contaminating microorganisms become over-represented as the number of PCR cycles increases. Optimizing primer selection can help to reduce such errors, although it requires complete knowledge of the microorganisms present in the sample and of their relative abundances.

Marker gene analysis can be influenced by the choice of primers; in this kind of analysis it is desirable to use a well-validated protocol (such as the one used in the Earth Microbiome Project). The first step in a marker gene amplicon analysis is to remove sequencing errors; many sequencing platforms are very reliable, but most of the apparent sequence diversity is still due to errors introduced during sequencing. One approach to reducing this effect is to cluster sequences into operational taxonomic units (OTUs): this process consolidates similar sequences (a 97% similarity threshold is usually adopted) into a single feature that can be used in further analysis steps. This method, however, discards SNPs, because sequences that differ only by them are clustered into a single OTU. Another approach is oligotyping, which uses position-specific information from 16S rRNA sequencing to detect small nucleotide variations and to discriminate between closely related but distinct taxa. These methods output a table of DNA sequences and the counts of each sequence per sample, rather than OTUs.
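The effect of OTU clustering on single-nucleotide variants can be shown with a small example. The sketch below implements a simplified greedy, centroid-based clustering at a 97% identity threshold. It assumes equal-length, pre-aligned reads so that identity can be computed position by position; the function names and the toy sequences are hypothetical, and production tools use proper sequence alignment.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(seq_counts: dict[str, int], threshold: float = 0.97) -> dict[str, int]:
    """Greedy clustering: the most abundant sequences become OTU centroids.

    Each sequence joins the first centroid it matches at or above the
    threshold; otherwise it starts a new OTU. Returns centroid -> total count.
    """
    otus: dict[str, int] = {}
    for seq, count in sorted(seq_counts.items(), key=lambda kv: -kv[1]):
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += count
                break
        else:
            otus[seq] = count
    return otus


# Toy 40-base reads: a dominant sequence, a one-base (SNP) variant, and an unrelated one.
base = "ACGT" * 10
snp_variant = base[:5] + "A" + base[6:]  # differs from base at a single position
unrelated = "TTGACCAGGT" * 4

seq_counts = {base: 100, snp_variant: 3, unrelated: 40}
otus = cluster_otus(seq_counts)
print(len(otus), "OTUs with counts", sorted(otus.values(), reverse=True))
```

The one-base variant is 97.5% identical to the dominant sequence, so it is absorbed into the same OTU and its counts are added to that OTU's total. This is the loss of SNP-level information described above.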

Another important step in the analysis is to assign a taxonomic name to the microbial sequences in the data. This can be done using machine learning approaches, which can reach a genus-level accuracy of about 80%. Other popular analysis packages support taxonomic classification using exact matches to reference databases; these should provide greater specificity, but poor sensitivity. Unclassified microorganisms should be further checked for organelle sequences.

Phylogenetic analysis
Many methods that rely on phylogenetic inference use the 16S rRNA gene for archaea and bacteria and the 18S rRNA gene for eukaryotes. Phylogenetic comparative methods (PCS) are based on the comparison of multiple traits among microorganisms; the underlying principle is that the more closely organisms are related, the more traits they share. PCS are usually coupled with phylogenetic generalized least squares (PGLS) or other statistical analyses to obtain more significant results. Ancestral state reconstruction is used in microbiome studies to impute trait values for taxa whose traits are unknown. This is commonly performed with PICRUSt, which relies on available databases. Phylogenetic variables are chosen by researchers according to the type of study: by selecting variables that carry significant biological information, it is possible to reduce the dimensionality of the data to be analysed.

Phylogeny-aware distances are usually computed with UniFrac or similar tools, such as Sørensen's index or Rao's D, to quantify the differences between communities. All these methods are negatively affected by horizontal gene transfer (HGT), since it can generate errors and lead to distant species being treated as correlated. There are different ways to reduce the negative impact of HGT, such as using multiple genes or using computational tools to assess the probability of putative HGT events.
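To make the idea of a phylogeny-aware distance concrete, the sketch below compares two toy communities in two ways: with the classic taxon-based Sørensen dissimilarity, which ignores the tree and is shown only for contrast, and with unweighted UniFrac, computed as the fraction of observed branch length that leads to members of only one community. The tree, the taxon names and the function names are hypothetical, and the tree is supplied as a hand-written list of branches rather than parsed from a file.

```python
def sorensen_dissimilarity(a: set[str], b: set[str]) -> float:
    """Taxon-based dissimilarity: 1 - 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 1 - 2 * len(a & b) / (len(a) + len(b))


def unweighted_unifrac(branches: list[tuple[float, frozenset[str]]],
                       a: set[str], b: set[str]) -> float:
    """Fraction of observed branch length unique to one of the two communities.

    Each branch is given as (length, set of tip taxa descending from it).
    Branches with no descendants in either community are ignored.
    """
    unique = shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & a)
        in_b = bool(tips & b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    total = unique + shared
    return unique / total if total else 0.0


# Toy tree ((A:1,B:1):2,(C:1,D:1):2) written out as branches.
branches = [
    (1.0, frozenset({"A"})),
    (1.0, frozenset({"B"})),
    (2.0, frozenset({"A", "B"})),
    (1.0, frozenset({"C"})),
    (1.0, frozenset({"D"})),
    (2.0, frozenset({"C", "D"})),
]
community_1 = {"A", "B"}
community_2 = {"B", "C"}

sorensen = sorensen_dissimilarity(community_1, community_2)
unifrac = unweighted_unifrac(branches, community_1, community_2)
print(f"Sorensen dissimilarity: {sorensen:.3f}")
print(f"unweighted UniFrac:     {unifrac:.3f}")
```

The two measures differ because UniFrac also counts the long branch leading to taxon C as unique to the second community, whereas the taxon-based index treats every taxon as equally distinct.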