Earth Microbiome Project

The Earth Microbiome Project (EMP) was an initiative founded by Janet Jansson, Jack Gilbert and Rob Knight in 2010 to collect natural samples and to analyze microbial life around the globe.

The EMP set out to process up to 200,000 samples in different biomes, creating a database of microbes on Earth to characterize environments and ecosystems by microbial composition and interaction.

The EMP website has not been updated in years and the project is believed to be closed.

Actors
The project was launched in 2010. As of January 2018, it listed 161 institutions, all of them universities or university-affiliated institutions except for IBM Research and the Atlanta Zoo. Funding has come from the John Templeton Foundation, the W. M. Keck Foundation, the Argonne National Laboratory of the U.S. Department of Energy, the Australian Research Council, the Tula Foundation, and the Samuel Lawrence Foundation. Companies that have provided in-kind support include MO BIO Laboratories, Luca Technologies, Eppendorf, Boreal Genomics, Illumina, Roche and Integrated DNA Technologies.

Goals
The primary goal of the Earth Microbiome Project (EMP) has been to survey microbial composition in many environments across the planet, across time as well as space, using a standard set of protocols. Standardized protocols reduce the variation and bias in analytical pipelines that complicate the comparison of microbial community structures.

Another important goal is to determine how the reconstruction of microbial communities is affected by analytic biases. The rate of technological advancement is rapid, and it is necessary to understand how data using updated protocols will compare with data collected using earlier techniques. Information from this project will be archived in a database to facilitate analysis. Other outputs will include a global atlas of protein function and a catalog of reassembled genomes classified by their taxonomic distributions.

Methods
Standard protocols for sampling, DNA extraction, 16S rRNA amplification, 18S rRNA amplification, and "shotgun" metagenomics have been developed or are under development.

Sample collection
Samples will be collected using appropriate methods from various environments, including the deep ocean, freshwater lakes, desert sand, and soil. Standardized collection protocols will be used when possible, so that the results are comparable. Because microbes from natural samples cannot always be cultured, metagenomic methods will be employed to sequence all the DNA or RNA in a sample in a culture-independent fashion.

Wet lab
The wet lab must perform a series of procedures to select and purify the microbial portion of the samples. The purification process varies according to the type of sample: DNA will be extracted from soil particles, or microbes will be concentrated using filtration techniques. In addition, various amplification techniques may be used to increase DNA yield. For example, some researchers prefer multiple displacement amplification, which is not PCR-based. To avoid bias, DNA extraction, the choice of primers, and PCR all need to follow carefully standardized protocols.

Sequencing
Researchers can sequence a metagenomic sample using two main approaches, depending on the biological question. To identify the types and abundances of organisms present, the preferred approach is to target and amplify a specific gene that is highly conserved among the species of interest, often the 16S ribosomal RNA gene for bacteria and the 18S ribosomal RNA gene for protists. Sequencing the targeted gene to great depth ("deep sequencing") allows rare species to be identified in a sample. However, this approach will not enable assembly of any whole genomes, nor will it provide information on how organisms may interact with each other. The second approach is shotgun metagenomics, in which all the DNA in the sample is sheared and the fragments sequenced. In principle, this approach allows for the assembly of whole microbial genomes and inference of metabolic relationships. However, if most microbes in a given environment are uncharacterized, de novo assembly will be computationally expensive.

Data analysis
EMP proposes to standardize the bioinformatics aspects of sample processing.

Data analysis usually includes the following steps: 1) Data clean-up, a preliminary step that removes reads with low quality scores and any sequences containing "N" or other ambiguous nucleotides; and 2) Taxonomy assignment, which is usually done using tools such as BLAST or RDP. Very often, novel sequences are discovered that cannot be mapped to existing taxonomy. In this case, taxonomy is derived from a phylogenetic tree built from the novel sequences and a pool of closely related known sequences.
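The clean-up step can be illustrated with a minimal sketch. It assumes reads are available as (identifier, sequence, quality string) tuples with Phred+33 quality encoding, and it uses a mean-quality threshold of 20 chosen purely for illustration; the function names and the threshold are hypothetical and are not taken from any EMP pipeline.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding is assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores) if scores else 0.0


def clean_reads(reads, min_mean_quality=20.0):
    """Keep reads made only of A/C/G/T whose mean quality meets the threshold.

    `reads` is an iterable of (identifier, sequence, quality_string) tuples.
    The default threshold is an illustrative assumption, not an EMP setting.
    """
    kept = []
    for identifier, sequence, quality in reads:
        if set(sequence.upper()) - set("ACGT"):
            continue  # contains "N" or another ambiguous nucleotide code
        if mean_quality(quality) < min_mean_quality:
            continue  # low-quality read
        kept.append((identifier, sequence, quality))
    return kept


example_reads = [
    ("read1", "ACGTACGT", "IIIIIIII"),  # high quality, unambiguous
    ("read2", "ACGTNCGT", "IIIIIIII"),  # contains an ambiguous base
    ("read3", "ACGTACGT", "########"),  # very low quality scores
]
cleaned = clean_reads(example_reads)
print([identifier for identifier, _, _ in cleaned])  # ['read1']
```

Production pipelines typically also trim low-quality read ends and remove adapter sequences, which this sketch leaves out.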

Additional methods may be employed depending on the sequencing technology and the underlying biological question. For example, an assembly will be required if the sequenced reads are too short to infer any useful information. An assembly can also be used to construct whole genomes, providing useful information on the species. Furthermore, if the metabolic relationships within a microbial metagenome are to be understood, DNA sequences need to be translated into amino acid sequences, for example using gene prediction tools such as GeneMark or FragGeneScan.
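The translation step itself can be sketched as follows, assuming the standard genetic code and a reading frame that is already known. This is only an illustration: gene prediction tools such as GeneMark or FragGeneScan must first locate the coding regions, which the sketch does not attempt. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0; '*' marks stop, 'X' an unknown codon."""
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA*
```

Trailing bases that do not form a complete codon are ignored, and any codon containing an ambiguous nucleotide is reported as "X".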

Project output
The EMP planned four key outputs:
 * All primary data generated by the Earth Microbiome Project, regardless of their degree of conclusiveness, will be stored in a centralized database called the "Gene Atlas" (GA). The GA will hold sequence data, annotations and environmental metadata. Both known and unknown sequences, the latter termed "Dark Matter", will be included in the hope that the unknown sequences may eventually be characterized.
 * Assembled genomes, annotated using an automated pipeline, will be stored in "Earth Microbiome Assembled Genomes" (EM-AG) in public repositories. These will enable comparative genomic analysis.
 * Interactive visualizations of the data will be provided through the "Earth Microbiome Visualization Portal" (EM-VIP), which will allow the relationship between microbial makeup, environmental parameters, and genomic function to be viewed.
 * Reconstructed metabolic profiles will be offered through "Earth Microbiome Metabolic Reconstruction" (EMMR).

Challenges
The large amounts of sequence data generated from analyzing diverse microbial communities are a challenge to store, organize and analyze. The problem is exacerbated by the short reads produced by the high-throughput sequencing platform that will be the standard instrument used in the EMP. Improved algorithms, analysis tools, huge amounts of computer storage, and access to thousands of hours of supercomputer time will be necessary.

Another challenge is the large number of sequencing errors expected, and distinguishing them from actual diversity in the collected microbial samples. Next-generation sequencing technologies provide enormous throughput but lower accuracies than older sequencing methods. When sequencing a single genome, the intrinsic lower accuracy of these methods is more than compensated for by the ability to cover the entire genome multiple times in opposite directions from multiple start points, but this capability provides no improvement in accuracy when sequencing a diverse mixture of genomes.
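The way repeated coverage compensates for sequencing errors in a single genome can be sketched with a majority vote over overlapping reads. The example assumes the reads are already aligned, of equal length, and all derived from the same genome; the `consensus` function and the toy reads are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the most common base in each column of equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


# Five reads covering the same region; two of them carry a single error each.
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACCTACGT",  # error at position 3
    "ACGTACGA",  # error at position 8
    "ACGTACGT",
]
print(consensus(reads))  # ACGTACGT
```

In a diverse mixture this vote is not available: a variant base seen in one read may be a sequencing error or a genuine difference between organisms, and coverage alone cannot separate the two.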

Despite the issuance of standard protocols, systematic biases from lab to lab are expected. The need to amplify DNA from samples with low biomass will introduce additional distortions of the data. Assembly of genomes of even the dominant organisms in a diverse sample of organisms requires gigabytes of sequence data.

With advances in high-throughput sequencing technologies, many sequences are entering public databases with no experimentally determined function, having been annotated only on the basis of observed homology with a known sequence. A known sequence is used to annotate a first unknown sequence; the problem, now prevalent in public sequence databases and one the EMP must avoid, is that the first unknown sequence is then used to annotate a second unknown sequence, and so on. Because sequence homology is only a modestly reliable predictor of function, errors can propagate along such chains of annotation.