Genome sequencing of endangered species

Genome sequencing of endangered species is the application of next-generation sequencing (NGS) technologies in conservation biology, with the aim of generating life-history, demographic and phylogenetic data relevant to the management of endangered wildlife.

Background
In conservation biology, genomic technologies such as large-scale DNA sequencing can be used to highlight aspects of the biology of wildlife species for which management actions may be required. This may involve estimating recent demographic events, genetic variation, divergence between species and population structure. Genome-wide association studies (GWAS) are useful for examining the role of natural selection at the genome level and for identifying loci associated with fitness, local adaptation, inbreeding depression or disease susceptibility. Access to these data, together with the interrogation of genome-wide variation at SNP markers, can help identify the genetic changes that influence the fitness of wild species, and is also important for evaluating their potential to respond to changing environments. NGS projects are expected to rapidly increase the number of threatened species for which assembled genomes and detailed information on sequence variation are available, and the data will advance investigations relevant to the conservation of biological diversity.

Non-computational methods
The traditional approaches to the preservation of endangered species are captive breeding and private farming. In some cases these methods have produced good results, but some problems remain. For example, when only a few individuals are bred together, the gene pool of a subpopulation remains limited or may shrink.

Phylogenetic analysis and gene family estimation
Genetic analyses can remove subjective elements from the determination of the phylogenetic relationships between organisms. Given the great variety of information provided by living organisms, the type of data affects both the method of treatment and the validity of the results: the more closely the data correlate with genotype, the greater the validity is likely to be. Data analysis can be used to compare different sequence databases and find similar sequences, or similar proteins, in different species. The comparison can be done with alignment-based software to estimate the divergence between species and evaluate their similarities.
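As a minimal sketch of how divergence can be read off an alignment, the following Python example computes the proportion of differing sites (p-distance) between two already-aligned DNA sequences and applies the Jukes-Cantor correction for multiple substitutions. The function names and the toy sequences are illustrative assumptions; real analyses rely on dedicated alignment and phylogenetics software.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap ('-') or an ambiguous
    base ('N') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor (JC69) distance from a p-distance.

    The correction is only defined for p < 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p-distance must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical toy alignment: one difference in eight sites.
p = p_distance("ACGTACGT", "ACGTACGA")
d = jukes_cantor(p)
```

The corrected distance is always at least as large as the raw p-distance, because it accounts for sites that may have changed more than once.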

NGS/Advanced sequencing methodologies
Because whole-genome sequencing is generally very data-intensive, reduced-representation genomic approaches are sometimes used for practical applications. For example, restriction site-associated DNA sequencing (RADseq) and double digest RADseq (ddRADseq) are being developed. With these techniques researchers can target different numbers of loci. Combined with statistical and bioinformatic analysis, this lets scientists draw inferences about large genomes by focusing on a small representative part of them.
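The idea of sampling a reduced, tunable set of loci can be illustrated with a simplified in silico double digest. The sketch below assumes complete digestion, a single linear sequence, and the enzyme pair EcoRI (G^AATTC) and MspI (C^CGG), chosen here only for illustration. It keeps fragments that are flanked by one cut from each enzyme and fall inside a size-selection window. All function and variable names are hypothetical.

```python
import re


def cut_positions(seq, site, offset):
    """Cut coordinates for one enzyme: start of each site plus the cut offset.

    A lookahead is used so that overlapping sites are also found.
    """
    pattern = f"(?={re.escape(site)})"
    return [m.start() + offset for m in re.finditer(pattern, seq)]


def double_digest(seq, enzyme_a, enzyme_b, min_len, max_len):
    """Return (start, end) of fragments flanked by different enzymes.

    enzyme_a and enzyme_b are (recognition_site, cut_offset) tuples.
    Only fragments between adjacent cuts are considered, so no kept
    fragment contains an internal cut site.
    """
    cuts = sorted(
        [(pos, "A") for pos in cut_positions(seq, *enzyme_a)]
        + [(pos, "B") for pos in cut_positions(seq, *enzyme_b)]
    )
    fragments = []
    for (start, enz1), (end, enz2) in zip(cuts, cuts[1:]):
        if enz1 != enz2 and min_len <= end - start <= max_len:
            fragments.append((start, end))
    return fragments


ECORI = ("GAATTC", 1)  # assumed illustrative enzyme: G^AATTC
MSPI = ("CCGG", 1)     # assumed illustrative enzyme: C^CGG

toy_genome = "TTGAATTCAAAACCGGTTTTGAATTCGG"
all_fragments = double_digest(toy_genome, ECORI, MSPI, 1, 100)
size_selected = double_digest(toy_genome, ECORI, MSPI, 5, 9)
```

Narrowing or widening the size window changes how many fragments are retained, which is how the number of targeted loci can be adjusted in this simplified model.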

Statistical and computational methods
When solving biological problems, one encounters multiple types of genomic data, or sometimes an aggregate of the same type of data across multiple studies, and decoding such a large amount of data manually is tedious and infeasible. Integrated analysis of genomic data using statistical methods has therefore become popular. The rapid advancement of high-throughput technologies allows researchers to answer more complex biological questions, enabling the development of statistical methods in integrated genomics to establish more effective therapeutic strategies for human disease.

Genome crucial features
When studying a genome, some crucial aspects should be taken into consideration. Gene prediction is the identification of genetic elements in a genomic sequence. It is based on a combination of approaches: de novo prediction, homology-based prediction, and transcript-based evidence. Tools such as EvidenceModeler are used to merge the different results. Gene structure has also been compared, including mRNA length, exon length, intron length, exon number, and non-coding RNA.
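The gene-structure measures listed above can be derived directly from exon coordinates. The following sketch assumes a single transcript whose exons are given as 1-based, inclusive (start, end) pairs that do not overlap, as in common annotation formats. The function name and the example coordinates are hypothetical.

```python
def gene_structure(exons):
    """Summarise one transcript from its exon coordinates.

    exons: list of (start, end) tuples, 1-based and inclusive,
    assumed not to overlap.
    """
    if not exons:
        raise ValueError("at least one exon is required")
    exons = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in exons]
    # An intron spans the bases strictly between two consecutive exons.
    intron_lengths = [
        next_start - end - 1
        for (_, end), (next_start, _) in zip(exons, exons[1:])
    ]
    return {
        "exon_number": len(exons),
        "exon_lengths": exon_lengths,
        "intron_lengths": intron_lengths,
        "mrna_length": sum(exon_lengths),  # spliced transcript length
    }


# Hypothetical three-exon transcript.
summary = gene_structure([(100, 199), (300, 349), (500, 599)])
```

Computing the same summary for every annotated gene in two genomes gives the distributions that are typically compared between species.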

Analysis of repeated sequences has been found useful in reconstructing species divergence timelines.

Genomic approach in sex determination
To preserve a species, knowledge of its mating system is crucial: scientists can stabilize wild populations through captive breeding, followed by the release of new individuals into the environment. This task is particularly difficult for species with homomorphic sex chromosomes and a large genome. In amphibians, for example, there have been multiple transitions between male and female heterogamety. Variation of sex chromosomes has sometimes even been reported within amphibian populations of the same species.

Japanese giant salamander
The multiple transitions between XY and ZW systems that occur in amphibians make sex chromosome systems labile in salamander populations. By understanding the chromosomal basis of sex in these species, it is possible to reconstruct the phylogenetic history of their families and use more efficient strategies for their conservation.

Using the ddRADseq method, scientists found new sex-related loci in a 56 Gb genome of the family Cryptobranchidae. Their results support the hypothesis of female heterogamety in this species. The loci were confirmed through bioinformatic analysis of the presence or absence of each locus in individuals of known sex, whose sex had previously been established by ultrasound, laparoscopy and measurement of serum calcium level differences. Candidate sex-linked loci were determined so as to test hypotheses of both female and male heterogamety. Finally, to evaluate their validity, the loci were amplified by PCR directly from samples of known-sex individuals. This final step demonstrated female heterogamety in several divergent populations of the family Cryptobranchidae.
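The presence/absence logic described above can be sketched as a simple filter. This is not the study's actual pipeline: it is a minimal illustration that assumes each locus comes with the set of known-sex individuals in which it was detected, and it applies a strict criterion (present in every individual of one sex, absent from every individual of the other). Female-specific loci would be consistent with female heterogamety (ZW), male-specific loci with male heterogamety (XY). All names and data are hypothetical.

```python
def sex_specific_loci(presence, sex):
    """Classify loci by presence/absence in known-sex individuals.

    presence: dict mapping locus ID -> iterable of individual IDs
              in which the locus was detected.
    sex:      dict mapping individual ID -> "F" or "M".
    """
    females = {ind for ind, s in sex.items() if s == "F"}
    males = {ind for ind, s in sex.items() if s == "M"}
    result = {"female_specific": [], "male_specific": []}
    for locus, carriers in sorted(presence.items()):
        carriers = set(carriers)
        if females and females <= carriers and not (carriers & males):
            result["female_specific"].append(locus)
        elif males and males <= carriers and not (carriers & females):
            result["male_specific"].append(locus)
    return result


# Hypothetical toy data: two females and two males.
sex = {"f1": "F", "f2": "F", "m1": "M", "m2": "M"}
presence = {
    "loc1": {"f1", "f2"},              # all females, no males
    "loc2": {"f1", "f2", "m1", "m2"},  # everyone: not sex-linked
    "loc3": {"m1", "m2"},              # all males, no females
    "loc4": {"f1"},                    # only some females
}
candidates = sex_specific_loci(presence, sex)
```

In practice a strict all-or-nothing rule is sensitive to missing data and allele dropout, which is one reason candidate loci are validated independently, for instance by PCR as described above.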

Dryas monkey and golden snub-nosed monkey


A recent study used whole-genome sequencing data to demonstrate the sister-lineage relationship between the Dryas monkey and the vervet monkey, and their divergence with additional bidirectional gene flow approximately 750,000 to 500,000 years ago. Although fewer than 250 adult individuals remain, the study found high genetic diversity and low levels of inbreeding and genetic load in the Dryas monkey individuals studied.

Another study used several techniques, including single-molecule real-time sequencing, paired-end sequencing, optical maps, and high-throughput chromosome conformation capture, to obtain a high-quality chromosome-level assembly from an existing incomplete and fragmented genome assembly of the golden snub-nosed monkey. The techniques used in this study yielded a 100-fold improvement in the genome assembly, with 22,497 protein-coding genes, the majority of which were functionally annotated. The reconstructed genome showed a close relationship between the species and the rhesus macaque, indicating a divergence approximately 13.4 million years ago.

Plants
Plant species identified as PSESP ("plant species with extremely small populations") have been the focus of genomic studies aimed at determining the most endangered populations. Genomic DNA can be extracted from fresh leaves and sequenced. Combining different sequencing techniques can yield high-quality data for assembling the genome. RNA extraction is essential for transcriptome assembly, and the extraction starts from stems, roots, fruits, buds and leaves. De novo genome assembly can be performed with software that optimizes assembly and scaffolding. Software can also be used to fill gaps and reduce the interaction between chromosomes. The combined data can be used to identify orthologous genes in different species, construct phylogenetic trees, and carry out interspecific genome comparisons.

Limits and future perspectives
The development of indirect sequencing methods has to some degree mitigated the lack of efficient DNA sequencing technologies. These techniques allowed researchers to increase scientific knowledge in fields such as ecology and evolution. Several genetic markers, more or less well suited to the purpose, were developed, helping researchers address many issues, including demography and mating systems, population structure and phylogeography, speciation processes and species differences, hybridization and introgression, and phylogenetics at many temporal scales. However, all these approaches shared a primary deficiency: they were limited to a fraction of the genome, so that genome-wide parameters were inferred from a tiny amount of genetic material.

The invention and rise of DNA sequencing methods greatly increased the data available to conservation biology. The ongoing development of cheaper, high-throughput sequencing allowed the production of a wide array of information in several disciplines, giving conservation biologists a powerful databank from which to extract useful information about, for example, population structure, genetic connections, and potential risks due to demographic changes and inbreeding, through population-genomic approaches that rely on the detection of SNPs, indels or CNVs. On the one hand, data derived from high-throughput sequencing of whole genomes were potentially a massive advance for species conservation, opening the door to future challenges and opportunities. On the other hand, these data confronted researchers with two main issues: first, how to process all this information; second, how to translate the available information into conservation strategies and practice or, in other words, how to bridge the gap between genomic research and conservation application.
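As one hedged illustration of a SNP-based summary relevant to inbreeding risk, the sketch below computes observed and expected heterozygosity per locus and a simple population-level inbreeding coefficient, F = 1 - Ho/He. It assumes biallelic SNPs coded as the number of alternate alleles per individual (0, 1 or 2), no missing genotypes, and Hardy-Weinberg expectations for He. The function names and data are hypothetical, and this is only one of several estimators in use.

```python
def heterozygosity(genotypes):
    """Observed and expected heterozygosity for one biallelic SNP.

    genotypes: list of alternate-allele counts (0, 1 or 2),
    one entry per diploid individual.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("no genotypes supplied")
    alt_freq = sum(genotypes) / (2 * n)
    observed = sum(1 for g in genotypes if g == 1) / n
    expected = 2 * alt_freq * (1 - alt_freq)  # Hardy-Weinberg expectation
    return observed, expected


def inbreeding_coefficient(loci):
    """F = 1 - mean(Ho) / mean(He) across loci."""
    pairs = [heterozygosity(g) for g in loci]
    mean_obs = sum(o for o, _ in pairs) / len(pairs)
    mean_exp = sum(e for _, e in pairs) / len(pairs)
    if mean_exp == 0:
        raise ValueError("no polymorphic loci")
    return 1 - mean_obs / mean_exp


# Hypothetical data: four individuals genotyped at two SNPs.
snp_matrix = [
    [0, 1, 2, 1],  # heterozygotes at the expected frequency
    [0, 0, 2, 2],  # no heterozygotes despite both alleles being common
]
f_value = inbreeding_coefficient(snp_matrix)
```

Values near zero indicate heterozygosity close to the random-mating expectation, while positive values indicate a deficit of heterozygotes, as can arise from inbreeding or population substructure.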

Unfortunately, approaches involving genome-wide sequencing raise many analytical and practical problems. Availability of samples is a major limiting factor: sampling procedures may disturb an already fragile population or may have a large impact on individual animals, which limits sample collection. For these reasons several alternative strategies were developed: constant monitoring, for example with radio collars, makes it possible to understand behavior and to develop strategies for obtaining genetic samples and managing endangered populations. The samples taken from those species are then used to produce primary cell cultures from biopsies. This kind of material makes it possible to grow cells in vitro and to extract and study genetic material without constantly sampling the endangered populations. Despite faster and easier data production and continuous improvement of sequencing technologies, data analysis and processing techniques still lag markedly behind. Genome-wide analyses and studies of large genomes require advances in bioinformatics and computational biology. At the same time, improvements in statistical programs and in population genetics are required to devise better conservation strategies. This last aspect works in parallel with prediction strategies, which should take into consideration all the features that determine the fitness of a species.