BioSamples

BioSamples (BioSD) is a database at the European Bioinformatics Institute that holds information about the biological samples used in sequencing.

It stores submitter-supplied metadata about the biological materials from which data stored in the National Center for Biotechnology Information’s (NCBI) primary data archives are derived. NCBI’s archives host data pertaining to diverse types of samples from many species, and the BioSample database is similarly diverse. Examples of a BioSample include a primary tissue biopsy, an individual organism, or an environmental isolate.

The BioSamples database captures sample metadata in a structured way by encouraging the use of controlled vocabularies for sample attribute field names. This metadata is key to giving the sample data context: it allows the data to be more fully understood and reused, and it enables aggregation of disparate data sets.

Sample metadata is linked to relevant experimental data across many archival databases, which relieves submitter burden by enabling one-time submission of a sample description. Submitters can then reference that sample whenever they deposit data in other archives.

BioSample records are indexed and searchable, supporting cross-database queries by sample description.

History
The BioSamples database was launched in 2011 to help aggregate and standardise sample metadata. Historically, each archive had created its own convention for sample metadata collection. These conventions were usually limited in their standardisation and had no way to indicate when a sample was used across multiple data sets. In addition, there is a growing awareness amongst the research community that sample metadata is vital for understanding the underlying data. Further, improved metadata increases the chances for re-use, aggregation and integration of data. The database was initially populated with existing descriptions extracted from SRA, EST, GSS and dbGaP. As of May 2013, the database hosts almost 2 million BioSample records encompassing 18,000 species.

Content
The BioSamples database has more than doubled in size since January 2012, when it described 1 million samples; as of October 2013, 2,846,137 samples are available in 80,232 groups. The rapid growth is predominantly due to new data sources and an increased volume of data from existing sources. New data sources include 22,288 samples from The Cancer Genome Atlas and 920,441 samples from the Catalogue of Somatic Mutations in Cancer (COSMIC). Attributes define the material under investigation using structured name: value pairs, for example:

After specifying the sample type, the user is presented with a list of required and optional attribute fields to fill in, as well as the opportunity to supply any number of custom descriptive attributes. The BioSample database is extendible in that new types and attributes can be added as new standards develop. In addition to BioSample type and attributes, each BioSample record also contains:

The full list and definitions of BioSample types and attributes are available for preview and download.
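As a rough illustration of how required, optional and custom attributes relate to one another, the following minimal sketch checks a set of name: value pairs against a sample-type definition. The field names (`organism`, `sample type`, `sex`, `tissue`) and the function `check_attributes` are hypothetical and chosen only for illustration; they are not taken from the database's actual attribute list.

```python
# Hypothetical sample-type definition: these field names are illustrative
# assumptions, not the database's actual attribute list.
REQUIRED = {"organism", "sample type"}
OPTIONAL = {"sex", "tissue"}


def check_attributes(attributes):
    """Sort a sample's attribute names into missing, known, and custom fields."""
    names = set(attributes)
    return {
        "missing": sorted(REQUIRED - names),            # required but absent
        "known": sorted(names & (REQUIRED | OPTIONAL)),  # defined for this type
        "custom": sorted(names - REQUIRED - OPTIONAL),   # free-form additions
    }


sample = {
    "organism": "Homo sapiens",
    "tissue": "liver",
    "collection site": "clinic A",  # a custom descriptive attribute
}
report = check_attributes(sample)
print(report)
```

Keeping custom attributes separate from the defined ones reflects the idea that a record may carry any number of extra descriptive fields without invalidating it.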

Data Access
There are a number of ways in which the database can be accessed. The initial public release of BioSD provided access to the database only through a web interface. This web interface was updated in November 2012 and then again in March 2013 following the EBI site-wide re-launch. In February 2013, a public Application Programming Interface (API) based on Representational State Transfer (REST) was released. In October 2013, as part of the EBI's new RDF platform, a SPARQL endpoint was released, providing access to the data in RDF format. Additionally, the database can be downloaded through the EBI's FTP service.

Web Interface
The web interface allows users to access the BioSD database through a web browser. It supports searching both by sample groups and by individual samples. The search box offers incremental search, suggesting possible search terms as the user types. An advanced search is also provided, in which users can combine search terms with the Boolean operators AND, OR and NOT. Additionally, a wildcard character can be used to match any combination of characters, including no characters, and a question mark character can be used to match any single character. Examples of these can be seen in the following table:

The web interface also allows users to select a search result and view further details of it. The detailed view provides further information and makes available a link to the assay database(s) from which the data was sourced. Results can also be ordered by column.
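The wildcard behaviour described for the advanced search can be sketched in a few lines. This is only an illustration of the matching rules as stated above, not the database's actual search implementation: it assumes that `*` is the wildcard character, and the helper names `wildcard_to_regex` and `matches` are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a search term with * and ? wildcards into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # any run of characters, including none
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, text):
    """Return True if the whole text matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None


print(matches("leuk*mia", "leukaemia"))  # * can stand for "ae"
print(matches("leuk?mia", "leukaemia"))  # ? stands for one character only
```

Under these rules `leuk*mia` matches both spellings "leukemia" and "leukaemia", while `leuk?mia` matches only "leukemia".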

Application Programming Interface
The API provides a way to retrieve data programmatically. It uses a RESTful design in which users query URI endpoints and receive XML as results. The API has URI endpoints for several types of request: finding a specific sample, finding a specific group, searching for groups, searching for samples, and searching for samples within a group.
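The general pattern of building a request URI and reading an XML result can be sketched as follows. Everything specific here is an assumption made for illustration: the base URL, the path layout, the `q` parameter and the `<Sample id="...">` element are hypothetical placeholders, not the documented BioSD API. The sketch parses a stand-in XML string rather than contacting a server.

```python
from urllib.parse import quote, urlencode
import xml.etree.ElementTree as ET

# Hypothetical base URL and paths: placeholders, not the real service layout.
BASE_URL = "https://example.org/biosamples/api"


def sample_url(accession):
    """URI for one specific sample (hypothetical path)."""
    return f"{BASE_URL}/sample/{quote(accession, safe='')}"


def search_url(kind, query):
    """URI for searching samples or groups (hypothetical path and parameter)."""
    if kind not in ("sample", "group"):
        raise ValueError("kind must be 'sample' or 'group'")
    return f"{BASE_URL}/{kind}/query?" + urlencode({"q": query})


def accessions_from_xml(xml_text):
    """Collect the 'id' attribute of every <Sample> element (assumed layout)."""
    root = ET.fromstring(xml_text)
    return [el.get("id") for el in root.iter("Sample")]


# Stand-in response body; a real response would come from an HTTP GET.
example_xml = """<ResultQuery>
  <Sample id="EXAMPLE1"/>
  <Sample id="EXAMPLE2"/>
</ResultQuery>"""

print(search_url("sample", "liver AND human"))
print(accessions_from_xml(example_xml))
```

Separating URI construction from XML parsing keeps each step testable without network access; the actual endpoint paths and response schema should be taken from the API documentation.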

SPARQL Endpoint
The SPARQL endpoint allows users to search the database in a more comprehensive way than the standard web interface, whilst still being usable from a web browser. Through this interface, far more complex queries can be made. However, this method of accessing the data has a steeper learning curve. The SPARQL endpoint returns results in the RDF format, which was originally designed with metadata in mind and is thus suited to the needs of BioSD.
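A SPARQL endpoint is typically queried by sending the query text as the `query` parameter of an HTTP request, as defined by the SPARQL Protocol. The sketch below prepares such a request without sending it. The endpoint URL is a hypothetical placeholder, and the query is a generic one that makes no assumptions about the BioSD data model.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL: a placeholder, not the real service address.
ENDPOINT = "https://example.org/biosamples/sparql"

# Generic query: return up to five arbitrary triples as an RDF graph.
QUERY = """
CONSTRUCT { ?s ?p ?o }
WHERE     { ?s ?p ?o }
LIMIT 5
"""


def build_sparql_request(endpoint, query):
    """Prepare (but do not send) a SPARQL Protocol GET request asking for RDF/XML."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/rdf+xml"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
# Sending it would be: urllib.request.urlopen(request).read()
```

A `CONSTRUCT` query is used here because it returns an RDF graph; a `SELECT` query would instead return a table of variable bindings and would normally be requested with a different `Accept` header.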

Development
The development team forms part of Helen Parkinson's team at EMBL-EBI and consists of software engineers and web developers, who are assisted with domain-specific knowledge by ontologists and bioinformaticians.

The primary programming language used on the project is Java. The development team uses IntelliJ IDEA, an integrated development environment provided by JetBrains. Other tools used in the project include Bamboo, for continuous integration and the management of software releases, and YourKit, a Java profiler that helps optimise the BioSD software and eliminate bugs.

The project is developed as an open-source project with all source code being freely available on GitHub.

Funding
Currently, the primary funding for BioSD development and maintenance is provided by the European Molecular Biology Laboratory (EMBL) core budget, which is in turn funded by its 20 member countries. There have also been additional contributions from the European Commission in the form of a number of grants. Further funding has come from the Human Induced Pluripotent Stem Cells Initiative, provided by the Wellcome Trust and the Medical Research Council, and from the EBiSC Innovative Medicines Initiative.