User:Jkbonfield/sandbox

CRAM is a compressed columnar file format for storing biological sequences aligned to a reference sequence. It was devised by Markus Hsi-Yang Fritz et al.

CRAM was designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving the compression ratio. CRAM files are typically 30% to 60% smaller than the equivalent BAM files, depending on the data held within them.
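The reference-based idea can be illustrated with a minimal Python sketch. It is not the CRAM encoding itself: it assumes a gap-free alignment (no insertions, deletions or clipping), and the function names and the tuple layout are hypothetical, chosen only for illustration. A read is stored as its position, its length and the list of bases that differ from the reference, and the full sequence is rebuilt from the reference on decoding.

```python
def encode_against_reference(reference, pos, read):
    """Describe a gap-free aligned read as (pos, length, mismatches).

    Each mismatch is (offset_within_read, read_base). Hypothetical layout,
    used only to illustrate storing differences instead of full sequences.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs


def decode_against_reference(reference, pos, length, diffs):
    """Rebuild the read by copying the reference and applying the mismatches."""
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"
read = "TACCTACG"  # aligned at 0-based position 3 with a single mismatch
encoded = encode_against_reference(reference, 3, read)
decoded = decode_against_reference(reference, *encoded)
```

In this sketch a read that matches the reference exactly has an empty mismatch list, so only its position and length need to be stored.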

File Format
The basic structure of a CRAM file is a series of containers, the first of which holds a compressed copy of the SAM header. Subsequent containers consist of a container compression header followed by a series of slices, which in turn hold the alignment records themselves, formatted as a series of blocks. An alignment record can never be split over multiple slices, in contrast to BAM where a record may span multiple BGZF blocks.
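The nesting described above can be modelled roughly as follows. This is a sketch of the hierarchy only, not of the on-disk layout: all class names, field names and the `build_container` helper are assumptions made for illustration, the header bytes are a placeholder, and each slice here holds a single block for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Block:
    content_id: int  # hypothetical identifier for the block
    data: bytes      # payload (would be compressed in a real file)


@dataclass
class Slice:
    blocks: List[Block] = field(default_factory=list)
    record_count: int = 0


@dataclass
class Container:
    compression_header: Dict[str, dict] = field(default_factory=dict)
    slices: List[Slice] = field(default_factory=list)


@dataclass
class CramFile:
    sam_header: bytes  # first container: copy of the SAM header
    data_containers: List[Container] = field(default_factory=list)


def build_container(records, records_per_slice):
    """Pack whole records into slices; a record is never split across slices."""
    container = Container()
    for start in range(0, len(records), records_per_slice):
        chunk = records[start:start + records_per_slice]
        payload = "\n".join(chunk).encode("ascii")
        container.slices.append(
            Slice(blocks=[Block(content_id=0, data=payload)],
                  record_count=len(chunk)))
    return container


def total_records(cram):
    """Count records by walking containers and then slices."""
    return sum(s.record_count for c in cram.data_containers for s in c.slices)


records = ["r1", "r2", "r3", "r4", "r5"]  # stand-ins for alignment records
cram = CramFile(sam_header=b"@HD\tVN:1.6",
                data_containers=[build_container(records, 2)])
```

Because slices are filled with whole records, the last slice in this sketch simply holds fewer records rather than a partial one.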

CRAM constructs records from a set of data series, each describing one component of an alignment. The container compression header specifies which data series is encoded in which block, which codec is used, and any codec-specific metadata (for example, a table of Huffman symbol code lengths). While data series can be mixed together within the same block, keeping them separate usually improves compression and allows efficient selective decoding when only some data types are required.
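A small Python sketch can show how a compression header ties data series to blocks and makes selective decoding possible. The series names (`flag`, `position`, `read_length`), the dictionary layout and the comma-separated text encoding are all assumptions made for illustration; zlib stands in for whichever codec a real file would name, and none of this reflects the actual CRAM byte layout.

```python
import zlib

# Toy alignment records; the field names are hypothetical data series.
records = [
    {"flag": 99, "position": 1000, "read_length": 100},
    {"flag": 147, "position": 1010, "read_length": 100},
    {"flag": 99, "position": 1025, "read_length": 98},
]


def encode_columns(records, series_names):
    """Write each data series to its own block and record the mapping."""
    header = {}  # data series name -> which block and codec hold it
    blocks = {}  # block id -> compressed bytes
    for block_id, name in enumerate(series_names):
        column = ",".join(str(r[name]) for r in records).encode("ascii")
        blocks[block_id] = zlib.compress(column)
        header[name] = {"block": block_id, "codec": "zlib"}
    return header, blocks


def decode_series(header, blocks, name):
    """Decode one data series, touching only the block that holds it."""
    entry = header[name]
    if entry["codec"] != "zlib":
        raise ValueError("unsupported codec: " + entry["codec"])
    raw = zlib.decompress(blocks[entry["block"]])
    if not raw:
        return []
    return [int(v) for v in raw.decode("ascii").split(",")]


header, blocks = encode_columns(records, ["flag", "position", "read_length"])
positions = decode_series(header, blocks, "position")
```

Here a reader interested only in alignment positions looks up `position` in the header and decompresses that one block, leaving the others untouched.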

Implementations of CRAM exist in htslib, htsjdk and JBrowse.

The file format specification is maintained by the Global Alliance for Genomics and Health (GA4GH), with the specification document available from the EBI CRAM toolkit page.

History
The initial paper (2011) describing the reference-based format did not use the name CRAM. The accompanying software was implemented in Python as a prototype and demonstration of the basic concepts.

Versions 0.3 to 0.86: 2011 - 2012
Vadim Zalunin (European Bioinformatics Institute (EBI)) produced the first implementation named CRAM as a package called CRAMtools, written in the Java programming language.

Version 1.0: 2012
Version 1.0 was the first official release with a published specification. The initial implementation was CRAMtools (moved to a new repository), followed by a C implementation by James Bonfield (Wellcome Sanger Institute) in Io_lib's Scramble tool (part of the Staden Package) in 2013.

Versions 2.0, 2.1: 2013
Development of the second (C) implementation led to a number of specification changes, which became version 2.0. These included support for more than one reference per slice (useful with highly fragmented assemblies), better encoding of SAM auxiliary tags, splitting soft-clip and inserted bases into their own data series, metadata to track the number of records and bases per slice, and corrections to the BF (BAM flag) data series.

Version 2.1 (2014) added EOF blocks, to help identify truncated files, but is otherwise identical.

Version 3.0: 2014
The primary improvements in CRAM 3.0 came from the inclusion of the LZMA and rANS codecs for block compression, along with multiple checksums for ensuring data integrity.