Comprehensive Antibiotic Resistance Database

The Comprehensive Antibiotic Resistance Database (CARD) is a biological database that collects and organizes reference information on antimicrobial resistance genes, proteins and phenotypes. The database covers all drug classes and resistance mechanisms and structures its data using an ontology. CARD was one of the first resources to cover antimicrobial resistance genes. It is updated monthly and provides tools that allow users to find potential antibiotic resistance genes in newly sequenced genomes.

Ontology
Each resistance determinant described by the CARD Antibiotic Resistance Ontology (ARO) must include a connection to each of three branches: Determinant of Antibiotic Resistance, Antibiotic Molecule and Mechanism of Antibiotic Resistance. CARD has recently also launched draft ontologies for both virulence and mobile genetic elements, which are in active development.
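
The three-branch requirement can be pictured as a simple completeness check. The following is a minimal sketch only: the AroTerm class, its links field and the missing_branches function are hypothetical names chosen for illustration and do not reflect CARD's actual schema or software. Only the three branch names come from the text above.

```python
from dataclasses import dataclass, field

# The three ARO branches named in the text.
REQUIRED_BRANCHES = frozenset({
    "Determinant of Antibiotic Resistance",
    "Antibiotic Molecule",
    "Mechanism of Antibiotic Resistance",
})


@dataclass
class AroTerm:
    """Hypothetical, simplified stand-in for an ARO term (not CARD's schema)."""
    name: str
    # Maps a branch name to the terms this determinant is linked to in it.
    links: dict[str, list[str]] = field(default_factory=dict)


def missing_branches(term: AroTerm) -> set[str]:
    """Return the required branches to which the term has no connection."""
    connected = {branch for branch, targets in term.links.items() if targets}
    return set(REQUIRED_BRANCHES - connected)


# Illustrative values only, not copied from CARD.
example = AroTerm(
    name="example determinant",
    links={
        "Antibiotic Molecule": ["example antibiotic"],
        "Mechanism of Antibiotic Resistance": ["example mechanism"],
    },
)
print(missing_branches(example))
```

Under these assumptions, a term is considered complete only when missing_branches returns an empty set.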

Curation
CARD curation occurs continuously, with monthly updates released by a team of biocurators. The curation process primarily involves regular review of the available scientific literature. Enforced curation guidelines provide the necessary context to ensure proper hierarchical classification, defined semantic relationships and data standardization. The biocuration team additionally annotates each ARO term with supplemental information from external references, including relevant publications, chemical structures or protein structure via the Protein Data Bank. ARO terms for AMR determinants are paired with an AMR detection model, which includes the nucleotide and peptide sequence retrieved from NCBI GenBank and any additional parameters needed for prediction of the determinant from raw DNA sequence. Curation is sometimes supplemented with de novo analyses, often to resolve problematic nomenclature.

Overall, CARD’s primary curation paradigm is as follows: to be included in CARD, an AMR determinant must be described in a peer-reviewed scientific publication, with its DNA sequence available in GenBank and with clear experimental evidence of elevated minimum inhibitory concentration (MIC) over controls. AMR genes predicted by in silico methods, but not experimentally characterized, are not included in CARD’s primary curation. However, data harmonization efforts in 2019, which involved a comparison of ResFinder, ARG-ANNOT and NCBI’s catalog of β-lactamase alleles, revealed a large number of historical β-lactamases without an associated peer-reviewed publication. β-lactamases comprise nearly a third of ARO terms in CARD, and by convention each β-lactamase sequence variant is given a new name in the literature. Requiring a publication for each variant therefore left β-lactamase reference sequences missing from CARD, which led to annotation imprecision by the Resistance Gene Identifier (RGI) and notable content differences between CARD and other databases. CARD now includes β-lactamase reference sequences and names even if they lack published experimental evidence of elevated MIC. This back-curation of older β-lactamase sequences is ongoing.

A large part of CARD’s value is expert human biocuration of AMR sequence data and its relationship to antibiotics. However, with AMR publications in PubMed exceeding 5000 per year for the last 10 years, the task of keeping CARD both comprehensive and up to date is daunting. CARD addresses this problem using three approaches: ad hoc biocuration, pathogen AMR reviews, and computer-assisted literature triage. Ad hoc biocuration involves addressing feedback from the AMR research community as well as literature discovered during quality-control checks or review of AMR gene nomenclature. Pathogen AMR review involves systematic review of the AMR literature for specific pathogens. In 2017, the CARD*Shark text-mining algorithm was introduced for computer-assisted literature triage, and it has since been expanded based on the new ARO Drug Class classification tags. CARD*Shark assigns priority scores to publications from a general PubMed Medical Subject Headings (MeSH) search based on relevance and assigns the results to a CARD biocurator for manual review.

The CARD curation team continuously updates the database on a development server. Prior to release, rigorous quality-control (QC) scripts validate these data before they are ported to the publicly available website. These QC steps verify the use of external identifiers, publication citations, AMR detection model parameters and imposed rules for the ontology structure. Any detected issues are resolved prior to release. After QC, the public CARD website is updated monthly. The website also includes a built-in BLAST instance for comparing sequences to CARD reference sequences and a web instance of RGI for resistome prediction with data visualization tools.

Community involvement
In response to the 2019 European Commission's Joint Research Centre (JRC) AMR Databases Workshop, the ‘AMR_Curation’ public repository was established for collective curation of AMR genes and mutations, involving the majority of AMR database curators (e.g. NCBI, ResFinder and MEGARes). The repository provides an active and monitored curation issue tracker, a parallel AMR curation mailing list, an editable Google Spreadsheet List of AMR Databases and Software, and a curated Wikipedia list of AMR Databases, all accessible at https://github.com/arpcard/amr_curation. CARD encourages researchers, software developers and AMR data curators to use this repository and associated resources to submit, discuss and resolve AMR curation issues. Anyone with a GitHub account can submit an issue.

AMR detection models
CARD’s Model Ontology includes reference nucleotide and protein sequences, as well as additional search parameters including mutations conferring AMR (if applicable) and curated BLAST(P/N) bit score cut-offs. The majority of CARD AMR determinants use either a protein homolog model (PHM) or a protein variant model (PVM). PHMs predict AMR protein sequences from raw DNA sequence by homology to a curated reference sequence, using a curated BLAST bit score cut-off. PVMs perform a similar search, but include additional parameters for the detection of specific curated non-synonymous mutations or other genetic variants (i.e. INDELs, frameshifts) that differentiate between antibiotic-susceptible wild-type and antibiotic-resistant alleles.
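
The difference between the two model types can be illustrated with a small sketch. This is not RGI's implementation: the DetectionModel class, the carries and predicts_resistance functions, and the "S83L"-style mutation notation are all assumptions made for illustration. The sketch also assumes that the bit score comes from a BLAST search run elsewhere, that the query protein aligns to the reference without gaps (so residue positions are directly comparable), and that only single amino acid substitutions are checked.

```python
from dataclasses import dataclass


@dataclass
class DetectionModel:
    """Hypothetical, simplified detection model (not CARD's actual schema)."""
    name: str
    bit_score_cutoff: float
    # Empty for a protein homolog model. For a protein variant model, curated
    # substitutions written as wild-type residue, 1-based position, mutant
    # residue (for example "S83L"). This notation is an assumption.
    resistance_mutations: frozenset[str] = frozenset()


def carries(query: str, mutation: str) -> bool:
    """True if the query has the mutant residue at the stated position."""
    position, mutant = int(mutation[1:-1]), mutation[-1]
    return len(query) >= position and query[position - 1] == mutant


def predicts_resistance(model: DetectionModel, bit_score: float, query: str) -> bool:
    """Apply the homolog check, then the variant check if the model has one."""
    if bit_score < model.bit_score_cutoff:
        return False  # too dissimilar to the curated reference
    if not model.resistance_mutations:
        return True  # homolog model: passing the cut-off is sufficient
    # Variant model: at least one curated mutation must also be present.
    return any(carries(query, m) for m in model.resistance_mutations)


# Toy values, chosen only to exercise both branches.
homolog = DetectionModel("toy homolog model", bit_score_cutoff=500.0)
variant = DetectionModel("toy variant model", 300.0, frozenset({"A3L"}))
print(predicts_resistance(homolog, 620.0, "MKA"))
print(predicts_resistance(variant, 350.0, "MKA"))
print(predicts_resistance(variant, 350.0, "MKL"))
```

In this sketch a wild-type sequence that passes the cut-off of a variant model is still reported as susceptible, which is the distinction the text draws between PHMs and PVMs.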

From 2017, CARD transitioned each detection model to curated BLAST bit score cut-offs, discontinuing use of less discriminatory BLAST expectation values (E). To select a cut-off, the curated reference sequence is compared by BLAST against CARD itself and against GenBank's non-redundant database, and the results are inspected by hand to determine a value that correctly classifies matches either as homologs of similar antimicrobial function or as similar proteins with a different function or AMR Gene Family membership. The asymptotic nature of the BLAST expectation value (E) gave it very low discriminatory power between different β-lactamase gene families (nearly a third of CARD’s content), but the linear nature of the BLAST bit score allowed this level of discrimination.
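
One way to see why E-values discriminate poorly among strong matches is the standard Karlin-Altschul relation E = m × n × 2^(−S′), where S′ is the bit score, m the query length and n the database length. The sketch below assumes this relation. The query length, database size and bit scores are hypothetical values chosen for illustration, and the collapse to exactly zero shown here is a property of double-precision arithmetic in the sketch, used as a stand-in for E-values that are effectively zero.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Karlin-Altschul relation: E = m * n * 2 ** (-S'), S' the bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


QUERY_LEN = 300      # hypothetical query length in residues
DB_LEN = 10 ** 9     # hypothetical database size in residues

# Weak matches: both bit score and E-value separate the two hits.
print(expect_value(50, QUERY_LEN, DB_LEN), expect_value(60, QUERY_LEN, DB_LEN))

# Strong matches: the bit scores still differ by 200, but both E-values
# have collapsed to zero and can no longer be told apart.
print(expect_value(1200, QUERY_LEN, DB_LEN), expect_value(1400, QUERY_LEN, DB_LEN))
```

Because E shrinks exponentially as the bit score grows, two strong matches to different gene families can both have E-values indistinguishable from zero while their bit scores remain clearly separated, which is why a cut-off on the bit score is the more useful threshold.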