Literature-based discovery

Literature-based discovery (LBD), also called literature-related discovery (LRD), is a form of knowledge extraction and automated hypothesis generation that uses papers and other academic publications (the "literature") to find new relationships between existing knowledge (the "discovery"). It aims to discover new knowledge by connecting information that has been explicitly stated in the literature to deduce connections that have not been explicitly stated.

LBD can help researchers quickly discover and explore hypotheses, gain information on relevant advances inside and outside their niches, and increase interdisciplinary information sharing.

The most basic and widespread type of LBD is called the ABC paradigm because it centers around three concepts called A, B and C. It states that if there is a connection between A and B and one between B and C, then there is one between A and C which, if not explicitly stated, is yet to be explored.
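The ABC inference can be illustrated with a minimal Python sketch. It assumes the literature has already been reduced to a list of undirected concept pairs; the function name `abc_candidates` and the input format are hypothetical and chosen only for illustration. The sample pairs are the fish oil and Raynaud syndrome case from Swanson's work.

```python
from collections import defaultdict


def abc_candidates(relations, a):
    """Return {C: set of linking Bs} for C terms not directly related to `a`."""
    # Build an undirected adjacency map from the stated relations.
    neighbours = defaultdict(set)
    for x, y in relations:
        neighbours[x].add(y)
        neighbours[y].add(x)

    direct = set(neighbours[a])  # the B terms explicitly linked to A
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbours[b]:
            # Keep only C terms with no explicitly stated link to A.
            if c != a and c not in direct:
                candidates[c].add(b)
    return dict(candidates)


relations = [
    ("Raynaud syndrome", "blood viscosity"),
    ("blood viscosity", "fish oil"),
]
hypotheses = abc_candidates(relations, "Raynaud syndrome")
print(hypotheses)
```

Each returned C term is only a candidate hypothesis; the sketch makes no claim about whether the A-C relationship actually holds.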

History
The LBD technique was pioneered by Don R. Swanson in the 1980s. He hypothesized that the combination of two separately published results, one indicating an A-B relationship and the other a B-C relationship, is evidence of an A-C relationship that is unknown or unexplored. He used this to propose fish oil as a treatment for Raynaud syndrome due to their shared relationship with blood viscosity. This hypothesis was later shown to have merit in a prospective study, and he went on to propose other discoveries using similar methods.

Swanson linking
Swanson linking is a term proposed in 2003 that refers to connecting two pieces of knowledge previously thought to be unrelated. For example, it may be known that illness A is caused by chemical B, and that drug C is known to reduce the amount of chemical B in the body. However, because the respective articles were published separately from one another (called "disjoint data"), the relationship between illness A and drug C may be unknown. Swanson linking aims to find these relationships and report them.

Although the ABC paradigm is widely used, critics have argued that much of science is not captured in simple assertions and is instead built from analogies and images at a higher level of abstraction.

Systems
LBD generally comes in two flavours: open and closed discovery. In open discovery, only A is given. The approach finds Bs and uses them to return possibly interesting Cs to the user, thus generating hypotheses from A. In closed discovery, both A and C are given, and the approach seeks the Bs that can link the two, thus testing a hypothesis about A and C.
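Closed discovery can be sketched as a set intersection: collect the terms related to A, collect the terms related to C, and keep those that appear in both. The function name `linking_terms`, the pair-list input and the placeholder labels below are assumptions made for illustration, not part of any particular LBD system.

```python
def linking_terms(relations, a, c):
    """Closed discovery: return the B terms related to both `a` and `c`."""
    related_to_a = set()
    related_to_c = set()
    for x, y in relations:
        # Relations are treated as undirected, so check both positions.
        for term, bucket in ((a, related_to_a), (c, related_to_c)):
            if x == term:
                bucket.add(y)
            elif y == term:
                bucket.add(x)
    return related_to_a & related_to_c


# Placeholder concept labels, not real literature data.
relations = [
    ("A", "B1"),
    ("A", "B2"),
    ("B1", "C"),
    ("B3", "C"),
    ("B2", "D"),
]
links = linking_terms(relations, "A", "C")
print(links)
```

Here only B1 is related to both A and C, so it is the single linking term returned.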

A number of systems to perform literature-based discovery have been developed over the years, extending the original idea of Don Swanson, and the evaluation of the quality of such systems is an active area of research. Some systems include web versions for increased user-friendliness. A common approach to many systems is the use of MeSH terms to represent scientific articles. This is used by the systems Manjal, BITOLA and LitLinker.

One well-known system within the field is called Arrowsmith and is tailored to find connections between two disjoint sets of articles, an approach labeled "two-node" search.

Another well-known system, LION LBD, uses PubTator for annotating PubMed scientific articles with concepts such as chemicals, genes/proteins, mutations, diseases and species; as well as sentence-level annotation of cancer hallmarks that describe fundamental cancer processes and behaviour. It uses co-occurrence metrics to rank relations between concepts and performs both open and closed discovery.
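As an illustration of ranking by co-occurrence, the sketch below scores concept pairs with pointwise mutual information (PMI) computed over documents. PMI is used here only as one common co-occurrence metric; it is an assumption for this example and not a statement about which metrics LION LBD implements. The function name and input format are likewise hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Score each co-occurring concept pair by document-level PMI (log base 2)."""
    n_docs = len(documents)
    term_counts = Counter()
    pair_counts = Counter()
    for concepts in documents:
        unique = sorted(set(concepts))  # count each concept once per document
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    scores = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / n_docs
        p_x = term_counts[x] / n_docs
        p_y = term_counts[y] / n_docs
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Each inner list stands for the concepts annotated in one document (toy data).
documents = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
scores = pmi_scores(documents)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Pairs that co-occur more often than their individual frequencies would suggest receive positive scores and rise to the top of the ranking.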

While many LBD systems are based on traditional statistical methods, others leverage machine learning methods such as neural networks. Some LBD systems represent the connections between concepts as a knowledge graph and thus employ techniques from graph theory. The graph-based representation is also the foundation for LBD systems that employ graph databases like Neo4j, enabling discovery via graph query languages such as Cypher.

Graph-based LBD systems represent the relations between concepts using different relation types, such as those in the UMLS Semantic Network. Some approaches go further and try to apply contextualized relations, an approach also used by the Gene Ontology for its Causal Activity Modeling (GO-CAM).

Use of databases
Besides extracting information from the body of scientific articles, LBD systems often employ structured knowledge from biocurated biological resources, like the Online Mendelian Inheritance in Man (OMIM).

List of systems
These are the published LBD systems, ordered by date of publication:


 * 1986 - Arrowsmith
 * 2000 - BITOLA V1
 * 2001 - DAD
 * 2003 - LitLinker
 * 2004 - ACS
 * 2004 - Manjal
 * 2004 - IRIDESCENT
 * 2005 - BITOLA V2
 * 2006 - LitLinker V2
 * 2007 - Arrowsmith V2
 * 2008 - Anni 2.0
 * 2008 - CoPub Discovery
 * 2009 - RajoLink
 * 2010 - Sem-BT
 * 2015 - Obvio
 * 2016 - Spark
 * 2017 - Mine the gap
 * 2019 - LION LBD

Semantic typing
A common task in literature-based discovery is assigning words or concepts to semantic types. A concept might be classified under one type or multiple types. For example, in the Unified Medical Language System (UMLS) the term migraine is classified under the type disease and syndrome, while the term magnesium is under two types: biologically active substance and element, ion, or isotope. The typing of concepts hones the discovery of connections between particular classes of concepts, e.g. diseases-genes or diseases-drugs.
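A minimal sketch of how semantic types can narrow a candidate list is shown below. The hand-written dictionary stands in for a real terminology lookup and contains only the two UMLS examples given above; the function name `filter_by_type` is hypothetical.

```python
# Toy lookup table; a real system would query a terminology resource instead.
SEMANTIC_TYPES = {
    "migraine": {"disease and syndrome"},
    "magnesium": {"biologically active substance", "element, ion, or isotope"},
}


def filter_by_type(concepts, allowed_types, semantic_types=SEMANTIC_TYPES):
    """Keep the concepts that have at least one of the allowed semantic types."""
    return [
        concept
        for concept in concepts
        if semantic_types.get(concept, set()) & allowed_types
    ]


candidates = ["migraine", "magnesium", "untyped term"]
diseases = filter_by_type(candidates, {"disease and syndrome"})
substances = filter_by_type(candidates, {"biologically active substance"})
print(diseases, substances)
```

Concepts with no known type are dropped by this sketch, which is a design choice; a system could equally keep them and flag them for review.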

System evaluation
The evaluation of literature-based discoveries is challenging and includes both experimental and in silico methods. Evaluation methods try to quantify the knowledge generated by a system, which should be sufficient in amount and richness to be useful for scientists.

Evaluation is difficult in LBD for several reasons: disagreement about the role of LBD systems in research and thus what makes a successful one; difficulty in determining how useful, interesting or actionable a discovery is; and difficulty in objectively defining a ‘discovery’, which hinders the creation of a standard evaluation set which quantifies when a discovery has been replicated or found.

A popular method used in LBD is to replicate previous discoveries. These are usually LBD-based discoveries as they are relatively easy to quantify compared to other discoveries. There are only a handful of such discoveries and approaches tuned to perform well on these discoveries might not generalise. In this type of evaluation, the literature before the discovery to be replicated is used to generate a ranked list of discovery candidates as target or linking terms. Success is measured by reporting the rank of the term(s) of interest; the higher the rank, the better the approach.

Literature- or time-slicing involves splitting the existing literature at a point in time. The LBD system is then exposed to the literature before the split and is evaluated by how many of the discoveries in the later period it can discover. LBD systems have used term co-occurrences, relationships from external biomedical resources (e.g. SemMedDB) and semantic relationships to generate the gold standards. A high-precision approach is to use expert opinion to generate the gold standard, but this is time-consuming, expensive and tends to produce low recall rates.
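The split itself can be sketched as follows, assuming the literature has been reduced to (year, term, term) co-occurrence records and that a pair counts as a later discovery if it first appears at or after the cutoff. The record format, function name and placeholder labels are assumptions for illustration.

```python
def time_slice(records, cutoff_year):
    """Split (year, term_x, term_y) records at `cutoff_year`.

    Returns the pairs known before the cutoff and the gold standard:
    pairs that first appear at or after the cutoff.
    """
    known, later = set(), set()
    for year, x, y in records:
        pair = frozenset((x, y))  # undirected pair
        if year < cutoff_year:
            known.add(pair)
        else:
            later.add(pair)
    return known, later - known


# Placeholder records, not real literature data.
records = [
    (1980, "A", "B"),
    (1982, "B", "C"),
    (1990, "A", "C"),
    (1991, "A", "B"),
]
known, gold = time_slice(records, cutoff_year=1985)
print(known, gold)
```

An LBD system would then be run on `known` only, and its predictions compared against `gold`.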

The advantage of time-slicing in comparison to the replication of previous discoveries is the evaluation on a large number of test instances. This raises the need for evaluation metrics which can quantify performance on large, ranked lists. LBD works have used metrics popular in Information Retrieval which include Precision, Recall, Area Under the Curve (AUC), Precision at k, Mean Average Precision (MAP) and others.
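Two of these metrics, Precision at k and average precision, can be sketched for a single ranked list as below. Mean Average Precision is the mean of the average precision over several such lists. The function names and the toy ranking are illustrative assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant (k must be positive)."""
    top_k = ranked[:k]
    return sum(1 for item in top_k if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    # Relevant items missing from the ranking contribute zero.
    return total / len(relevant) if relevant else 0.0


ranked = ["c1", "c2", "c3", "c4"]  # system output, best candidate first
relevant = {"c1", "c3"}            # gold-standard discoveries
p_at_2 = precision_at_k(ranked, relevant, k=2)
ap = average_precision(ranked, relevant)
print(p_at_2, ap)
```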

Proposing new discoveries or treatments goes beyond replicating past discoveries or predicting time-sliced instances of a particular relationship, and shows that a system is capable of being used in realistic situations. This is usually accompanied by peer-reviewed publication in the domain or vetting by a domain expert.

Text mining
The automation of literature-based discovery relies heavily on text mining.

The language in scientific articles often includes ambiguities, and an important step for coherent parsing of the literature is determining the sense of each term in the context in which it is used, a task called word-sense disambiguation (WSD). For example, gene symbols like CT (PCYT1A) and MR (NR3C2) can be confused with the acronyms for computed tomography and magnetic resonance, requiring sophisticated disambiguation systems. Terms are often reconciled to ontologies or other sources of unique identifiers, such as the Unified Medical Language System (UMLS). This process of mapping multiple different utterances to a single name or identifier is called normalization.
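A deliberately simple sketch of context-based disambiguation is given below: each sense of an ambiguous term is associated with a set of cue words, and the sense whose cues overlap most with the sentence wins. The cue words are invented for this example, and real systems use far richer models; this only illustrates the idea of choosing a sense from context.

```python
import re

# Hypothetical cue words for each sense of the ambiguous abbreviation "CT".
SENSES = {
    "CT": {
        "PCYT1A (gene)": {"gene", "expression", "enzyme", "mutation"},
        "computed tomography": {"scan", "imaging", "patient", "radiology"},
    },
}


def disambiguate(term, sentence, senses=SENSES):
    """Return the sense whose cue words overlap most with the sentence, or None."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    candidates = senses.get(term, {})
    best_sense, best_overlap = None, 0
    for sense, cues in candidates.items():
        overlap = len(cues & words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense  # None when no cue word matches


imaging = disambiguate("CT", "The patient underwent a CT scan.")
gene = disambiguate("CT", "CT expression was reduced by the mutation.")
print(imaging, gene)
```

Mapping the chosen sense to a unique identifier would then be the normalization step.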

Life sciences
LBD has already been used in different ways to identify new connections between biomedical entities and new candidate genes and treatments for illnesses.

Drug discovery
LBD has seen use in drug development and repurposing as well as predicting adverse drug reactions.

The method of literature-based discovery has been used to search for treatments for a number of human diseases, including:


 * diabetic retinopathy
 * dilated cardiomyopathy
 * Parkinson's disease
 * prostate cancer
 * gastric cancer
 * multiple sclerosis

Gene and protein function discovery
The approach has also been used to propose relations of genes with particular diseases, like breast cancer.

In the context of systems vaccinology, it was used to identify proteins that are related to interferon gamma and play a role in the response to vaccines.

It has also been used to propose mechanisms for currently used drugs.

Biomarker discovery
LBD has been explored as a tool to identify diagnostic and prognostic biomarkers for diseases, e.g. for the risk of type 2 diabetes.

Other uses
Besides providing scientific hypotheses about the world, LBD has also been used to improve data analysis, via the automatic identification of possible confounding factors using the medical literature.

It has also been used to better understand disease etiology and the relations between different diseases, for example by looking for the genes connecting myocardial infarction and depression, and for connections between psychiatric and somatic diseases.

Beyond life sciences
LBD has mostly been deployed in the biomedical domain, but it has also been applied outside of it, including research into developing water purification systems, accelerating the development of developing countries and identifying promising research collaborations.

Additional reading

 * Wilson, Patrick (1977). Public Knowledge, Private Ignorance: Toward a Library and Information Policy. Greenwood Publishing Group. p. 156. ISBN 0-8371-9485-7.