Disease informatics

Disease informatics (also called infectious disease informatics) studies the production, sharing, modeling, and management of knowledge about infectious diseases. The field has grown as a by-product of the rapid increase in widely available biomedical and clinical data, and of the resulting demand for useful analyses of that data.

Because infectious diseases contribute to millions of deaths every year, the ability to identify and understand how diseases spread is crucial for applying control and prevention measures. The knowledge gained by researchers in disease informatics can aid policymakers' decisions on issues such as spreading public awareness, updating the training of health professionals, and buying vaccines.

Aside from aiding policymakers' decisions, the goals of disease informatics include increased identification of biomarkers for transmissibility, improved vaccine design, a deeper understanding of host-pathogen interactions, and the optimization of antimicrobial development.

Artificial intelligence
The use of artificial intelligence (AI) tools, such as machine learning and natural language processing (NLP), in disease informatics increases efficiency by automating and speeding up several data analysis processes. Advances in AI and the increased accessibility of data aid predictive modeling and public health surveillance. Predictive models built with AI examine vast data sets and forecast future outcomes, improving the ability to predict disease outbreaks and helping to guide public health responses. AI is also valuable when its spatial modeling is combined with geographic information system (GIS) data to uncover geographical patterns (for example disease clusters), supporting data-driven decision-making for local-level predictions of disease diffusion. As AI continues to develop, further advances in its use in disease informatics are expected.
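As a rough illustration of how case locations might be screened for geographical clusters, the sketch below bins coordinates into a regular grid and reports cells whose case count reaches a threshold. The function names, the cell size, the threshold, and the coordinates are all assumptions made for illustration; real GIS-based analyses typically use dedicated spatial statistics rather than simple binning.

```python
from collections import Counter

def grid_cell(lat, lon, cell_size_deg=0.1):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (int(lat // cell_size_deg), int(lon // cell_size_deg))

def find_clusters(cases, cell_size_deg=0.1, min_cases=3):
    """Return {cell: case_count} for cells holding at least min_cases cases."""
    counts = Counter(grid_cell(lat, lon, cell_size_deg) for lat, lon in cases)
    return {cell: n for cell, n in counts.items() if n >= min_cases}

# Hypothetical case locations as (latitude, longitude) pairs.
cases = [
    (40.71, -74.01), (40.72, -74.02), (40.73, -74.03), (40.75, -74.05),
    (41.55, -73.25), (42.35, -71.05),
]
clusters = find_clusters(cases)
```

Here the first four cases fall into the same cell and are reported as one candidate cluster, while the two isolated cases are ignored.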

Machine learning
Machine learning (ML) techniques aid disease informatics through their capability to predict, spatially and temporally, the progression and transmission of infectious diseases. ML algorithms are used to analyze large, complex data sets and identify patterns across varying types of data, such as demographics, electronic health records, and environmental conditions. Commonly used ML techniques include decision trees, random forests, support vector machines, and deep learning networks. Researchers can apply these tools to data sets (for example genomic data, social media posts, and health records) to predict the potential sources of an outbreak, the likelihood of an individual contracting a certain disease, and the number of cases of a disease in a given region. According to numerous studies, ML models have proven to be as accurate as traditional statistical methods at predicting the onset and spread of diseases, especially when multiple ML models are used concurrently.
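To make the idea of a decision tree concrete, the following minimal sketch trains a decision stump (a tree with a single split) by trying every observed threshold on every feature and keeping the split with the fewest misclassifications. The features (age and body temperature), the labels, and the function names are hypothetical; production work would normally rely on an established ML library and far more data.

```python
def train_stump(samples, labels):
    """Return the (feature_index, threshold) split with the fewest training errors.

    The stump predicts 1 when sample[feature_index] >= threshold, else 0.
    """
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            errors = sum(
                (s[feature] >= threshold) != bool(y)
                for s, y in zip(samples, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold)
    _, feature, threshold = best
    return feature, threshold

def predict(stump, sample):
    """Apply a trained stump to one sample."""
    feature, threshold = stump
    return int(sample[feature] >= threshold)

# Hypothetical training data: (age in years, body temperature in deg C).
samples = [(25, 36.6), (60, 38.4), (34, 36.9), (45, 38.9), (52, 37.0), (29, 38.1)]
labels = [0, 1, 0, 1, 0, 1]  # 1 = infected, 0 = not infected

stump = train_stump(samples, labels)
```

On this toy data the stump splits on temperature, since that feature separates the two classes perfectly while age does not. Random forests extend the same idea by combining many deeper trees trained on resampled data.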

Text mining
Text mining has become a useful way to query large amounts of data to aid in gene mapping and genome analysis. It allows medical databases to be queried for tasks such as genomic mapping, integrating genomic and proteomic data to map genes and highlight their relationships with various diseases. Data on targeted sequences can be retrieved in two ways: by similarity search or by keyword search. A similarity search (using software such as BLAST) takes a known sequence as the query and searches for sequences similar to it. A keyword search (public tools include SRS, Entrez, and ACNUC) uses annotations that describe the features of genes, such as sequence positions, to retrieve the desired gene sequences.
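The difference between the two retrieval styles can be sketched on a toy in-memory database. The similarity measure below (Jaccard overlap of 3-letter subsequences) is a deliberately simple stand-in and is not the algorithm BLAST uses; the records, sequences, and function names are invented for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def similarity_search(query, database, k=3):
    """Return ids of records sharing k-mers with the query, best match first."""
    scored = [(similarity(query, rec["sequence"], k), rec["id"]) for rec in database]
    return [rec_id for score, rec_id in sorted(scored, reverse=True) if score > 0]

def keyword_search(keyword, database):
    """Return ids of records whose annotation contains the keyword."""
    kw = keyword.lower()
    return [rec["id"] for rec in database if kw in rec["annotation"].lower()]

# Hypothetical records; the sequences are made up, not real genes.
records = [
    {"id": "seq1", "sequence": "ATGCGTACGTTAG", "annotation": "hemagglutinin gene, influenza A"},
    {"id": "seq2", "sequence": "ATGCGTACGATAG", "annotation": "neuraminidase gene, influenza A"},
    {"id": "seq3", "sequence": "GGGGCCCCGGGGC", "annotation": "hypothetical protein"},
]

by_similarity = similarity_search("ATGCGTACGTTAG", records)
by_keyword = keyword_search("neuraminidase", records)
```

The similarity search needs only a query sequence and ranks records by shared content, whereas the keyword search never inspects the sequences and depends entirely on the quality of the annotations.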

Syndromic surveillance
Through a process called syndromic surveillance, a form of public health surveillance, data analysis methods can be used to predict potential disease outbreaks by detecting timely, pre-diagnosis health indicators. Syndromic surveillance combines demographic data (age, gender, ethnicity, etc.) with patient visit data (admission status, chief complaint, type of office visit, etc.), which can be put through natural language processing to highlight potential predictors of an outbreak. Because predicting possible outbreaks is time-sensitive, chief complaint data is valuable: it is available much more quickly than formal diagnosis data from physicians' offices. The key to successfully harnessing surveillance data for disease informatics is to use more than one source. Other important sources that are commonly used alongside these include the following:


 * Over-the-counter drug (OTC) sales
 * Hospital admissions
 * Absenteeism rates from schools and workplaces
 * Lab test orders
 * Poison control centers' communications
 * Case report numbers
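One simple way such indicators could be monitored is to compare each day's count against a recent baseline and raise an alert when the count is unusually high. The sketch below uses a moving-window z-score; the window length, the threshold, the floor on the standard deviation, and the daily counts are illustrative assumptions rather than values from any particular surveillance system.

```python
from statistics import mean, stdev

def flag_aberrations(daily_counts, baseline_days=7, z_threshold=3.0):
    """Return indices of days whose count is far above the preceding baseline."""
    alerts = []
    for day in range(baseline_days, len(daily_counts)):
        baseline = daily_counts[day - baseline_days:day]
        mu = mean(baseline)
        # Floor the spread at 1.0 so a very flat baseline does not trigger
        # alerts on trivially small increases (an assumed design choice).
        sigma = max(stdev(baseline), 1.0)
        if (daily_counts[day] - mu) / sigma > z_threshold:
            alerts.append(day)
    return alerts

# Hypothetical daily counts of visits with a respiratory chief complaint.
counts = [12, 10, 11, 13, 9, 12, 11, 10, 12, 30, 11]
alerts = flag_aberrations(counts)
```

The same function could be run separately on each data source, for example over-the-counter drug sales or school absenteeism, so that alerts appearing in several sources at once can be given more weight.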

Accessibility concerns
The accuracy of these AI tools and techniques depends on providing them with high-quality, comprehensive data. Accessing and collecting such data remains an ongoing challenge because most of the data pulled is incomplete, noisy, and contains human errors (e.g. in grammar, abbreviations, and spelling), which means the data must undergo thorough data cleansing before it can be used.
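A minimal sketch of what such cleaning might look like for free-text chief complaints is shown below. The abbreviation and misspelling tables are small hypothetical examples; a real pipeline would need much larger, clinically reviewed vocabularies.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {
    "sob": "shortness of breath",
    "n/v": "nausea and vomiting",
    "abd": "abdominal",
}
MISSPELLINGS = {"feaver": "fever", "diarhea": "diarrhea", "caugh": "cough"}

def clean_complaint(text):
    """Lowercase a complaint, drop punctuation, fix known typos, expand abbreviations."""
    tokens = re.findall(r"[a-z0-9/]+", text.lower())
    cleaned = []
    for token in tokens:
        token = MISSPELLINGS.get(token, token)
        token = ABBREVIATIONS.get(token, token)
        cleaned.append(token)
    return " ".join(cleaned)

example = clean_complaint("  SOB, Feaver & caugh!! ")
```

Normalizing text in this way lets differently written entries for the same complaint be counted together in later analysis.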

The data collected also comes from numerous sources (due to differences in data availability and governance) that use varying formats and software, which creates a need for standardized infrastructure to better integrate and manage the data. A standardized taxonomy for data analysis and predictive modeling would facilitate research collaboration, accelerate decisions, and help select the right predictive models.

One method in use is federated learning, which allows a model to be trained across multiple centers without sharing raw data, so the data stays safely at its source. However, differences in formatting and software between centers still affect this approach and can hinder model convergence, so algorithmic improvements are needed.
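The core idea can be sketched with a federated-averaging loop: each site improves the shared model on its own data, and only the resulting model parameter, never the raw records, is sent back to be averaged. The one-parameter model y = weight * x, the learning rate, the number of rounds, and the site data below are all simplifying assumptions for illustration.

```python
def local_update(weight, data, lr=0.01, epochs=20):
    """Refine the shared weight with gradient descent on one site's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = weight * x.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

def federated_average(site_datasets, rounds=10, initial_weight=0.0):
    """Train across sites; only weights leave each site, never raw data."""
    weight = initial_weight
    total = sum(len(data) for data in site_datasets)
    for _ in range(rounds):
        local_weights = [local_update(weight, data) for data in site_datasets]
        # Average the local weights, weighting each site by its sample count.
        weight = sum(w * len(d) for w, d in zip(local_weights, site_datasets)) / total
    return weight

# Hypothetical (x, y) observations held privately at three centers.
site_a = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
site_b = [(1.5, 3.0), (2.5, 5.1)]
site_c = [(0.5, 1.0), (4.0, 7.9), (3.5, 7.1), (2.0, 4.0)]

global_weight = federated_average([site_a, site_b, site_c])
```

In this toy setting every site records data in the same form, which is exactly what cannot be assumed in practice: the sketch sidesteps the formatting and convergence problems described above.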

Another concern is the potential for bias and overfitting in predictive models, which could lead to inaccurate predictions. Human error can persist even when these tools automate tasks, because AI tools that are trained incorrectly will produce inaccurate output. One relevant study suggests that combining AI with wearable devices and other emerging technology could ease some of these challenges by providing real-time data for the models to use. This could make the raw data more accurate, reduce the time spent cleaning it, and allow the models to make more accurate predictions.

Ethical concerns
A critical concern in using AI and predictive modeling in disease informatics is data security and privacy. The data sources being used (electronic health records, demographics, etc.) contain highly sensitive information that must be protected for all parties involved. Any models or techniques used must comply with applicable regulations and laws, such as HIPAA in the United States. The data must also undergo rigorous anonymization and de-identification to protect patient privacy.
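As a rough sketch of what de-identification can involve, the example below drops direct identifiers, replaces the patient identifier with a keyed hash, and coarsens the age and visit date. The field names, the key, and the record are hypothetical, and applying these steps alone should not be assumed to satisfy HIPAA or any other regulation.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be stored securely, not in source code.
SECRET_KEY = b"replace-with-a-securely-stored-key"

# Assumed names of fields that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}

def pseudonymize(patient_id, key=SECRET_KEY):
    """Replace an identifier with a keyed hash so records can still be linked."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def age_band(age, width=10):
    """Coarsen an exact age into a band; ages of 90 and over share one band."""
    if age >= 90:
        return "90+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def deidentify(record):
    """Return a copy of the record with identifying detail removed or coarsened."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["patient_id"] = pseudonymize(record["patient_id"])
    out["age"] = age_band(record["age"])
    out["visit_date"] = record["visit_date"][:4]  # keep only the year
    return out

# Hypothetical visit record with an ISO-formatted date.
record = {
    "patient_id": "MRN-001234",
    "name": "Jane Doe",
    "age": 34,
    "visit_date": "2023-03-14",
    "chief_complaint": "fever and cough",
}
safe_record = deidentify(record)
```

Using a keyed hash rather than a plain hash means the pseudonyms cannot be reproduced by someone who knows the original identifiers but not the key.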

Through the further use and growth of explainable AI (XAI), researchers and all parties involved can ensure transparency and accountability when using data analysis and computational methods in disease informatics. XAI provides explanations of how the algorithms in use work, why they were chosen, what knowledge they produce, and so on.