Epiphenotyping

Epiphenotyping is the study of the relationship between DNA methylation patterns and phenotypic traits in individuals and populations, with the aim of predicting a phenotype from a DNA methylation profile. The following sections cover the background of epiphenotyping, a general methodology, and its applications, advantages, and limitations.

Epigenetics refers to heritable changes in gene expression that do not involve changes in the underlying DNA sequence... DNA methylation is a key epigenetic mechanism in which methyl (-CH3) groups are added to specific DNA regions, almost always at cytosine-guanine dinucleotides (CpG sites). A CpG site is a DNA sequence in which a cytosine nucleotide is followed by a guanine nucleotide, the two being connected by a phosphate group.
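The definition of a CpG site above can be illustrated with a minimal Python sketch that scans one strand of a DNA sequence for a cytosine directly followed by a guanine. The function name and the example sequence are hypothetical and chosen only for illustration.

```python
def find_cpg_sites(sequence):
    """Return the 0-based position of every cytosine that is directly
    followed by a guanine (a CpG site) on the given strand."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i] == "C" and seq[i + 1] == "G"]


# Hypothetical example sequence, read 5' to 3'.
example = "ATCGGCGTACGCCG"
positions = find_cpg_sites(example)
print(positions)
```

Methylation arrays and bisulfite sequencing report a methylation level for positions of this kind, and those per-CpG levels are the input features used by the models described in the rest of this article.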

Background
The term epiphenotyping combines the words "epigenetics" and "phenotyping". It refers to the process of using genome-wide DNA methylation patterns to predict phenotypes. Using computational methods, epiphenotyping infers phenotypic traits such as gestational age, sex, cell composition, and genetic ancestry from DNA methylation data.

Epiphenotyping and epigenome-wide association studies (EWAS) are both approaches used in epigenetics research, but they focus on distinct aspects of epigenetic data analysis. Epiphenotyping is focused on inferring phenotypic information from epigenetic data and understanding the biological implications of epigenetic patterns, whereas EWAS is a hypothesis-driven approach centered on identifying specific epigenetic markers associated with a particular disease status, environmental factor, and/or phenotype.

The term "epiphenotyping" was first introduced in a paper in 2023 that evaluated the use of epiphenotyping in DNA methylation array studies of the human placenta. It is worth noting that although this specific term may be recent, the method itself of using DNA methylation data to predict phenotypes has been utilized since 2011.

Epiphenotyping workflow
The following section goes through the general methodology employed by studies that generate epiphenotyping models.

Data collection
Firstly, researchers extract and purify DNA from samples of interest (e.g., blood or placental tissues). The DNA samples are then assayed on high-throughput technologies such as DNA methylation arrays (e.g., Illumina Infinium MethylationEPIC (850K) array) or with whole genome bisulfite sequencing (WGBS) to collect DNA methylation data. In addition to collecting biological samples, key biological variables (e.g., gestational age at birth, sex, and self-reported ethnicity) and technical variables (e.g., processing time and temperature at which samples are stored) are also collected.

Preprocessing
The raw DNA methylation data undergo preprocessing to address technical variation and to filter out noisy or low-quality methylation probes. Normalization and batch correction also happen at this stage. R packages such as minfi, wateRmelon, and ewastools facilitate data quality control checks. For example, bisulfite conversion efficiency, array run quality, and average total fluorescence intensity are important measures to compare across samples. Data normalization algorithms (e.g., functional or quantile normalization) and probe filtering have been shown to reduce variability in biological and technical variables between sets of DNA methylation data. Principal component analysis (PCA) is often applied to reduce the dimensionality of the data before further analysis.
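As an illustration of the normalization step, the following is a minimal Python sketch of plain quantile normalization, which forces every sample to share the same distribution of values. It is a simplified textbook version and not the procedure implemented by minfi, wateRmelon, or ewastools. The function name, the data layout (one list of probe values per sample), and the handling of ties by sort order are assumptions made for this example.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples.

    samples: list of equal-length lists, one list of probe values per sample.
    Returns new lists in which every sample has the same set of values
    (the rank-wise mean across samples), assigned according to each
    sample's own probe ranking. Ties are broken by sort order.
    """
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must have the same number of probes")

    # Reference distribution: mean of the r-th smallest value across samples.
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[r] for s in sorted_samples) / len(samples) for r in range(n_probes)
    ]

    normalized = []
    for s in samples:
        # Probe indices ordered from the lowest to the highest value.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical values for two samples measured at three probes.
raw = [[0.1, 0.5, 0.9], [0.4, 0.2, 0.6]]
normalized = quantile_normalize(raw)
print(normalized)
```

After normalization both samples contain the same set of values, while the ordering of probes within each sample is preserved.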

Training model
In this stage, a model is developed to predict epiphenotypes from the DNA methylation data. A portion of the preprocessed and normalized dataset, known as the training dataset, is used to train the model, and the model's predictions are then compared with known phenotypic information for validation. Some epiphenotyping models use linear regression to identify CpGs that are predictors of the phenotype of interest, while newer models use machine learning algorithms, such as random forests or support vector machines, trained on the preprocessed DNA methylation data.
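The training and validation steps described above can be sketched in Python as follows. This is a deliberately small example under stated assumptions: the data are synthetic, only the single CpG most correlated with the phenotype is used, and the model is a one-variable linear regression. Published models typically combine many CpGs and may use penalized regression or the machine learning methods mentioned above. All names and numbers in the block are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - slope * mx, slope


# Synthetic data (assumption): CpG 0 tracks age, the other CpGs are noise.
rng = random.Random(0)
n_samples, n_cpgs = 100, 20
ages = [rng.uniform(20, 80) for _ in range(n_samples)]
betas = []
for age in ages:
    row = [0.2 + 0.005 * age + rng.gauss(0, 0.01)]
    row += [rng.uniform(0, 1) for _ in range(n_cpgs - 1)]
    betas.append(row)

# Split the samples into a training set and a held-out test set.
indices = list(range(n_samples))
rng.shuffle(indices)
train, test = indices[:70], indices[70:]


def column(rows, j, idx):
    """Values of CpG j for the samples listed in idx."""
    return [rows[i][j] for i in idx]


# Select the CpG and fit the model using the training samples only.
train_ages = [ages[i] for i in train]
best_cpg = max(
    range(n_cpgs), key=lambda j: abs(pearson(column(betas, j, train), train_ages))
)
intercept, slope = fit_line(column(betas, best_cpg, train), train_ages)

# Validate: compare predictions with known ages in the held-out samples.
test_errors = [abs(intercept + slope * betas[i][best_cpg] - ages[i]) for i in test]
test_mae = sum(test_errors) / len(test_errors)
print("selected CpG index:", best_cpg)
print("mean absolute error on test set:", round(test_mae, 2))
```

Selecting CpGs and fitting coefficients on the training samples only, and measuring error on samples the model has not seen, is what makes the reported error an estimate of performance on new data.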

Applying the model
Once the models have been tested and shown to have high predictive power, they can be applied to new DNA methylation data to infer epiphenotypes. Sometimes generating the epiphenotypes is the final step of a study, but other times the epiphenotypes are generated to be used for further analysis and association studies. Epiphenotypes can be included as covariates in other models that look for associations between phenotypes and DNA methylation patterns.

Further analysis can also be done on the estimated epiphenotypes to look for potential associations between epiphenotypes and specific biological functions or disease processes. For instance, there may be CpGs that are highly predictive of a phenotype which could indicate which genes are important for the development of that phenotype. An example of this is examining how blood or placental cell composition relates to preeclampsia, or how certain CpGs predictive of epigenetic age correlate with gestational age discrepancies.

Epigenetic clocks
The most common use of epiphenotyping is in epigenetic clocks, which predict the age of a biological sample based on DNA methylation. Epigenetic changes, including changes in DNA methylation (overall global hypomethylation), are associated with cellular aging.

Epigenetically predicted age generally increases with actual chronological age, but the rate of epigenetic aging can vary across tissues and between individuals. The deviation between the predicted age and the actual age is known as epigenetic age acceleration. Epigenetic age acceleration has been associated with several phenotypes, including obesity, lung cancer incidence, and traumatic stress.
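Epigenetic age acceleration as defined above can be computed in more than one way. The Python sketch below shows two: the simple difference between predicted and chronological age, and the residual from regressing predicted age on chronological age, a commonly used alternative that removes any overall trend with age. The function names and the example ages are hypothetical.

```python
def age_acceleration_difference(predicted, chronological):
    """Epigenetic age acceleration as predicted age minus chronological age."""
    return [p - c for p, c in zip(predicted, chronological)]


def age_acceleration_residual(predicted, chronological):
    """Epigenetic age acceleration as the residual of a least-squares
    regression of predicted age on chronological age."""
    n = len(predicted)
    mc = sum(chronological) / n
    mp = sum(predicted) / n
    slope = sum(
        (c - mc) * (p - mp) for c, p in zip(chronological, predicted)
    ) / sum((c - mc) ** 2 for c in chronological)
    intercept = mp - slope * mc
    return [p - (intercept + slope * c) for p, c in zip(predicted, chronological)]


# Hypothetical ages in years for four individuals.
chronological = [30, 40, 50, 60]
predicted = [33, 39, 55, 61]

differences = age_acceleration_difference(predicted, chronological)
residuals = age_acceleration_residual(predicted, chronological)
print(differences)
print([round(r, 2) for r in residuals])
```

A positive value indicates a sample that appears epigenetically older than expected, and a negative value one that appears younger. The residual-based values average to zero within the cohort by construction.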

First-generation epigenetic clocks
The first published epigenetic clock that used DNA methylation to predict chronological age came from the Bocklandt group in 2011. This clock was generated with saliva samples and was based on 3 CpGs found in 3 gene promoters (EDARADD, TOM1L1, and NPTX2). Two years later, Steve Horvath published a multi-tissue epigenetic clock based on 353 CpGs, developed using data from 51 different tissues and cell types.

Second-generation epigenetic clocks
Since 2013, many new epigenetic clocks have been published, with newer clocks including more CpG sites in their models and achieving higher predictive accuracy. Both intrinsic and extrinsic epigenetic clocks have also been developed.

Newer epigenetic clocks, known as the "second generation", incorporated additional clinical biomarkers (e.g., white blood cell count) during development so that they predict the phenotypic age, rather than the chronological age, of a sample from DNA methylation. The PhenoAge model, trained on whole blood samples, predicts phenotypic age using 513 CpGs and improves on the mortality prediction of "first generation" clocks such as Horvath's. GrimAge is another model, trained to use DNA methylation to predict the levels of 7 plasma proteins and lifespan (time-to-death). GrimAge has also been shown to be predictive of other phenotypes, such as an individual's time-to-coronary heart disease, time-to-cancer, time-to-menopause, and cognitive performance. Both PhenoAge and GrimAge were developed for whole blood, whereas other clocks have been developed for other tissues such as the brain and skeletal muscle.

Epigenetic clocks have also been developed to predict gestational age based on DNA methylation from cord blood or the placenta. Epigenetic age acceleration measured from placental samples has been associated with preeclampsia and maternal dyslipidemia.

Forensic applications
Epigenetic clocks have the potential to be used in forensic science for estimating the chronological age of a biological sample. They could be used to estimate the age of an unidentified body or of a perpetrator's biological sample, or to settle a legal dispute about an individual's age. Aside from age, a number of other phenotypes that could be useful in a forensic setting might be inferred from DNA methylation, including genetic ancestry, smoking, alcohol consumption, body size, and socioeconomic status.

Limitations in applying the technology to forensic science include the technical difficulty of generating comprehensive methylome profiles in most forensic laboratories, the poor representation of young and old individuals in most model datasets, ethical and legal issues, and low-quantity or low-quality DNA samples.

Epiphenotyping in oncology
Cancer cells are known to have altered DNA methylation patterns compared to normal cells. One application of epiphenotyping is to use models to predict an individual's risk of cancer, and in some cases DNA methylation patterns are used to diagnose certain cancer types. Epiphenotyping models for oncology have been developed for various cancers, including breast, prostate, neurological, and lung cancers. Another oncological application of epiphenotyping is predicting the tissue-of-origin of a cancer from DNA methylation.

Other applications
Aside from cancer, epiphenotyping has been used to predict an individual's risk for other diseases, including Alzheimer's disease and cardiovascular disease.

Another application of epiphenotyping is predicting cell-type composition within bulk tissue samples. Some epiphenotyping models also predict genetic ancestry or sex from global DNA methylation patterns.
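To illustrate how cell-type composition can be inferred from a bulk sample, the Python sketch below handles the simplest possible case: a mixture of two cell types whose reference methylation profiles are known. It assumes that the bulk methylation at each CpG is the proportion-weighted average of the two reference profiles, and estimates the proportion by least squares. Published methods handle more cell types and use constrained or regularized fitting. The function name and all reference values are hypothetical.

```python
def estimate_proportion(bulk, ref_a, ref_b):
    """Least-squares estimate of the fraction of cell type A in a bulk
    sample, assuming bulk = p * ref_a + (1 - p) * ref_b at every CpG.
    The estimate is clipped to the interval [0, 1]."""
    numerator = sum((m - b) * (a - b) for m, a, b in zip(bulk, ref_a, ref_b))
    denominator = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    if denominator == 0:
        raise ValueError("reference profiles are identical at all CpGs")
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical reference methylation levels at four cell-type-specific CpGs.
ref_a = [0.9, 0.1, 0.8, 0.2]
ref_b = [0.1, 0.7, 0.3, 0.6]

# Simulate a bulk sample that is 30% cell type A and 70% cell type B.
true_fraction = 0.3
bulk = [true_fraction * a + (1 - true_fraction) * b for a, b in zip(ref_a, ref_b)]

estimate = estimate_proportion(bulk, ref_a, ref_b)
print(round(estimate, 3))
```

The approach relies on CpGs whose methylation differs strongly between cell types. CpGs with similar levels in both references contribute little to the estimate.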

Advantages and limitations
Epiphenotyping models are only as good as the data used to train them. The age range, genetic ancestry make-up, and sex ratio of the training cohort can all affect a model's external validity, that is, how well it predicts on other data. Some epiphenotyping models work only, or work better, for certain age ranges or genetic ancestries; this is generally a result of the model being overfit to its training data.
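One simple way to check whether a model performs unevenly across groups is to compute its prediction error separately for each stratum of an independent dataset. The Python sketch below does this with the mean absolute error. The function name, the age strata, and all values are hypothetical.

```python
from collections import defaultdict


def mae_by_group(predicted, observed, groups):
    """Mean absolute error of the predictions within each group label."""
    errors = defaultdict(list)
    for p, o, g in zip(predicted, observed, groups):
        errors[g].append(abs(p - o))
    return {g: sum(e) / len(e) for g, e in errors.items()}


# Hypothetical predicted and observed ages, stratified by age range.
predicted = [31, 42, 48, 75, 68, 90]
observed = [30, 40, 50, 80, 75, 82]
groups = ["20-60", "20-60", "20-60", "60+", "60+", "60+"]

result = mae_by_group(predicted, observed, groups)
for group, mae in result.items():
    print(group, round(mae, 2))
```

A markedly larger error in one stratum, as in the older group of this made-up example, suggests that the model does not generalize well to samples unlike those it was trained on.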

Methylation risk scores (MRS)
While polygenic risk scores (PRS) represent an estimate of an individual's phenotype based on many genetic variants, methylation risk scores (MRS) similarly provide an estimate of an individual's phenotype based on the methylation at many CpGs. PRS are generated through genome-wide association studies (GWAS), and analogously MRS are generated through EWAS. Some use the term MRS in a similar way to the term epiphenotyping, that is, to mean predicting a phenotype from DNA methylation data.
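By analogy with polygenic risk scores, a methylation risk score is often computed as a weighted sum of methylation levels, with one weight per CpG taken from an association study. The Python sketch below assumes that form. The function name, the CpG identifiers, and the weights are hypothetical placeholders, not values from any published score.

```python
def methylation_risk_score(betas, weights):
    """Weighted sum of methylation levels.

    betas:   dict mapping CpG identifier -> methylation level (0 to 1)
             for one individual.
    weights: dict mapping CpG identifier -> effect-size weight.
    Every CpG in `weights` must be present in `betas`.
    """
    missing = [cpg for cpg in weights if cpg not in betas]
    if missing:
        raise KeyError("methylation values missing for: " + ", ".join(missing))
    return sum(weights[cpg] * betas[cpg] for cpg in weights)


# Hypothetical weights and one individual's methylation levels.
weights = {"cpg_A": -2.0, "cpg_B": 1.5, "cpg_C": 0.5}
sample = {"cpg_A": 0.25, "cpg_B": 0.5, "cpg_C": 1.0, "cpg_D": 0.9}

score = methylation_risk_score(sample, weights)
print(score)
```

CpGs measured in the sample but absent from the weights, such as `cpg_D` here, do not contribute. Raising an error for missing CpGs, rather than silently skipping them, is a design choice made for this sketch so that incomplete data are not scored unnoticed.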

Machine learning classification models such as random forests identify discriminatory features in a dataset to generate MRS. For example, MRS have been developed to distinguish smokers from non-smokers. Alternatively, an MRS can be used as a covariate in other analyses to adjust for that phenotype and reduce confounding effects. For example, smoking status predicted from DNA methylation was used as a covariate in an EWAS that identified associations between DNA methylation and schizophrenia.