Covariate shift

Covariate shift is a phenomenon in machine learning and statistics in which the distribution of input features (covariates) changes between the training and test datasets, typically degrading the performance of a machine learning model. It is a common challenge in real-world applications, because models are often trained on historical data and expected to generalize to new, unseen data. Covariate shift can lead to decreased model performance or even model failure, as it violates the assumption that training and test data follow the same distribution.

Covariate shift is also referred to as domain shift and is a special case of dataset shift where only the covariates (inputs) are changing. That is, only $$P(X)$$ changes. This is distinct from both label shift (where $$P(Y)$$ changes) and concept drift (where $$P(Y|X)$$ changes).

Mathematical definition
Pure covariate shift occurs when the distribution of input features changes between the training and test data, while the conditional distribution of the target variable given the input features remains the same. Let $$P_{train}(X)$$ denote the distribution of input features in the training data and $$P_{test}(X)$$ denote the distribution in the test data. Covariate shift is defined as:

$$ P_{train}(X) \neq P_{test}(X) \quad \text{and} \quad P_{train}(Y|X=x) = P_{test}(Y|X=x), \; \forall x \in \mathcal{X} $$

where $$ X $$ represents the input features, $$ Y $$ represents the target variable, and $$ \mathcal{X} $$ is the feature space.
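A small simulation can make the definition concrete. The sketch below (using NumPy; the quadratic relationship and the distribution parameters are illustrative assumptions, not from any particular dataset) keeps $$P(Y|X)$$ fixed while shifting $$P(X)$$, and shows the resulting jump in error for a model fit only on the training region:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same conditional P(Y|X) in both splits: y = x^2 + noise.
def generate(x):
    return x**2 + rng.normal(0, 0.1, size=x.shape)

# Covariate shift: P(X) differs between training and test data.
x_train = rng.normal(0.0, 1.0, 500)   # training inputs centered at 0
x_test = rng.normal(3.0, 1.0, 500)    # test inputs shifted to 3
y_train, y_test = generate(x_train), generate(x_test)

# Fit a linear model on the training region only.
coeffs = np.polyfit(x_train, y_train, deg=1)

def mse(x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

mse_train, mse_test = mse(x_train, y_train), mse(x_test, y_test)
print(mse_train, mse_test)  # test error is far larger under the shift
```

Even though the relationship between $$X$$ and $$Y$$ never changes, the model extrapolates poorly into the region of feature space it never saw during training.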

Measuring covariate shift
Covariate shift is usually measured using statistical distances, divergences, and two-sample tests. Some measurement methods work on continuous features, others on categorical features, and some on both. Additionally, some methods measure univariate drift (one feature at a time), while others measure multivariate drift (across several features jointly).

Statistical distances

 * Maximum mean discrepancy (MMD) (Continuous): MMD is a kernel-based method that measures the distance between two probability distributions by comparing the means of their samples in a reproducing kernel Hilbert space. MMD provides a symmetric, non-negative measure of the difference between the training and test distributions, with higher values indicating a greater degree of covariate shift.
 * Wasserstein distance (Continuous and Categorical): Also known as the Earth mover's distance, the Wasserstein distance quantifies the difference between two probability distributions by measuring the minimum cost required to transform one distribution into the other. This metric provides a symmetric and non-negative measure of the divergence between the training and test distributions, with higher values indicating a more substantial degree of covariate shift.
 * Hellinger distance (Continuous and Categorical): The Hellinger distance is another symmetric measure of the difference between two probability distributions. It is derived from the Bhattacharyya coefficient, a measure of the similarity between two probability distributions. The Hellinger distance is defined as the square root of the sum of the squared differences between the square roots of the probabilities in the two distributions. Like other statistical distances, the Hellinger distance is non-negative, with higher values indicating a more significant divergence between the training and test distributions.
 * Jensen-Shannon Distance (Continuous and Categorical): The Jensen-Shannon Distance is derived from the JS divergence by applying a transformation to obtain a true distance metric that satisfies the properties of non-negativity, identity of indiscernibles, symmetry, and triangle inequality. Specifically, the Jensen-Shannon Distance is defined as the square root of the JS Divergence: $$JSDistance(P,Q) = \sqrt{JSDivergence(P,Q)}$$
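As a rough illustration, several of the distances above can be estimated from samples with SciPy and NumPy. This is a minimal sketch under assumed Gaussian samples; the Jensen-Shannon and Hellinger distances are computed from histogram-based probability estimates, and the bin count is an arbitrary choice:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 2000)
test = rng.normal(0.5, 1.0, 2000)   # shifted test distribution

# The Wasserstein distance works directly on the raw samples.
wd = wasserstein_distance(train, test)

# JS and Hellinger distances need binned probability estimates.
bins = np.histogram_bin_edges(np.concatenate([train, test]), bins=30)
p, _ = np.histogram(train, bins=bins)
q, _ = np.histogram(test, bins=bins)
p = p / p.sum()
q = q / q.sum()

js = jensenshannon(p, q, base=2)  # JS distance, bounded in [0, 1]
# Hellinger distance, with the common 1/sqrt(2) normalization.
hellinger = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

print(f"Wasserstein={wd:.3f}  JS={js:.3f}  Hellinger={hellinger:.3f}")
```

Note that `scipy.spatial.distance.jensenshannon` already returns the distance (the square root of the JS divergence), so no extra transformation is needed.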

Divergences

 * Kullback-Leibler (KL) Divergence (Continuous and Categorical): KL divergence is a measure of the difference between two probability distributions. It can be used to compare the training distribution $$q(x)$$ and test distribution $$p(x)$$, providing a non-negative value that quantifies the dissimilarity between the two distributions. A higher KL divergence value indicates a more significant degree of covariate shift. However, it is important to note that KL divergence is not symmetric: the divergence from $$q(x)$$ to $$p(x)$$ is generally not equal to the divergence from $$p(x)$$ to $$q(x)$$.
 * Jensen-Shannon (JS) Divergence (Continuous and Categorical): The JS Divergence is a symmetric measure of the difference between two probability distributions, derived from the Kullback-Leibler (KL) Divergence. It can be interpreted as the average of the KL divergences between each distribution and a mixture of the two distributions. The JS Divergence is non-negative, with higher values indicating a greater degree of dissimilarity between the training and test distributions. Unlike the KL Divergence, the JS Divergence is symmetric, providing a more consistent measure of the divergence between the distributions.
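Both properties of the KL divergence — non-negativity and asymmetry — can be seen in a minimal sketch using `scipy.stats.entropy`, which computes the KL divergence when given two distributions. The toy histograms below are assumed values for illustration:

```python
import numpy as np
from scipy.stats import entropy

# Binned probability estimates for the training distribution q(x)
# and a shifted test distribution p(x) over four categories/bins.
q = np.array([0.1, 0.4, 0.4, 0.1])   # training
p = np.array([0.3, 0.3, 0.2, 0.2])   # test

kl_pq = entropy(p, q)   # KL(p || q)
kl_qp = entropy(q, p)   # KL(q || p)

print(kl_pq, kl_qp)  # both non-negative and, in general, unequal
```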

Two-sample tests

 * Kolmogorov-Smirnov test (Continuous): The two-sample Kolmogorov-Smirnov test is a non-parametric statistical hypothesis test used to assess whether two samples come from the same underlying distribution. It is designed for continuous distributions; applied to discrete data, its p-values are conservative. The test provides a p-value, which can be used to determine the presence of covariate shift: a small p-value (typically below a predetermined significance level, such as 0.05) indicates that the training and test distributions are significantly different, suggesting the presence of covariate shift.
 * Chi-Squared Test (Categorical): The Chi-Squared Test is a statistical method for detecting covariate shift in categorical features. It evaluates the association between the categorical variables representing the training and test distributions by comparing their observed frequencies in a contingency table to the expected frequencies under the assumption of independence. The test assesses the null hypothesis that there is no significant difference between the training and test distributions. If the null hypothesis is rejected, it suggests the presence of covariate shift. The Chi-Squared Test is applicable only for categorical variables and requires a sufficient sample size and minimum expected frequencies in the contingency table.
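Both tests are available in SciPy. The sketch below applies the two-sample Kolmogorov-Smirnov test to an assumed continuous feature with a clear shift, and the Chi-Squared test to assumed category counts arranged as a contingency table of training versus test frequencies:

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

rng = np.random.default_rng(7)

# Two-sample KS test on a continuous feature.
train_cont = rng.normal(0.0, 1.0, 1000)
test_cont = rng.normal(0.8, 1.0, 1000)   # clearly shifted
ks_stat, ks_p = ks_2samp(train_cont, test_cont)

# Chi-Squared test on a categorical feature: a contingency table
# of category counts (columns) in the training vs. test rows.
train_counts = [500, 300, 200]   # categories A, B, C in training
test_counts = [250, 350, 400]    # shifted frequencies in test
chi2, chi_p, dof, _ = chi2_contingency([train_counts, test_counts])

print(ks_p, chi_p)  # both p-values are tiny: covariate shift detected
```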

Python libraries

 * SciPy: SciPy is an open-source library for the Python programming language, widely used for scientific computing and data analysis. It provides tools for conducting statistical tests, such as the Chi-Squared Test and the Kolmogorov-Smirnov test, as well as tools for calculating statistical distances and divergences, all of which can be used to detect covariate shift between training and test distributions.
 * NannyML: An open-source Python library for model monitoring that includes functionality for detecting univariate and multivariate distribution drift and for estimating machine learning model performance without ground truth labels. NannyML offers statistical tests, statistical distances, and divergences.

Univariate vs. multivariate covariate shift
Covariate shift can occur in different forms depending on the number of features involved. Univariate covariate shift involves a single feature experiencing a change in distribution, whereas multivariate covariate shift can involve multiple features changing simultaneously or alterations in the correlation structure between features.

Univariate covariate shift
Univariate covariate shift occurs when the distribution of a single feature changes between the training and test datasets. As it involves only one dimension, univariate covariate shift is generally simpler to detect and address compared to its multivariate counterpart. Common techniques for detecting univariate covariate shift include statistical distances such as the Jensen-Shannon distance and Wasserstein (earth mover's) distance.

Multivariate covariate shift
Multivariate covariate shift arises when the distributions of multiple features change simultaneously between the training and test datasets or when the correlation structure between features is altered. The latter case, where the marginal distributions of individual features remain unchanged but the dependencies among them change, can be particularly challenging to detect and handle. In multivariate covariate shift, the complexity of the distribution shift and potential interactions between features require more advanced techniques for detection.

To address multivariate covariate shift, techniques such as Maximum Mean Discrepancy (MMD) with appropriate kernel functions that consider the relationships between multiple features can be employed.
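As a rough sketch, an RBF-kernel MMD estimator can be written in a few lines of NumPy. The example below uses two bivariate Gaussian samples whose marginal distributions match but whose correlation structure differs — the case described above that univariate checks can miss. The kernel bandwidth (`gamma`) and the covariance matrices are arbitrary assumptions for illustration:

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared MMD between samples X and Y under an RBF kernel
    (a biased estimator using full kernel-matrix means)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)

# Identical marginals (standard normal in each feature), but the
# correlation between the two features changes between the splits.
cov_train = [[1.0, 0.9], [0.9, 1.0]]   # strongly correlated features
cov_test = [[1.0, 0.0], [0.0, 1.0]]    # independent features
X = rng.multivariate_normal([0, 0], cov_train, 1000)
Y = rng.multivariate_normal([0, 0], cov_test, 1000)

print(rbf_mmd2(X, Y))  # clearly exceeds the near-zero same-distribution value
```

In practice the bandwidth is often set by a heuristic such as the median pairwise distance, and significance is assessed with a permutation test rather than by reading the raw MMD value.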

Internal covariate shift
The term internal covariate shift was introduced by Sergey Ioffe and Christian Szegedy in "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015). Internal covariate shift occurs when the distribution of the inputs to a given hidden layer of a neural network shifts because the parameters of the preceding layers change during training. It was hypothesized that batch normalization reduces internal covariate shift; however, this explanation is contested.
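For reference, the batch normalization transform itself is simple: each mini-batch of layer inputs is standardized per feature and then rescaled by learnable parameters. A minimal NumPy sketch of the training-time forward pass, where the scalar `gamma` and `beta` defaults stand in for the learned scale and shift:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize a mini-batch x (shape: batch_size x features):
    standardize each feature over the batch, then apply a learnable
    scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
# Hidden-layer inputs whose distribution drifts as earlier weights change.
h = rng.normal(3.0, 2.0, size=(64, 8))
out = batch_norm(h)
print(out.mean(), out.std())  # approximately 0 and 1 after normalization
```

Whatever the distribution of the incoming activations, each layer downstream of the normalization sees inputs with a stable per-feature mean and variance within the batch.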

Difference between covariate shift and concept drift
Covariate shift and concept drift are two related but distinct phenomena in machine learning, both of which involve changes in the underlying data distribution. Covariate shift and concept drift can occur independently or simultaneously, and both can negatively impact the performance of machine learning models.

The main difference between covariate shift and concept drift is that covariate shift refers to changes in the distribution of input features between the training and test datasets, while concept drift involves changes in the relationship between input features and the target variable over time. In covariate shift, the underlying relationship between the features and the target remains constant, whereas, in concept drift, this relationship itself changes due to evolving processes or external factors.