Data augmentation

Data augmentation is a statistical technique that allows maximum likelihood estimation from incomplete data. It has important applications in Bayesian analysis, and it is widely used in machine learning to reduce overfitting by training models on several slightly modified copies of existing data.

Synthetic oversampling techniques for traditional machine learning
Synthetic Minority Over-sampling Technique (SMOTE) is a method used to address imbalanced datasets in machine learning. In such datasets, the number of samples in different classes varies significantly, which can bias model performance toward the majority class. For example, in a medical diagnosis dataset with 90 samples representing healthy individuals and only 10 samples representing individuals with a particular disease, traditional algorithms may struggle to classify the minority class accurately. SMOTE rebalances the dataset by generating synthetic samples for the minority class. In the example above, SMOTE would repeatedly select a minority class sample at random, pick one of its nearest neighbors within the minority class, and generate a new sample at a random point on the line segment joining the two. This process increases the representation of the minority class, which can improve model performance on it.
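
The interpolation step can be illustrated with a minimal NumPy sketch. It is not a reference implementation of SMOTE: the function name `smote_sketch`, its parameters, and the brute-force neighbor search are assumptions made for illustration, and it presumes at least two minority samples with purely numeric features.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolation (illustrative only)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples
    # Brute-force pairwise Euclidean distances among minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbors.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the new sample at a random position on the joining segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical data: 10 minority samples with 2 features; 80 new samples
# would bring the minority class up to the 90 of the majority class.
X_min = np.random.default_rng(0).normal(size=(10, 2))
synthetic = smote_sketch(X_min, n_new=80, k=5, rng=1)
```

Because every synthetic point is a convex combination of two real minority samples, the generated data stays inside the region already covered by the minority class.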

Data augmentation for image classification
When convolutional neural networks grew larger in the mid-1990s, there was a lack of data to train them on, especially considering that part of the overall dataset had to be set aside for later testing. It was proposed to perturb existing data with affine transformations to create new examples with the same labels. These were complemented by so-called elastic distortions in 2003, and the technique was widely used by the 2010s. Data augmentation can enhance CNN performance and acts as a countermeasure against CNN profiling attacks.

Data augmentation has become fundamental in image classification, enriching training dataset diversity to improve model generalization and performance. The evolution of this practice has introduced a broad spectrum of techniques, including geometric transformations, color space adjustments, and noise injection.

Geometric Transformations
Geometric transformations alter the spatial properties of images to simulate different perspectives, orientations, and scales. Common techniques include:


 * Rotation: Rotating images by a specified degree to help models recognize objects at various angles.
 * Flipping: Reflecting images horizontally or vertically to introduce variability in orientation.
 * Cropping: Removing sections of the image to focus on particular features or simulate closer views.
 * Translation: Shifting images in different directions to teach models positional invariance.
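
The four operations listed above can be sketched with plain NumPy array indexing. The function names are hypothetical, images are assumed to be arrays of shape (height, width) or (height, width, channels), and rotation is restricted to multiples of 90 degrees because arbitrary angles require interpolation, which this sketch leaves out.

```python
import numpy as np

def rotate90(img, k=1):
    """Rotate by k * 90 degrees counterclockwise in the image plane."""
    return np.rot90(img, k=k, axes=(0, 1))

def flip_horizontal(img):
    """Mirror the image left to right."""
    return img[:, ::-1]

def random_crop(img, out_h, out_w, rng):
    """Cut a random out_h x out_w window out of the image."""
    top = rng.integers(0, img.shape[0] - out_h + 1)
    left = rng.integers(0, img.shape[1] - out_w + 1)
    return img[top:top + out_h, left:left + out_w]

def translate(img, dy, dx):
    """Shift by (dy, dx) pixels, filling vacated pixels with zeros."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Part of the source image that remains visible after the shift.
    src = img[max(0, -dy):max(0, min(h, h - dy)),
              max(0, -dx):max(0, min(w, w - dx))]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # hypothetical RGB image
augmented = translate(flip_horizontal(random_crop(image, 28, 28, rng)), 2, -3)
```

In practice such transformations are usually applied at random during training, so the model sees a different variant of each image in every epoch.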

Color Space Transformations
Color space transformations modify the color properties of images, addressing variations in lighting, color saturation, and contrast. Techniques include:


 * Brightness Adjustment: Varying the image's brightness to simulate different lighting conditions.
 * Contrast Adjustment: Changing the contrast to help models recognize objects in both low-contrast and high-contrast images.
 * Saturation Adjustment: Altering saturation to prepare models for images with diverse color intensities.
 * Color Jittering: Randomly adjusting brightness, contrast, saturation, and hue to introduce color variability.
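
A minimal sketch of these adjustments follows, assuming RGB images stored as floating-point arrays in the range [0, 1] with shape (height, width, 3). The function names and the `strength` parameter are hypothetical, the gray level is taken as the unweighted channel mean for simplicity, and hue adjustment is left out because it requires a conversion to a hue-based color space.

```python
import numpy as np

def adjust_brightness(img, factor):
    """Scale all pixel values; factor > 1 brightens, factor < 1 darkens."""
    return np.clip(img * factor, 0.0, 1.0)

def adjust_contrast(img, factor):
    """Stretch or compress pixel values around the global mean intensity."""
    mean = img.mean()
    return np.clip(mean + factor * (img - mean), 0.0, 1.0)

def adjust_saturation(img, factor):
    """Blend each pixel with its gray level; factor 0 gives a grayscale image."""
    gray = img.mean(axis=2, keepdims=True)  # unweighted channel mean (simplification)
    return np.clip(gray + factor * (img - gray), 0.0, 1.0)

def color_jitter(img, rng, strength=0.2):
    """Apply random brightness, contrast, and saturation changes (no hue)."""
    b, c, s = rng.uniform(1 - strength, 1 + strength, size=3)
    return adjust_saturation(adjust_contrast(adjust_brightness(img, b), c), s)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # hypothetical RGB image
jittered = color_jitter(image, rng)
```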

Noise Injection
Injecting noise into images simulates real-world imperfections, teaching models to ignore irrelevant variations. Techniques involve:


 * Gaussian Noise: Adding Gaussian noise mimics sensor noise or graininess.
 * Salt and Pepper Noise: Introducing black or white pixels at random simulates sensor dust or dead pixels.
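
Both kinds of noise can be sketched in a few lines of NumPy, again assuming floating-point images in the range [0, 1]. The function names and the `sigma` and `amount` parameters are hypothetical choices for this illustration.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)

def add_salt_and_pepper(img, amount, rng):
    """Set roughly a fraction `amount` of pixels to black or white."""
    out = img.copy()
    u = rng.random(img.shape[:2])    # one draw per pixel, shared by all channels
    out[u < amount / 2] = 0.0        # pepper: black pixels
    out[u > 1 - amount / 2] = 1.0    # salt: white pixels
    return out

rng = np.random.default_rng(0)
image = np.full((32, 32, 3), 0.5)    # hypothetical uniform gray image
noisy = add_salt_and_pepper(add_gaussian_noise(image, 0.05, rng), 0.02, rng)
```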

Data augmentation for signal processing
Residual or block bootstrap can be used for time series augmentation.
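
As an illustration, the following sketch implements one common variant, the moving block bootstrap, in which overlapping blocks of consecutive observations are resampled with replacement so that short-range temporal dependence is preserved inside each block. The function name and the choice of block length are assumptions for this example.

```python
import numpy as np

def moving_block_bootstrap(series, block_len, rng):
    """Resample a series by concatenating randomly chosen contiguous blocks."""
    series = np.asarray(series)
    n = len(series)
    n_blocks = -(-n // block_len)  # ceiling division
    # Each block may start anywhere it fits entirely inside the series.
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    blocks = [series[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]  # trim to the original length

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 8 * np.pi, 200))  # hypothetical time series
resampled = moving_block_bootstrap(series, block_len=20, rng=rng)
```

The block length trades off two effects: longer blocks preserve more of the temporal structure, while shorter blocks yield more varied resampled series.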

Biological signals
Synthetic data augmentation is particularly important for machine learning classification of biological data, which tend to be high-dimensional and scarce. The applications of robotic control and augmentation in disabled and able-bodied subjects still rely mainly on subject-specific analyses. Data scarcity is notable in signal processing problems such as the analysis of electromyography signals from people with Parkinson's disease, which are difficult to source. Zanini et al. noted that it is possible to use a generative adversarial network (in particular, a DCGAN) to perform style transfer in order to generate synthetic electromyographic signals that correspond to those exhibited by people with Parkinson's disease.

These approaches are also important in electroencephalography (brainwaves). Wang et al. explored the use of deep convolutional neural networks for EEG-based emotion recognition; their results show that emotion recognition improved when data augmentation was used.

A common approach is to generate synthetic signals by rearranging components of real data. Lotte proposed a method of "Artificial Trial Generation Based on Analogy", in which three data examples $$x_{1}, x_{2}, x_{3}$$ are used to form an artificial example $$x_{synthetic}$$ that is to $$x_{3}$$ what $$x_{2}$$ is to $$x_{1}$$. A transformation is applied to $$x_{1}$$ to make it more similar to $$x_{2}$$; the same transformation is then applied to $$x_{3}$$, which generates $$x_{synthetic}$$. This approach was shown to improve the performance of a Linear Discriminant Analysis classifier on three different datasets.
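
The analogy idea can be illustrated with the simplest possible transformation, an additive offset. This is an assumption made purely for illustration: the text above does not specify the transformation used in the cited method, and the function name `additive_analogy` is hypothetical.

```python
import numpy as np

def additive_analogy(x1, x2, x3):
    """Form x_synthetic so that x_synthetic - x3 equals x2 - x1.

    Assumption: the "transformation" is a plain additive offset, chosen
    only to illustrate the analogy; it is not the cited method.
    """
    offset = x2 - x1      # what turns x1 into x2
    return x3 + offset    # apply the same change to x3

rng = np.random.default_rng(0)
# Hypothetical trials: 8 channels by 250 time samples each.
x1, x2, x3 = (rng.normal(size=(8, 250)) for _ in range(3))
x_synthetic = additive_analogy(x1, x2, x3)
```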

Research has shown that relatively simple techniques can have a substantial effect. For example, Freer observed that introducing noise into gathered data to form additional data points improved the learning ability of several models which otherwise performed relatively poorly. Tsinganos et al. studied the approaches of magnitude warping, wavelet decomposition, and synthetic surface EMG models (generative approaches) for hand gesture recognition, finding classification performance increases of up to 16% when augmented data was introduced during training. More recently, data augmentation studies have begun to focus on deep learning, and more specifically on the ability of generative models to create artificial data that is then introduced during training of the classification model. In 2018, Luo et al. observed that useful EEG signal data could be generated by conditional Wasserstein generative adversarial networks (GANs) and then added to the training set in a classical train-test learning framework. The authors found that classification performance improved when such techniques were used.
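
Two of the simple techniques mentioned above, noise injection and magnitude warping, can be sketched for a one-dimensional signal as follows. The function names and parameters are hypothetical, and the warping curve is built by linear interpolation between random knots as a simplification; smoother curves such as splines could be used instead.

```python
import numpy as np

def jitter(signal, sigma, rng):
    """Add zero-mean Gaussian noise to every sample of the signal."""
    return signal + rng.normal(0.0, sigma, size=signal.shape)

def magnitude_warp(signal, rng, n_knots=4, sigma=0.2):
    """Scale a 1-D signal by a slowly varying random curve centered on 1."""
    n = signal.shape[0]
    knot_pos = np.linspace(0, n - 1, n_knots + 2)         # includes both ends
    knot_val = rng.normal(1.0, sigma, size=n_knots + 2)   # random scale factors
    curve = np.interp(np.arange(n), knot_pos, knot_val)   # piecewise-linear curve
    return signal * curve

rng = np.random.default_rng(0)
emg = rng.normal(size=500)  # hypothetical single-channel recording
augmented = [magnitude_warp(jitter(emg, 0.05, rng), rng) for _ in range(5)]
```

Each augmented copy keeps the label of the recording it was derived from, so five variants of one trial yield five additional training examples.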

Mechanical signals
The prediction of mechanical signals based on data augmentation supports technological innovation in areas such as new energy dispatch, 5G communications, and robotics control engineering. In 2022, Yang et al. integrated constraints, optimization, and control into a deep network framework based on data augmentation and data pruning with spatio-temporal data correlation, improving the interpretability, safety, and controllability of deep learning in real industrial projects through explicit mathematical programming equations and analytical solutions.