
Weak supervision is a branch of machine learning in which noisy, limited, or imprecise sources are used to provide a supervision signal for labeling large amounts of training data in a supervised learning setting. This approach alleviates the burden of obtaining hand-labeled data sets, which can be costly or impractical. Instead, inexpensive weak labels are employed with the understanding that they are imperfect but can nonetheless be used to create a strong predictive model.

Problem of labeled training data
Machine learning models and techniques are increasingly accessible to researchers and developers; the real-world usefulness of these models, however, depends on access to high-quality labeled training data. This need for labeled training data often proves to be a significant obstacle to the application of machine learning models within an organization or industry. This bottleneck effect manifests itself in various ways, including the following examples:

Insufficient quantity of labeled data

When machine learning techniques are initially used in new applications or industries, there is often not enough training data available to apply traditional processes. Some industries have the benefit of decades' worth of training data readily available; those that do not are at a significant disadvantage. In such cases, obtaining training data may be impractical, expensive, or impossible without waiting years for its accumulation.

Insufficient subject-matter expertise to label data

When labeling training data requires specific relevant expertise, creation of a usable training data set can quickly become prohibitively expensive. This issue is likely to occur, for example, in biomedical or security-related applications of machine learning.

Insufficient time to label and prepare data

Most of the time required to implement machine learning is spent in preparing data sets. When an industry or research field deals with problems that are, by nature, rapidly evolving, it can be impossible to collect and prepare data quickly enough for results to be useful in real-world applications. This issue could occur, for example, in fraud detection or cybersecurity applications.

Other areas of machine learning serve as alternatives to weak supervision insofar as they are likewise motivated by the demand for a greater quantity and quality of labeled training data but employ different high-level techniques to meet this demand. These alternative approaches include active learning, semi-supervised learning, and transfer learning.

Simple definition
The weak supervision setting requires the following components:


 * Unlabeled data $$X_u = x_1, \ldots, x_N$$;
 * One or more weak supervision sources $$\tilde{p}_i(y \mid x),\ i = 1, \ldots, M$$, each of which has:
   * A coverage set $$C_i$$, representing the set of points $$x$$ over which it is defined;
   * An accuracy, defined as the expected probability of the true label $$y^*$$ over its coverage set, which is assumed to be $$<1.0$$.
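The components above can be made concrete with a small sketch in plain Python. The data, the rule-based sources, and all names here are illustrative assumptions, not part of any library: each weak source abstains outside its coverage set, and its empirical accuracy is measured only over the points it covers.

```python
# Toy weak supervision setup: unlabeled points, weak sources with
# partial coverage, and an accuracy estimate over each coverage set.
# All data and source names are illustrative.

ABSTAIN = None

# Unlabeled data X_u = x_1, ..., x_N (here: short text snippets).
X_u = ["win money now", "meeting at noon", "free prize inside",
       "lunch tomorrow?", "claim your free reward",
       "free range eggs recipe"]

# Hidden true labels y* (1 = spam, 0 = not spam), used here only to
# illustrate measuring accuracy; a real setting would not have these.
y_star = [1, 0, 1, 0, 1, 0]

# Weak source 1: fires only when "free" appears (its coverage set C_1).
def source_free(x):
    return 1 if "free" in x else ABSTAIN

# Weak source 2: fires when "money" or "prize" appears (coverage C_2).
def source_money(x):
    return 1 if ("money" in x or "prize" in x) else ABSTAIN

def coverage_and_accuracy(source, X, y):
    """Return the coverage set (as indices) and the empirical
    accuracy of the source over that coverage set."""
    covered = [i for i, x in enumerate(X) if source(x) is not ABSTAIN]
    correct = sum(source(X[i]) == y[i] for i in covered)
    return covered, correct / len(covered)

C1, acc1 = coverage_and_accuracy(source_free, X_u, y_star)
C2, acc2 = coverage_and_accuracy(source_money, X_u, y_star)
print(C1, acc1)  # source 1 covers three points, one labeled wrongly
print(C2, acc2)
```

As the sketch shows, a source's accuracy is defined only over its coverage set; source 1 mislabels the benign "free range eggs recipe", which is exactly the kind of noise weak supervision assumes.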

Applications of weak supervision
Applications of weak supervision are numerous and varied within the machine learning research community.

Stanford University researchers created Snorkel, an open-source system for quickly assembling training data through weak supervision. Snorkel follows the data programming paradigm: developers write labeling functions, which are used to programmatically label data, and the system estimates the accuracy of those labeling functions from their agreements and disagreements, without requiring ground-truth labels. In this way, potentially low-quality inputs can be used to create high-quality models.
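The data programming idea can be sketched in a few lines of plain Python. This is not the actual Snorkel API; the labeling functions and the simple majority-vote combiner below are hypothetical illustrations of how conflicting programmatic votes become a single training label.

```python
# Minimal sketch of the data programming idea: labeling functions
# vote on each point, and their (possibly conflicting, possibly
# abstaining) votes are combined into a single training label.
# Illustrative only -- not the real Snorkel API.

from collections import Counter

ABSTAIN = -1
SPAM, HAM = 1, 0

def lf_contains_offer(x):
    return SPAM if "offer" in x else ABSTAIN

def lf_contains_meeting(x):
    return HAM if "meeting" in x else ABSTAIN

def lf_many_exclamations(x):
    return SPAM if x.count("!") >= 2 else ABSTAIN

labeling_functions = [lf_contains_offer, lf_contains_meeting,
                      lf_many_exclamations]

def majority_vote(x):
    """Combine labeling-function votes; abstain if no LF fires."""
    votes = [lf(x) for lf in labeling_functions if lf(x) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

docs = ["limited offer!! act now!!", "meeting moved to 3pm", "hello"]
labels = [majority_vote(d) for d in docs]
print(labels)  # [1, 0, -1]
```

Snorkel itself replaces this naive majority vote with a learned model of each labeling function's accuracy, so that more reliable functions carry more weight when votes conflict.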

Researchers at University of Massachusetts Amherst propose augmenting traditional active learning approaches by soliciting labels on features rather than instances within a data set.

Researchers at Johns Hopkins University propose reducing the cost of labeling data sets by having annotators provide rationales supporting each of their data annotations, then using those rationales to train both discriminative and generative models for labeling additional data.