Contrast set learning

Contrast set learning is a form of association rule learning that seeks to identify meaningful differences between separate groups by reverse-engineering the key predictors that identify for each particular group. For example, given a set of attributes for a pool of students (labeled by degree type), a contrast set learner would identify the contrasting features between students seeking bachelor's degrees and those working toward PhD degrees.

Overview
A common practice in data mining is to classify, to look at the attributes of an object or situation and make a guess at what category the observed item belongs to. As new evidence is examined (typically by feeding a training set to a learning algorithm), these guesses are refined and improved. Contrast set learning works in the opposite direction. While classifiers read a collection of data and collect information that is used to place new data into a series of discrete categories, contrast set learning takes the category that an item belongs to and attempts to reverse engineer the statistical evidence that identifies an item as a member of a class. That is, contrast set learners seek rules associating attribute values with changes to the class distribution. They seek to identify the key predictors that contrast one classification from another.

For example, an aerospace engineer might record data on test launches of a new rocket. Measurements would be taken at regular intervals throughout the launch, noting factors such as the trajectory of the rocket, operating temperatures, external pressures, and so on. If the rocket launch fails after a number of successful tests, the engineer could use contrast set learning to distinguish between the successful and failed tests. A contrast set learner will produce a set of association rules that, when applied, will indicate the key predictors of each failed tests versus the successful ones (the temperature was too high, the wind pressure was too high, etc.).

Contrast set learning is a form of association rule learning. Association rule learners typically offer rules linking attributes commonly occurring together in a training set (for instance, people who are enrolled in four-year programs and take a full course load tend to also live near campus). Instead of finding rules that describe the current situation, contrast set learners seek rules that differ meaningfully in their distribution across groups (and thus, can be used as predictors for those groups). For example, a contrast set learner could ask, “What are the key identifiers of a person with a bachelor's degree or a person with a PhD, and how do people with PhD's and bachelor’s degrees differ?”

Standard classifier algorithms, such as C4.5, have no concept of class importance (that is, they do not know if a class is "good" or "bad"). Such learners cannot bias or filter their predictions towards certain desired classes. As the goal of contrast set learning is to discover meaningful differences between groups, it is useful to be able to target the learned rules towards certain classifications. Several contrast set learners, such as MINWAL or the family of TAR algorithms, assign weights to each class in order to focus the learned theories toward outcomes that are of interest to a particular audience. Thus, contrast set learning can be thought of as a form of weighted class learning.

Example: Supermarket Purchases
The differences between standard classification, association rule learning, and contrast set learning can be illustrated with a simple supermarket metaphor. In the following small dataset, each row is a supermarket transaction and each "1" indicates that the item was purchased (a "0" indicates that the item was not purchased):

Given this data,
 * Association rule learning may discover that customers that buy onions and potatoes together are likely to also purchase hamburger meat.
 * Classification may discover that customers that bought onions, potatoes, and hamburger meats were purchasing items for a cookout.
 * Contrast set learning may discover that the major difference between customers shopping for a cookout and those shopping for an anniversary dinner are that customers acquiring items for a cookout purchase onions, potatoes, and hamburger meat (and do not purchase foie gras or champagne).

Treatment learning
Treatment learning is a form of weighted contrast-set learning that takes a single desirable group and contrasts it against the remaining undesirable groups (the level of desirability is represented by weighted classes). The resulting "treatment" suggests a set of rules that, when applied, will lead to the desired outcome.

Treatment learning differs from standard contrast set learning through the following constraints:
 * Rather than seeking the differences between all groups, treatment learning specifies a particular group to focus on, applies a weight to this desired grouping, and lumps the remaining groups into one "undesired" category.
 * Treatment learning has a stated focus on minimal theories. In practice, treatment are limited to a maximum of four constraints (i.e., rather than stating all of the reasons that a rocket differs from a skateboard, a treatment learner will state one to four major differences that predict for rockets at a high level of statistical significance).

This focus on simplicity is an important goal for treatment learners. Treatment learning seeks the smallest change that has the greatest impact on the class distribution.

Conceptually, treatment learners explore all possible subsets of the range of values for all attributes. Such a search is often infeasible in practice, so treatment learning often focuses instead on quickly pruning and ignoring attribute ranges that, when applied, lead to a class distribution where the desired class is in the minority.

Example: Boston housing data
The following example demonstrates the output of the treatment learner TAR3 on a dataset of housing data from the city of Boston (a nontrivial public dataset with over 500 examples). In this dataset, a number of factors are collected for each house, and each house is classified according to its quality (low, medium-low, medium-high, and high). The desired class is set to "high", and all other classes are lumped together as undesirable.

The output of the treatment learner is as follows:

Baseline class distribution: low: 29% medlow: 29% medhigh: 21% high: 21% Suggested Treatment: [PTRATIO=[12.6..16), RM=[6.7..9.78)] New class distribution: low: 0% medlow: 0% medhigh: 3% high: 97%

With no applied treatments (rules), the desired class represents only 21% of the class distribution. However, if one filters the data set for houses with 6.7 to 9.78 rooms and a neighborhood parent-teacher ratio of 12.6 to 16, then 97% of the remaining examples fall into the desired class (high-quality houses).

Algorithms
There are a number of algorithms that perform contrast set learning. The following subsections describe two examples.

STUCCO
The STUCCO contrast set learner treats the task of learning from contrast sets as a tree search problem where the root node of the tree is an empty contrast set. Children are added by specializing the set with additional items picked through a canonical ordering of attributes (to avoid visiting the same nodes twice). Children are formed by appending terms that follow all existing terms in a given ordering. The formed tree is searched in a breadth-first manner. Given the nodes at each level, the dataset is scanned and the support is counted for each group. Each node is then examined to determine if it is significant and large, if it should be pruned, and if new children should be generated. After all significant contrast sets are located, a post-processor selects a subset to show to the user - the low order, simpler results are shown first, followed by the higher order results which are "surprising and significantly different. "

The support calculation comes from testing a null hypothesis that the contrast set support is equal across all groups (i.e., that contrast set support is independent of group membership). The support count for each group is a frequency value that can be analyzed in a contingency table where each row represents the truth value of the contrast set and each column variable indicates the group membership frequency. If there is a difference in proportions between the contrast set frequencies and those of the null hypothesis, the algorithm must then determine if the differences in proportions represent a relation between variables or if it can be attributed to random causes. This can be determined through a chi-square test comparing the observed frequency count to the expected count.

Nodes are pruned from the tree when all specializations of the node can never lead to a significant and large contrast set. The decision to prune is based on:
 * The minimum deviation size: The maximum difference between the support of any two groups must be greater than a user-specified threshold.
 * Expected cell frequencies: The expected cell frequencies of a contingency table can only decrease as the contrast set is specialized. When these frequencies are too small, the validity of the chi-square test is violated.
 * $$\chi^2$$ bounds: An upper bound is kept on the distribution of a statistic calculated when the null hypothesis is true. Nodes are pruned when it is no longer possible to meet this cutoff.

TAR3
The TAR3 weighted contrast set learner is based on two fundamental concepts - the lift and support of a rule set.

The lift of a set of rules is the change that some decision makes to a set of examples after imposing that decision (i.e., how the class distribution shifts in response to the imposition of a rule). TAR3 seeks the smallest set of rules which induces the biggest changes in the sum of the weights attached to each class multiplied by the frequency at which each class occurs. The lift is calculated by dividing the score of the set in which the set of rules is imposed by the score of the baseline set (i.e., no rules are applied). Note that by reversing the lift scoring function, the TAR3 learner can also select for the remaining classes and reject the target class.

It is problematic to rely on the lift of a rule set alone. Incorrect or misleading data noise, if correlated with failing examples, may result in an overfitted rule set. Such an overfitted model may have a large lift score, but it does not accurately reflect the prevailing conditions within the dataset. To avoid overfitting, TAR3 utilizes a support threshold and rejects all rules that fall on the wrong side of this threshold. Given a target class, the support threshold is a user-supplied value (usually 0.2) which is compared to the ratio of the frequency of the target class when the rule set has been applied to the frequency of that class in the overall dataset. TAR3 rejects all sets of rules with support lower than this threshold.

By requiring both a high lift and a high support value, TAR3 not only returns ideal rule sets, but also favors smaller sets of rules. The fewer rules adopted, the more evidence that will exist supporting those rules.

The TAR3 algorithm only builds sets of rules from attribute value ranges with a high heuristic value. The algorithm determines which ranges to use by first determining the lift score of each attribute’s value ranges. These individual scores are then sorted and converted into a cumulative probability distribution. TAR3 randomly selects values from this distribution, meaning that low-scoring ranges are unlikely to be selected. To build a candidate rule set, several ranges are selected and combined. These candidate rule sets are then scored and sorted. If no improvement is seen after a user-defined number of rounds, the algorithm terminates and returns the top-scoring rule sets.