User:Floydbe/N-Prog

N-Prog is an algorithm for generating coarse-grained proactive diversity. In particular, it creates variants of an input program for use in an N-variant system. The primary use of N-variant systems generated by N-Prog is the detection of defects in the program at runtime; a secondary use could be the improvement of a program's test suite. N-Prog was developed in 2015 by Martin Kellogg, Benjamin Floyd, and Westley Weimer at the University of Virginia, along with Stephanie Forrest at the University of New Mexico.

Motivation
Errors in computer software are both pervasive and expensive. The cost of a defect typically increases with the length of time the defect goes unnoticed. Despite these high costs, mature, established software projects continue to ship with both known and unknown bugs, increasing risk of compromise and incorrect behavior. Consequently, techniques that help detect defects in software are of critical importance to the software engineering community. Many such techniques exist, including both static and dynamic approaches.

N-Variant Systems
One basic dynamic approach to detecting defects is runtime monitoring, in which observed program behavior is compared to some policy or expected oracular output. N-version programming is a special kind of runtime monitoring in which multiple independent implementations serve as oracles for each other. Input is run through all of the versions simultaneously; if any of them diverge (produce different output), the N-version system may have detected a defect. Unfortunately, Knight and Leveson demonstrated in 1986 that independent teams do not typically create implementations with independent failure modes. An alternative method, called N-variant programming, seeks to remedy this by generating the oracle copies through transformations of the original program. Researchers have typically used semantics-preserving transformations to generate the variants. N-variant systems generated in this manner can detect certain classes of security defects.
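The core check of an N-variant system can be illustrated with a minimal sketch (this is a toy model for exposition, not part of N-Prog itself): each variant is treated as a function from input to output, and any disagreement among outputs flags a possible defect.

```python
# Toy sketch of N-variant divergence detection. Variants are modeled
# as functions; a mismatch among their outputs signals a possible defect.

def run_n_variant(variants, inp):
    """Run the same input through all variants and flag any divergence."""
    outputs = [v(inp) for v in variants]
    diverged = len(set(outputs)) > 1  # any mismatch counts as divergence
    return outputs, diverged

# Three toy "variants" of an absolute-value routine; one mishandles
# negative inputs (hypothetical example).
correct = lambda x: abs(x)
also_correct = lambda x: x if x >= 0 else -x
buggy = lambda x: x  # forgets to negate negative inputs

_, ok = run_n_variant([correct, also_correct, buggy], 5)    # all agree
_, bad = run_n_variant([correct, also_correct, buggy], -5)  # buggy diverges
```

On input 5 all three variants agree, so no divergence is reported; on input -5 the buggy variant's output differs and the system flags it.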

Coarse-Grained Diversity
While semantics-preserving transformations can be useful in detecting some classes of defects, their limitations imply that there are many types of defects they can never detect. In particular, if the defect lies in the semantics of the program (i.e., the developers actually coded something incorrectly), then semantics-preserving transformations have no chance of revealing it. Instead, N-Prog makes use of much more powerful, non-semantics-preserving transformations. These provide much higher defect-detection power, but are considerably less safe. To validate that its variants still adhere to all of the original program's specifications, N-Prog evaluates all variants against the program's test suite.

Concept
At a high level, the N-Prog algorithm contains two primary phases:
 * The Generation Phase creates many single-edit variants (only one statement-level mutation away from the original). It tests each candidate variant against the test suite. Variants that still pass all of the tests are called neutral variants. Only the neutral variants are passed to the next phase.
 * The Composition Phase receives several neutral single-edit variants and attempts to combine several of them into one multi-edit variant (henceforth, cluster). The clusters are tested for neutrality; neutral clusters are output as variants for use in an N-variant system.

Input
N-Prog takes as input a program written in C and a test suite for that program. The program should pass all of the tests in the test suite (if negative tests are present, automatic program repair techniques such as GenProg could be used to localize and fix the defect). Finally, a number of user-chosen values are incorporated into the N-Prog algorithm:
 * The number of variants, N, to be created and used in the final N-variant system
 * A search budget, x, for generating single-edit variants
 * A search budget, y, for cluster generation
 * The maximum number of mutations, k, to be placed in a cluster

Output
N-Prog returns a set of variants of the original input program.

Walkthrough
The pseudocode for the N-Prog algorithm is in the figure on the right. This section elaborates on that pseudocode. The algorithm begins with the Generation Phase, in which many single-edit variants of the program P are made. Specifically, x variants of the program are generated (lines 3, 7, 8). Each variant is evaluated against the test suite T (line 5). If it passes all tests, it is called neutral and added to the set neutral_vars (line 6).

Variants are created by performing a single mutation on the subject program. N-Prog only mutates statements that are covered by the test suite (lines 4,5). This is because any edit to a statement not covered by a test trivially produces a neutral variant. The specific mutations used in single_mutation are discussed in the section "Mutation Operators".

Once the set neutral_vars has been generated, the Composition Phase begins. In this phase, multiple single-edit variants are combined into larger, multi-edit variants called clusters. Specifically, N-Prog generates at most N clusters (line 11). Each candidate cluster is generated by choosing a random subset of size k of the set neutral_vars (line 12). The candidate cluster is evaluated against the test suite; if it is neutral, it is added to the final output set (lines 13, 14). A search budget, y, is imposed on the Composition Phase. Without this budget, termination would not be guaranteed (and thus N-Prog would not be an algorithm). Under this budget, if y non-neutral clusters are generated in a row, N-Prog halves k to attempt making smaller clusters, which may be more likely to be neutral (lines 15, 17-19). If k ever reaches 1 via this process, the Composition Phase is doing no work (the Generation Phase already creates k=1 clusters); in this case, N-Prog returns the current set of final clusters, even if it contains fewer than N elements (lines 20, 21).

If the search budget for Composition never forces k to be adjusted, then the output of N-Prog is a set of N clusters, each k edits away from the original program. However, if k is adjusted, some of the output clusters may have fewer than k edits; additionally, fewer than N clusters may be output (indeed, it is possible, though empirically unlikely, for the output to be the empty set).
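The walkthrough above can be sketched in Python. The program representation, test execution, mutation, and combination steps are abstracted behind helper callables (`mutate`, `passes_all_tests`, `combine`); these names are assumptions for illustration, not the authors' implementation.

```python
import random

def n_prog(program, passes_all_tests, mutate, combine, N, x, y, k):
    """Illustrative sketch of the N-Prog algorithm described above."""
    # --- Generation Phase: collect neutral single-edit variants ---
    neutral_vars = []
    for _ in range(x):
        variant = mutate(program)        # one statement-level edit
        if passes_all_tests(variant):    # neutral: passes the full suite
            neutral_vars.append(variant)

    # --- Composition Phase: merge edits into multi-edit clusters ---
    output, failures = [], 0
    while len(output) < N:
        subset = random.sample(neutral_vars, min(k, len(neutral_vars)))
        cluster = combine(subset)
        if passes_all_tests(cluster):
            output.append(cluster)
            failures = 0
        else:
            failures += 1
            if failures >= y:            # budget exhausted: shrink k
                k = k // 2               # matches floor(k/2) in the text
                failures = 0
                if k <= 1:               # composition can do no more work
                    return output
    return output
```

The `k = k // 2` step mirrors the $$\lfloor k/2 \rfloor$$ adjustment, and the early return when k reaches 1 mirrors lines 20-21 of the pseudocode.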

Runtime
The primary cost of running N-Prog comes from repeated evaluations of a program against its test suite. While other aspects (generating variants, clusters, etc.) do take time, running a quality test suite on a non-trivial program is the dominant cost. Therefore, to analyze the runtime of N-Prog, it suffices to count the number of test suite evaluations implied by the user-provided parameters.

Generation Phase
In this phase, the search budget x defines the number of variants that are created and evaluated against the test suite. Recall that as soon as a variant fails a single test case, it cannot be labeled neutral, so the remaining tests on that variant can be skipped. In the worst case, however, all x generated variants would be neutral, requiring x complete test suite evaluations. In practice, about 33% of generated variants of a well-tested program are neutral, so for roughly the other 67% of variants only a partial test suite run is needed.
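The early-exit behavior described above can be made concrete with a small sketch; the representation of tests as predicates over the variant is an assumption for illustration.

```python
# Sketch of the neutrality check with early termination: execution
# stops at the first failing test, so only variants that are actually
# neutral pay for a full test-suite evaluation.

def is_neutral(variant, tests):
    """Return (neutral?, number of tests actually executed)."""
    evaluations = 0
    for test in tests:
        evaluations += 1
        if not test(variant):
            return False, evaluations  # early exit on first failure
    return True, evaluations           # full suite ran: variant is neutral

# Toy test suite over an integer "variant" (hypothetical example).
tests = [lambda v: v % 2 == 0, lambda v: v > 0, lambda v: v < 100]
print(is_neutral(4, tests))  # neutral: all 3 tests run
print(is_neutral(3, tests))  # fails the first test, stops after 1
```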

Composition Phase
In the Generation Phase, the worst-case runtime occurs when all generated variants are neutral. The Composition Phase is considerably more complex. We proceed under the assumption that every test suite evaluation must be complete (i.e., never ends early due to a failed test); this is unlikely, but possible if each candidate cluster fails only the last test in the test suite. There are then two likely ways to reach worst-case runtime. One occurs when every generated cluster is non-neutral: y full test suite evaluations are made, then k is adjusted to $$\lfloor k/2 \rfloor$$, and this repeats until k reaches 1, for a total of $$y \cdot \log_2(k)$$ test suite evaluations. The other occurs when the last of every y clusters is neutral (i.e., if $$y=50$$, then 49 in a row are non-neutral and the 50th is neutral). If this happens for each of the N iterations, the number of test suite evaluations is $$N \cdot y$$.

So the worst-case runtime of the Composition Phase is $$\max(y \cdot \log_2(k),\; y \cdot N)$$. Which term dominates depends on the user-chosen parameters N and k; it is determined by computing $$\max(N, \log_2(k))$$.
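To make the bound concrete, both terms can be computed for sample parameter choices (the specific values below are illustrative, not from the original work):

```python
import math

def worst_case_evaluations(N, k, y):
    """Worst-case test suite evaluations in the Composition Phase."""
    shrinking = y * math.log2(k)   # every cluster non-neutral: k halves
    succeeding = y * N             # last of every y clusters is neutral
    return max(shrinking, succeeding)

# With N = 5 variants, k = 8 edits per cluster, and budget y = 50:
# y*log2(k) = 150 evaluations vs. y*N = 250, so the y*N term dominates.
print(worst_case_evaluations(5, 8, 50))  # 250.0
```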

Role of Randomness
Many aspects of N-Prog involve randomness, so two independent runs of N-Prog could produce vastly different results. In the Generation Phase, the choice of which mutation to apply is random (in the current implementation, there are two possible mutations, each selected with 50% probability). In addition, the statements to be mutated are chosen randomly from the set of covered statements. In the Composition Phase, the set of k edits used to form a candidate cluster is chosen randomly from the larger set of neutral single edits.

Mutation Operators
The mutation operators used in N-Prog are taken from GenProg. In particular, it uses the append and delete operators because they are atomic (i.e. not built out of other operators); additionally, it has been shown that they lead to a high neutral rate in single-edit variants. They function as follows:
 * append(x,y) - place a copy of statement x directly after statement y
 * delete(x) - remove statement x
It is possible that the performance of N-Prog could be improved by investigating other mutation operators, but such an investigation has not been done to date.
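On a program abstracted as a simple list of statements (the real implementations operate on C abstract syntax trees), the two operators and the random selection described in "Role of Randomness" can be sketched as follows; the function names are hypothetical.

```python
import random

def append_stmt(stmts, x, y):
    """append(x,y): place a copy of statement x directly after statement y."""
    out = list(stmts)
    out.insert(y + 1, stmts[x])
    return out

def delete_stmt(stmts, x):
    """delete(x): remove statement x."""
    return stmts[:x] + stmts[x + 1:]

def single_mutation(stmts, covered):
    """Apply one operator (50-50 chance) to a random covered statement."""
    x = random.choice(covered)
    if random.random() < 0.5:
        y = random.choice(covered)
        return append_stmt(stmts, x, y)
    return delete_stmt(stmts, x)

prog = ["a = 1", "b = a + 1", "print(b)"]
print(append_stmt(prog, 0, 2))  # ['a = 1', 'b = a + 1', 'print(b)', 'a = 1']
print(delete_stmt(prog, 1))     # ['a = 1', 'print(b)']
```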

Test Suite Improvement
N-Prog functions best for defect detection when the test suite for the subject program is extremely comprehensive and thorough. Use of N-Prog to detect defects in a program with a weak test suite has been shown to lead to a high false positive rate.

When faced with a weak test suite, N-Prog is best utilized as a tool for detecting weaknesses in the test suite. Consider an N-variant system produced by N-Prog. If it diverges on a particular input, one of two possibilities holds: either that input induces an actual bug, or it is a false positive (a variant diverges, but there was no defect in the original). In the second case, had the test suite contained a test for that input, N-Prog could not have generated a variant that diverges on it (the produced variant would not have been neutral). Therefore, that input should be added to the test suite to increase its effectiveness.
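The triage described above can be sketched as follows; the helper names (`original`, `variants`, `oracle`) are assumptions for illustration, where `oracle` stands in for whatever ground truth is available during in-house testing.

```python
# Sketch of divergence triage for a weakly tested program: a divergence
# is either a real defect or evidence of a missing test.

def triage_divergence(original, variants, inp, oracle):
    outputs = [v(inp) for v in variants]
    if len(set(outputs + [original(inp)])) == 1:
        return "no divergence"
    if original(inp) != oracle(inp):
        return "defect detected"         # the input exposes a real bug
    return "add input to test suite"     # false positive: strengthen tests
```

For example, with a correct original and a variant that mishandles negative inputs, a divergence on -5 would be triaged as a missing test rather than a defect.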

This insight can be combined with a fuzz tester during in-house testing to find high-quality additions to the test suite while simultaneously detecting defects in the program.