User:Pmanderson/Optimal classification

Optimal classification is an arrangement of attributes for a set of elements in an attribute-value system which minimizes the number of attribute queries necessary to identify a particular element within an element set.

The algorithm used for this purpose first sorts the elements in ascending order according to the value of each attribute as determined by its current position in the sequence of attributes. An empirical separatory value is then calculated for each characteristic in the target set and compared to the highest separatory value calculated for all preceding sets. This process repeats until an optimal sequence is found, all permutations have been evaluated, or a time limit expires.
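
The search over attribute sequences can be sketched roughly as follows. This is only an illustration under stated assumptions, not the original implementation: the function names `separated_pairs` and `best_order`, the rule that scores a sequence by summing the pairs separated after each successive query, and the toy data are all hypothetical, and the time limit is omitted.

```python
from collections import Counter
from itertools import permutations


def separated_pairs(elements, attrs):
    """Count pairs of elements told apart by the attributes in `attrs`."""
    groups = Counter(tuple(e[a] for a in attrs) for e in elements)
    g = len(elements)
    return (g * g - sum(n * n for n in groups.values())) // 2


def best_order(elements, n_attrs):
    """Try every attribute order and keep the best one (assumed scoring).

    An order scores higher when its prefixes separate more pairs,
    that is, when separation is reached after fewer queries.
    """
    def score(order):
        return sum(separated_pairs(elements, order[:k])
                   for k in range(1, n_attrs + 1))
    return max(permutations(range(n_attrs)), key=score)


# Hypothetical data: 4 elements, 3 binary attributes.
# Attribute 0 is identical for every element, so it should be asked last.
elements = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
order = best_order(elements, 3)
```

Exhaustive search grows factorially with the number of attributes, which is presumably why the description above allows the process to stop at a time limit.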

The number of elements that can be classified, as well as the speed with which they can be identified, depends on both the radix and the exponent of the data. A larger radix allows faster identification because each characteristic excludes a greater percentage of elements. A binary radix, for instance, excludes only fifty percent of the elements per characteristic, whereas a five-valued radix excludes eighty percent.
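
The percentages quoted above are consistent with an excluded fraction of 1 - 1/V for a radix V. The sketch below assumes that relation, and assumes the elements are spread evenly over the possible values; the function name `excluded_fraction` is hypothetical.

```python
from fractions import Fraction


def excluded_fraction(radix):
    """Fraction of elements ruled out by one answer, assuming the
    elements are spread evenly over the `radix` possible values."""
    return 1 - Fraction(1, radix)


for radix in (2, 3, 5):
    print(radix, float(excluded_fraction(radix)))
```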

What follows is a rigorous explanation of the algorithm.

Truth table-related equations

 * $$ G = V^C $$, where:


 * G is the group size or total number of elements in the group,
 * V is the highest value of logic in the group,
 * C is the highest number of characteristics in the group.


 * $$ R = V^K$$, where:


 * R is the set size or total number of elements in the set,
 * V is the highest value of logic in the group,
 * K is the highest number of characteristics in the set.
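
As a small numerical illustration of the two size equations, the following sketch computes G and R. The values chosen for V, C and K are hypothetical, as is the helper name `table_size`.

```python
def table_size(values, characteristics):
    """Number of distinct rows in a truth table: values ** characteristics."""
    return values ** characteristics


V = 3  # hypothetical: three logic values per characteristic
C = 4  # hypothetical: four characteristics in the group
K = 2  # hypothetical: two characteristics in the set

G = table_size(V, C)  # group size
R = table_size(V, K)  # set size
```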

Maximum number of pairs of elements to separate
The maximum number of pairs of elements to separate refers to a matrix in which each element is compared with every other element to determine the number of pairs that are separable or disjoint. A pair is separable or disjoint whenever the logic values of its two elements differ. In theory, therefore, the maximum possible number of pairs that can be separated is given by the following equation:
 * $$ p_\max = \frac{\left[{G (G-1)}\right]}{2} $$, where:


 * $$ p_\max $$ is the maximum number of pairs to separate, and
 * G is the total number of elements in the group.
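
The pair count can be checked against an explicit enumeration for a small group. In this sketch the function name `p_max` mirrors the symbol in the equation, and the group size is a hypothetical value.

```python
from itertools import combinations


def p_max(group_size):
    """Maximum number of pairs to separate: G (G - 1) / 2."""
    return group_size * (group_size - 1) // 2


# Cross-check against listing every unordered pair of a small group.
G = 7  # hypothetical group size
pairs = list(combinations(range(G), 2))
```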

Order of elements
The elements are arranged in descending order according to their truth table value, i.e., the value calculated as the sum of each characteristic's logic state value times the highest value of logic raised to the power of the order of the characteristic.


 * $$ e_i = \left[\sum_{j=0}^C \left[v_{i,j} V^{(C-j)}\right]\right]$$, where:


 * $$ e_i $$ is the element truth table value in the group,
 * C is the highest number of characteristics in the group,
 * V is the highest value of logic in the group,
 * v is the value of logic of each characteristic in the group,
 * j is the jth characteristic index, where:


 * j = 0..K and where:
 * K is the number of characteristics in the set,


 * i is the ith element index, where:


 * i = 0..G and where:
 * G is the number of elements in the group.
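
One way to read the truth table value is as a number written in base V, with the first characteristic as the most significant digit. The sketch below takes that reading. It assumes logic values are encoded as integers from 0 to V - 1 and that the exponents run from C - 1 down to 0, which corresponds to numbering the characteristics 1..C; the index bounds in the equation above are written differently, so this indexing is an assumption. The function name and the sample rows are hypothetical.

```python
def truth_table_value(values, radix):
    """Read a row of logic values as one base-`radix` number.

    The first characteristic is the most significant digit.
    """
    total = 0
    for v in values:
        if not 0 <= v < radix:
            raise ValueError("logic value outside 0..radix-1")
        total = total * radix + v
    return total


# Hypothetical group: three characteristics, three logic values each.
group = [(0, 2, 1), (2, 0, 0), (1, 1, 2)]
ordered = sorted(group, key=lambda row: truth_table_value(row, 3), reverse=True)
```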

The general identification equation

 * $$ S_j = \frac{1-{V^{-j}}}{1-V^{-C}}$$, where:


 * $$ S_j $$ is the theoretical separatory value per jth characteristic,
 * C is the highest number of characteristics in the group,
 * V is the highest value of logic in the group and
 * j is the jth characteristic index in the set, where:


 * j = 0..K and where:
 * K is the number of characteristics in the target set.
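
The theoretical separatory value can be tabulated directly from the equation. One reading, offered here as an interpretation rather than a statement from the source, is that it gives the fraction of all pairs in a complete truth table that are separated after j characteristics: it is 0 before any query and 1 once all C characteristics have been used. The function name and the values of V and C below are hypothetical.

```python
def theoretical_separation(j, radix, n_characteristics):
    """S_j = (1 - V**-j) / (1 - V**-C)."""
    return (1 - radix ** -j) / (1 - radix ** -n_characteristics)


V, C = 2, 4  # hypothetical: binary logic, four characteristics
curve = [theoretical_separation(j, V, C) for j in range(C + 1)]
```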

Minimal number of characteristics to result in theoretical separation

 * $$ t_\min = \frac{\log G}{\log V}$$, where:


 * $$ t_\min $$ is the minimal number of characteristics to result in theoretical separation,
 * G is the number of elements in the group and
 * V is the highest value of logic in the group.
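
The equation can be evaluated as written, although the result is generally not a whole number. The sketch below also includes a rounded-up variant computed with integer arithmetic, since a query cannot be fractional; that variant and both function names are assumptions added for illustration.

```python
import math


def min_characteristics(group_size, radix):
    """t_min = log G / log V (not rounded)."""
    return math.log(group_size) / math.log(radix)


def min_whole_characteristics(group_size, radix):
    """Smallest integer t with radix ** t >= group_size.

    Integer arithmetic avoids floating-point rounding in the logarithm.
    """
    t, capacity = 0, 1
    while capacity < group_size:
        capacity *= radix
        t += 1
    return t
```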

Empirical separation
(Please note that the application of these equations cannot be expressed entirely in conventional mathematical notation or normal MathCAD expressions without including a MathCAD function or MathCAD user program. The equations have been fully implemented in the Zbasic and Visual Basic programming languages.)


 * $$ S_j = \frac{\left[(G^{2})-\sum_{i=0}^{R} n_i^{2}\right]}{2}$$, where:


 * $$ S_j $$ is the empirical separatory value,
 * j is the jth characteristic index,
 * G is the number of elements in the group,
 * R is the number of elements in the set, where:


 * $$ R = V^K $$, where:


 * V is the highest value of logic in the group and
 * K is the number of characteristics in the set.


 * i is the truth table index value of the set, where:


 * i = 0..R and


 * $$ n_i = \sum_{j=0}^K v_{i,j} V^{(K-j)}$$, where:


 * $$ n_i$$ is the truth table value of the ith element in the set
 * v is the logic value of the jth characteristic.
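
A minimal sketch of the empirical separatory value follows. It rests on an assumption that differs from the definition given above: here $$ n_i $$ is taken to be the number of elements in the group whose truth table value over the first K characteristics equals i. Under that reading the expression counts the pairs that have been separated, and it equals $$ G(G-1)/2 $$ once every element is distinct. The function names and the sample group are hypothetical.

```python
from collections import Counter


def truth_table_value(values, radix):
    """Read a row of logic values as one base-`radix` number."""
    total = 0
    for v in values:
        total = total * radix + v
    return total


def empirical_separation(group, k, radix):
    """(G**2 - sum of n_i**2) / 2 over the first `k` characteristics.

    Assumption: n_i is the number of elements whose truth table value is i.
    """
    counts = Counter(truth_table_value(row[:k], radix) for row in group)
    g = len(group)
    return (g * g - sum(n * n for n in counts.values())) // 2


# Hypothetical group of four elements with three binary characteristics.
group = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0)]
values = [empirical_separation(group, k, 2) for k in range(4)]
```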

Flag recognition: application example
(A flag identification example taken from the Neural Network Identification example.)

Although the following example lacks an intuitive sense of optimization, it serves as a good example of how optimization can reduce the number of queries. While the method of optimal classification is highly beneficial for reducing the number of queries required for manual identification, automated identification may be better served by a neural network.

Flag overlay grid
Designated areas (characteristics) for sampling background colors (states) of all flags (elements).

The overlay is used to determine the color of each area for each flag and the color is recorded in the table as the logical state of the area. The table data is then submitted to the optimization program and processed until an optimal empirical separatory value is obtained.

Systematic query
Starting with area "A", the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of the Netherlands. The answer to the first query, for area "A", is RED, which would remove 2/3 of the flags from further consideration. The answer to the next query, for area "B", would also be RED, which would eliminate none of the remaining flags. In fact, since the colors in columns "D", "E" and "F" are the same for each remaining flag, we would not be able to eliminate any remaining flags until column "G", where the color BLUE would provide a unique answer to the final necessary query. Here all remaining flags except the flag of the Netherlands would be eliminated from further consideration. It would therefore take a minimum of seven queries using the systematic query method to establish that the flag in our possession is that of the Netherlands.

Minimized query
The results of optimization are shown above and include a listing of the theoretical and empirical percentages. The original characteristic sequence is indexed in the bottom row. Starting with area "G", the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of Ireland. The answer to this first query would be GREEN. The next query is for the color in area "F", to which we would answer ORANGE. Since no other flag has this combination of GREEN and ORANGE in these areas, our query can end here. The minimized query algorithm has optimized the order of characteristics and minimized the number of queries required to identify the flag. (Please note that there may be more than one optimal solution.)
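
The difference between the two query strategies can be sketched by counting how many answers are needed before a single candidate remains. The table below is a hypothetical toy example, not the flag data from the article, and the function name `queries_needed` is likewise an assumption.

```python
def queries_needed(table, target, order):
    """Count the queries asked, in `order`, until only `target` remains."""
    remaining = set(table)
    for asked, area in enumerate(order, start=1):
        answer = table[target][area]
        remaining = {name for name in remaining if table[name][area] == answer}
        if remaining == {target}:
            return asked
    return None  # the areas cannot single out the target


# Hypothetical toy table (NOT the flag data from the article):
# each row lists the colors sampled in areas 0, 1 and 2.
table = {
    "flag_x": ("red", "red", "blue"),
    "flag_y": ("red", "red", "white"),
    "flag_z": ("red", "red", "green"),
}
systematic = queries_needed(table, "flag_x", order=(0, 1, 2))
minimized = queries_needed(table, "flag_x", order=(2, 1, 0))
```

In this toy table the first two areas are identical for every row, so asking about the last area first identifies the target in one query instead of three.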

Primary reference
R.J. Pankhurst (ed.), British Museum (Natural History), London, England. Biological Identification with Computers. Proceedings of a meeting held at King's College, Cambridge, 27 and 28 September 1973. Systematics Association Special Volume Number 7. Academic Press, 1975. ISBN 0125448503. Notes the work of Eugene W. Rypka, Dept. of Microbiology, Lovelace Center for Health Sciences, Albuquerque, New Mexico, "Pattern Recognition and Microbial Identification."

Specific applications

 * Artificial neural networks.
 * Compute minimum number of hidden nodes.
 * Minimize training time.
 * Diagnostic troubleshooting charts.
 * Dynamic classification of publications.
 * Triage
 * Continuous integrated triage