User:Wz6231/Draft of Fast Correlation-Based Filter

In machine learning, the Fast Correlation-Based Filter (FCBF) is a feature selection method, based on the concept of predominant correlation, that identifies the features relevant to a class while eliminating redundant ones. A feature is an attribute that describes an object: if an object is viewed as a p-dimensional point, it has p features, and the goal of feature selection is to select the features that best characterize a class of objects. As collected data sets grow larger day by day, feature selection methods come under increasing pressure to trim high-dimensional data in order to reduce computational expense and improve accuracy. FCBF follows the filter model rather than the wrapper model, since the latter is computationally expensive, and it can effectively remove both irrelevant and redundant features from high-dimensional data in less than quadratic time.

Related methodologies
Symmetrical uncertainty (SU) is applied to evaluate the usefulness of features within feature subsets. As an entropy-based measure, symmetrical uncertainty can capture correlations between features that are not linear; it also compensates for information gain's bias toward attributes with more values and normalizes its values to a common scale.
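Following its standard definition (the one used in the original FCBF paper), SU between two discrete variables X and Y can be written as

    SU(X, Y) = 2 \left[ \frac{IG(X \mid Y)}{H(X) + H(Y)} \right],
    \qquad IG(X \mid Y) = H(X) - H(X \mid Y),
    \qquad H(X) = -\sum_i P(x_i) \log_2 P(x_i),

where H denotes entropy and IG information gain. SU is normalized to the range [0, 1]: a value of 1 indicates that the value of either variable completely predicts the other, and a value of 0 indicates that the two variables are independent.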

Predominant correlation: the correlation between a feature Fi (Fi ∈ S') and the class C is predominant iff SUi,c ≥ β and there exists no Fj ∈ S' (j ≠ i) such that SUj,i ≥ SUi,c. If such an Fj exists for a feature Fi, we call Fj a redundant peer of Fi.

Predominant feature: a feature is predominant to the class iff its correlation to the class is predominant, or can become predominant after its redundant peers are removed.

Algorithms
The Fast Correlation-Based Filter algorithm uses symmetrical uncertainty as its usefulness measure. It first removes irrelevant features from the feature set, then eliminates redundant features among the relevant ones produced by the first step. The key to the first step is what is called C-correlation, denoted SUi,c, which measures the correlation between a feature Fi and the class C in a data set S containing N features and a class C. The subset S' of relevant features is determined by a user-defined SU threshold β: Fi ∈ S' iff SUi,c ≥ β, 1 ≤ i ≤ N. The crucial point of the second step is another correlation, called F-correlation: the correlation between features Fi and Fj in the subset S', denoted SUi,j. If the SUi,j of two features in S' exceeds a threshold, one of them may be removed from S'. Taking SUj,c as that threshold, Fj is removed from S' whenever SUj,c ≤ SUi,c and SUj,c ≤ SUi,j. The rationale is that when this condition holds, Fi is correlated with both the class C and Fj, and Fi is more correlated with the class than Fj is; keeping Fi and removing Fj therefore reduces redundancy while preserving more information for predicting the class. In this case the correlation between Fi and the class C is called the predominant correlation. Pseudocode for this basic version follows:

input:  S(F1, F2, …, FN, C)   // a training data set containing N features and class labels C
        β                     // a predefined threshold
output: Sbest                 // an optimal subset

for i = 1 … N
    compute SUi,c between Fi and C;
    if (SUi,c ≥ β)
        append Fi to S'list;
    end if
end for
order S'list in descending SUi,c value;
// the feature with the largest SUi,c value is always a predominant feature
// and can be a starting point for deleting other features
Fp ← the first element of S'list;
while Fp is not NULL do
    Fq ← the element of S'list after Fp;
    if (Fq is not NULL)
        while Fq is not NULL do
            F'q ← Fq;
            if (SUp,q ≥ SUq,c)
                delete Fq from S'list;
                Fq ← the element of S'list after F'q;
            else
                Fq ← the element of S'list after Fq;
            end if
        end while
    end if
    Fp ← the element of S'list after Fp;
end while
Sbest ← S'list;
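For concreteness, the following is a minimal Python sketch of the procedure above, assuming discrete-valued features; the helper names (entropy, conditional_entropy, symmetrical_uncertainty, fcbf) are illustrative and not from the original paper:

import numpy as np
from collections import Counter

def entropy(x):
    """Shannon entropy (base 2) of a discrete sequence."""
    counts = np.array(list(Counter(x).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """Conditional entropy H(X | Y) of discrete sequences x and y."""
    x, y = np.asarray(x), np.asarray(y)
    h = 0.0
    for v in np.unique(y):
        mask = (y == v)
        h += mask.mean() * entropy(x[mask])   # P(Y = v) * H(X | Y = v)
    return h

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0.0:
        return 0.0                            # both variables are constant
    ig = hx - conditional_entropy(x, y)
    return 2.0 * ig / (hx + hy)

def fcbf(X, c, beta):
    """Fast Correlation-Based Filter.
    X: 2-D array of discrete features (rows = instances),
    c: 1-D array of class labels, beta: relevance threshold.
    Returns the indices of the selected features."""
    X = np.asarray(X)
    n = X.shape[1]
    # Step 1: keep features whose C-correlation reaches the threshold.
    su_c = [symmetrical_uncertainty(X[:, i], c) for i in range(n)]
    s_list = [i for i in range(n) if su_c[i] >= beta]
    s_list.sort(key=lambda i: su_c[i], reverse=True)   # descending SUi,c
    # Step 2: each predominant feature removes its redundant peers.
    selected = []
    while s_list:
        p = s_list.pop(0)
        selected.append(p)
        s_list = [q for q in s_list
                  if symmetrical_uncertainty(X[:, p], X[:, q]) < su_c[q]]
    return selected

Here the inner while-loop of the pseudocode is expressed as a list filter: after each predominant feature Fp is accepted, every remaining Fq with SUp,q ≥ SUq,c is dropped in one pass, which is equivalent to the element-by-element deletion above. For example, fcbf(X, y, beta=0.1) on a discretized data set returns the column indices of the selected features in descending order of their SU with the class.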

Example
The algorithm above is executed on a small example below:

input:  S(F1, F2, F3, F4, F5, F6, F7, C)   // a training data set containing 7 features and class labels C
        β                                  // a predefined threshold
output: Sbest                              // an optimal subset

Suppose that after computing the C-correlations we get SU1,c > SU2,c > SU3,c > SU4,c > SU5,c > β > SU6,c > SU7,c.

So F6 and F7 are discarded as irrelevant, giving S'list = {F1, F2, F3, F4, F5}.

① First iteration: Fp = F1, and Fq takes the values F2, F3, F4 and F5 in turn as the inner loop runs.

Suppose the calculations give SU1,2 > SU2,c, SU1,3 > SU3,c, SU1,4 < SU4,c and SU1,5 < SU5,c.

So F2 and F3 are deleted from S'list, and now S'list = {F1, F4, F5}. The first iteration terminates.

② Second iteration: Fp = F4 and Fq = F5 during the inner loop.

Suppose the calculation gives SU4,5 < SU5,c.

So F5 is kept in S'list, and S'list = {F1, F4, F5} is unchanged. The second iteration terminates.

③ Third iteration: Fp = F5 and Fq = ∅. Since Fq = ∅, the inner loop terminates immediately; moreover, F5 is the last element of S'list, so Fp becomes ∅ next and the whole iteration comes to an end.

In the end, the output is Sbest = {F1, F4, F5}: F6 and F7 were deleted for irrelevance, and F2 and F3 were removed as redundant features.
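To make the walk-through concrete, here is a short Python sketch of the redundancy-elimination phase using hypothetical SU values chosen only to satisfy the inequalities above (the numbers themselves are invented for illustration):

# Hypothetical SU values consistent with the inequalities in the example.
su_c = {1: 0.80, 2: 0.70, 3: 0.60, 4: 0.50, 5: 0.40}    # SU(Fi, C); all exceed beta
su_f = {(1, 2): 0.75, (1, 3): 0.65, (1, 4): 0.45,
        (1, 5): 0.35, (4, 5): 0.30}                     # SU(Fp, Fq) pairs tested by the loop

s_list = sorted(su_c, key=su_c.get, reverse=True)       # [1, 2, 3, 4, 5]
selected = []
while s_list:
    p = s_list.pop(0)                                   # next predominant feature
    selected.append(p)
    # Drop every remaining q whose F-correlation with p dominates its C-correlation.
    s_list = [q for q in s_list if su_f.get((p, q), 0.0) < su_c[q]]

print(selected)                                         # [1, 4, 5], i.e. Sbest = {F1, F4, F5}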

Running time
Identifying relevant features takes time linear in the number of features N. When eliminating redundant features, each predominant feature typically removes a batch of its redundant peers in a single pass through the remaining list, so the redundancy analysis takes approximately O(N log N) time in terms of N. Therefore, for a data set with M instances, the total time complexity is O(MN log N).

Application
Feature selection methods are used extensively in many applications, such as DNA microarray analysis, text processing, gene expression studies and combinatorial chemistry, to name a few.

Implementation
Implementations are available in several forms, including Java libraries, personal implementations and online software.

Related algorithms
Relief-F relies on relevance evaluation but cannot identify redundant features; Correlation-based Feature Selection (CFS) does not scale well to high-dimensional data; other pertinent algorithms include consistency-based feature selection, methods based on the MDL principle, and the like.