User:Potok~enwiki

tf-icf

The tf–icf weight (term frequency–inverse corpus frequency) is a variant of tf-idf. In tf-idf, the inverse document frequency (idf) of a term is derived from the number of documents in a set that contain the term, typically as the logarithm of the total number of documents divided by that count. For example, given 100 documents of which 37 contain the term "apple", the idf of "apple" is log(100/37), or about 0.99 using the natural logarithm. A drawback of this approach is that whenever a new document is added to the set, the idf value of every term must be recalculated, because the total document count changes. This can be computationally expensive for large document sets. The tf-icf approach instead determines the inverse corpus frequency from a large corpus of existing documents, rather than from the document set itself. For example, rather than calculating the inverse frequency of "apple" over the 100 documents, it is calculated over a corpus of millions of documents, so the weight of a term in a new document can be computed without revisiting the other documents in the set.
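The difference between the two weightings can be illustrated with a minimal Python sketch. It assumes the common logarithmic form log(N / n) for both the inverse document frequency and the inverse corpus frequency; the exact formula used for tf-icf may differ. The function names and the reference-corpus counts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def idf_from_documents(documents):
    """Inverse document frequency computed from the document set itself.

    Each document is a list of terms. Assumes idf = log(N / n), where N is
    the number of documents and n the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term at most once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def icf_from_corpus(corpus_doc_freq, corpus_size):
    """Inverse corpus frequency from precomputed reference-corpus counts.

    corpus_doc_freq maps a term to the number of corpus documents containing
    it (hypothetical counts here); the same log(N / n) form is assumed.
    """
    return {term: math.log(corpus_size / df) for term, df in corpus_doc_freq.items()}


def weight(doc, inverse_freq, default=0.0):
    """Multiply each term's raw count in doc by its inverse frequency."""
    tf = Counter(doc)
    return {term: count * inverse_freq.get(term, default) for term, count in tf.items()}


docs = [["apple", "pie"], ["apple", "tree"], ["pear", "tree"]]

# tf-idf: adding a document changes the idf of terms it does not even contain.
idf_before = idf_from_documents(docs)
idf_after = idf_from_documents(docs + [["pear"]])
print(idf_before["apple"], idf_after["apple"])

# tf-icf: the table is built once from the reference corpus and reused.
icf = icf_from_corpus({"apple": 37_000, "pie": 5_000, "pear": 9_000}, 1_000_000)
print(weight(["apple", "apple", "pie"], icf))
```

In this sketch the idf of "apple" shifts as soon as a fourth document is added, even though that document does not mention "apple", whereas the icf table depends only on the reference corpus and is unaffected by changes to the document set.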