User:Grigori sidorov/sandbox

Soft Cosine Measure is a measure of “soft” similarity between two vectors, i.e., a measure that takes into account the similarity of pairs of features. The traditional cosine similarity treats the features of the Vector Space Model (VSM) as independent or completely different, while the soft cosine measure takes their similarity into account, generalizing both the concept of the cosine measure and the idea of similarity itself (soft similarity).

For example, in the field of Natural Language Processing (NLP) the similarity between features is quite intuitive. Features such as words, n-grams, or syntactic n-grams can be quite similar, though formally they are treated as different features in the VSM. For example, the words “play” and “game” are different words and are thus mapped to different dimensions in the VSM; yet it is obvious that they are related semantically. In the case of n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well).

For the calculation of the soft cosine measure, a matrix $$s$$ of similarity between features is introduced. It can be computed using Levenshtein distance or other similarity measures, e.g., various WordNet similarity measures. The vectors are then multiplied through this matrix.
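As a minimal sketch (plain Python, no external libraries), the matrix $$s$$ can be built from a normalized Levenshtein similarity between word features; the function names and the particular normalization $$s_{ij} = 1 - d(f_i, f_j)/\max(|f_i|, |f_j|)$$ are illustrative choices, not prescribed by the measure itself.

```python
def levenshtein(u, v):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, 1):
        curr = [i]
        for j, cv in enumerate(v, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cu != cv)))   # substitution
        prev = curr
    return prev[-1]

def similarity_matrix(features):
    # s_ij = 1 - d(f_i, f_j) / max(|f_i|, |f_j|), so s_ii = 1
    # and 0 <= s_ij <= 1 for all pairs of features.
    n = len(features)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            m = max(len(features[i]), len(features[j])) or 1
            s[i][j] = 1.0 - levenshtein(features[i], features[j]) / m
    return s
```

Any other pairwise similarity (e.g., a WordNet-based one) can be substituted for the body of `similarity_matrix` as long as it keeps $$s_{ii}=1$$.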

Given two N-dimensional vectors $$a$$ and $$b$$, the soft cosine similarity is calculated as follows:

$$\begin{align} soft\_cosine_1(a,b)= \frac{\sum\nolimits_{i,j}^{N} s_{ij}a_ib_j}{\sqrt{\sum\nolimits_{i,j}^{N} s_{ij}a_ia_j}\sqrt{\sum\nolimits_{i,j}^{N} s_{ij}b_ib_j}}, \end{align} $$

where $$s_{ij}=similarity(feature_i,feature_j).$$

If there is no similarity between features ($$s_{ii}=1$$, $$s_{ij}=0$$ for $$i\ne j$$), the given equation is equivalent to the conventional cosine similarity formula.
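The formula above translates directly into a short Python sketch (the helper name `soft_cosine` is illustrative); with the identity matrix for $$s$$ it reduces to the ordinary cosine similarity, as noted above.

```python
import math

def soft_cosine(a, b, s):
    # soft_cosine_1(a, b) = sum_ij s_ij a_i b_j /
    #   (sqrt(sum_ij s_ij a_i a_j) * sqrt(sum_ij s_ij b_i b_j))
    n = len(a)
    def form(x, y):
        # Bilinear form induced by the feature-similarity matrix s.
        return sum(s[i][j] * x[i] * y[j] for i in range(n) for j in range(n))
    return form(a, b) / (math.sqrt(form(a, a)) * math.sqrt(form(b, b)))
```

For example, with the identity matrix the orthogonal vectors (1, 0) and (0, 1) get similarity 0, exactly as under the conventional cosine measure; a nonzero off-diagonal entry $$s_{01}$$ would make that similarity positive.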

The time complexity of this measure is quadratic in the number of features, which still makes it applicable to real-world tasks. The complexity can even be reduced to linear.