User:SergioJimenez/sandbox

The soft cardinality, like the classic set cardinality, is a measure of "the number of elements of a set", but it takes into account the similarities among those elements. The idea is to make a "soft" count in which elements that are similar to others contribute less to the soft cardinality than clearly distinct elements. Thus, the soft cardinality of a set A whose elements are nearly identical should be lower than the soft cardinality of a set B that has the same number of elements but whose elements are mostly different. For instance, the set A={horse, donkey, zebra} should have a lower soft cardinality than the set B={horse, bee, snake}.

While the classic set cardinality returns a natural number ($$\mathbb{N}^0$$), the soft cardinality returns a real number that measures the amount of "different" elements in a set. The measure is "soft" in comparison with the classic set cardinality: a set A of n only slightly different elements has cardinality n, whereas its soft cardinality is greater than, but close to, 1.

The soft cardinality requires a notion of similarity among the elements of the set. This notion is usually provided by a binary inter-element similarity function that returns scores in the interval [0,1], where 1 is the maximum similarity and 0 the minimum. Clearly, if that function returns 1 only for identical elements and 0 otherwise, the classic and soft cardinalities become equivalent.

The soft cardinality of a set A is usually denoted $$|A|^{'}$$, with a vertical bar on each side and an apostrophe to differentiate it from the classic cardinality. A subscript naming the inter-element similarity is added to distinguish soft cardinalities that use different similarities, e.g. $$|A|_{sim}^{'}$$.

Definition
The soft cardinality of a set is the cardinality of the union of its elements treated (themselves) as sets. Thus, for a set $$A=\left\{ a_{1},a_{2},\ldots,a_{n}\right\}$$, the soft cardinality of $$A$$ is:

$$|A|^{'}=\left|{\textstyle \bigcup_{i=1}^{\left|A\right|}}a_{i}\right|$$
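The set-based definition can be illustrated with a minimal sketch in which each element is represented by a set of sub-elements. The choice of character bigrams as sub-elements, and the helper names `bigrams` and `soft_cardinality_by_union`, are assumptions made only for this illustration.

```python
def bigrams(word):
    """Hypothetical sub-element representation: the set of character bigrams."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def soft_cardinality_by_union(elements, to_subelements):
    """Cardinality of the union of the elements, each treated as a set."""
    union = set()
    for element in elements:
        union |= to_subelements(element)
    return len(union)


A = ["horse", "house"]
# "horse" and "house" share the bigrams "ho" and "se", so the union is
# smaller than the sum of the two individual bigram counts.
union_size = soft_cardinality_by_union(A, bigrams)
separate_sizes = sum(len(bigrams(word)) for word in A)
print(union_size, separate_sizes)
```

Shared sub-elements are counted once, which is how similar elements end up contributing less than distinct ones.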

This definition implies that the elements of $$A$$ can be divided into sub-elements, which in most cases is not possible. Alternatively, a pairwise similarity function $$sim(a_{i},a_{j})$$ that compares two elements can be used to approximate the soft cardinality by the following expression:

$$|A|_{sim}^{'}\simeq\sum_{i}^{n}\left[\sum_{j}^{n}sim\left(a_{i},a_{j}\right)^{p}\right]^{-1}$$ ; $$p>0$$

The similarity function sim should return values in the interval [0,1] and satisfy at least $$sim(x,x)=1$$ for any x. The parameter p controls the "softness" of the cardinality: if $$p\rightarrow\infty$$ then $$\left|A\right|'\approx\left|A\right|$$ (provided distinct elements have similarity below 1), and if $$p\rightarrow0$$ then $$\left|A\right|'\approx1$$ (provided all similarities are positive). For most applications $$p=1$$ works well; otherwise p can be tuned to optimize performance on the task at hand. This approach also fulfills the requirement that when sim(x,y) is a crisp function, which returns 1 only for identical pairs and 0 otherwise, the soft cardinality equals the classic set cardinality.
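The approximation and the role of p can be sketched in a few lines. The function names `soft_cardinality`, `crisp` and `half` are hypothetical, and `half` is an artificial similarity chosen only to make the limits easy to see.

```python
def soft_cardinality(elements, sim, p=1.0):
    """Approximate soft cardinality: sum_i 1 / sum_j sim(a_i, a_j)**p."""
    items = list(elements)
    # sim(x, x) == 1 keeps every inner sum at 1 or more, so no division by zero.
    return sum(
        1.0 / sum(sim(a_i, a_j) ** p for a_j in items)
        for a_i in items
    )


def crisp(x, y):
    """Returns 1 only for identical pairs and 0 otherwise."""
    return 1.0 if x == y else 0.0


def half(x, y):
    """Hypothetical similarity: every pair of distinct elements scores 0.5."""
    return 1.0 if x == y else 0.5


B = ["horse", "bee", "snake"]
print(soft_cardinality(B, crisp))          # crisp similarity: classic cardinality
print(soft_cardinality(B, half, p=1))      # between 1 and 3
print(soft_cardinality(B, half, p=50))     # large p: close to 3
print(soft_cardinality(B, half, p=0.001))  # small p: close to 1
```

The cost is quadratic in the number of elements, since every pair is compared once per row.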

Example
Let A={horse, zebra, donkey} be a set.

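Let us assume the following pairwise similarities among the elements of A, where entry (i, j) of the matrix is the similarity between the i-th and j-th elements, in the order in which they enter the sum below:

$$\begin{pmatrix}1.00 & 0.77 & 0.83\\ 0.77 & 1.00 & 0.95\\ 0.83 & 0.95 & 1.00\end{pmatrix}$$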

The soft cardinality of A using p=1 is:

$$\left|A\right|^{'}\simeq\frac{1}{1.00+0.77+0.83}+\frac{1}{0.77+1.00+0.95}+\frac{1}{0.83+0.95+1.00}\approx1.11$$

This result can be interpreted as saying that set A has basically 1 element plus 0.11 additional elements. Thus, the soft cardinality can also be interpreted as the amount of non-redundant information in a set, measured in "element" units.
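The arithmetic of this example can be checked numerically. The sketch below takes the three rows of similarities exactly as they appear in the denominators of the sum; the names `S` and `soft_cardinality_from_matrix` are hypothetical.

```python
# Rows of pairwise similarities, copied from the denominators of the sum above.
S = [
    [1.00, 0.77, 0.83],
    [0.77, 1.00, 0.95],
    [0.83, 0.95, 1.00],
]


def soft_cardinality_from_matrix(matrix, p=1.0):
    """Sum, over the rows, of the reciprocal of the row sum of sim**p."""
    return sum(1.0 / sum(s ** p for s in row) for row in matrix)


value = soft_cardinality_from_matrix(S, p=1.0)
print(f"{value:.3f}")
```

Working from a precomputed matrix avoids re-evaluating the similarity function when several values of p are tried.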

Comparing Sets
Given the set-based definition of the soft cardinality, two sets $$A=\left\{ a_{1},a_{2},\ldots,a_{n}\right\}$$ and $$B=\left\{ b_{1},b_{2},\ldots,b_{m}\right\}$$ can be compared through the soft cardinalities of their union and intersection:

$$\left|A\cup B\right|^{'}=\left|\left({\textstyle \bigcup_{i=1}^{n}}a_{i}\right)\cup\left({\textstyle \bigcup_{i=1}^{m}}b_{i}\right)\right|$$

$$\left|A\cap B\right|^{'}=\left|A\right|^{'}+\left|B\right|^{'}-\left|A\cup B\right|^{'}$$
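A minimal sketch of these two quantities follows. It assumes that the union is evaluated by applying the pairwise-similarity approximation to the ordinary set union of A and B, which is one plausible reading rather than something stated above. All function names, and the toy similarity `same_initial`, are hypothetical.

```python
def soft_cardinality(elements, sim, p=1.0):
    """Approximate soft cardinality: sum_i 1 / sum_j sim(a_i, a_j)**p."""
    items = list(elements)
    return sum(
        1.0 / sum(sim(a_i, a_j) ** p for a_j in items)
        for a_i in items
    )


def soft_union(a, b, sim, p=1.0):
    # Assumption: approximate |A u B|' on the plain set union of the elements.
    return soft_cardinality(set(a) | set(b), sim, p)


def soft_intersection(a, b, sim, p=1.0):
    # |A n B|' = |A|' + |B|' - |A u B|'
    return (soft_cardinality(a, sim, p) + soft_cardinality(b, sim, p)
            - soft_union(a, b, sim, p))


def same_initial(x, y):
    """Toy similarity: 1 if identical, 0.5 if same first letter, else 0."""
    if x == y:
        return 1.0
    return 0.5 if x[:1] == y[:1] else 0.0


A = {"horse", "hare"}
B = {"horse", "bee"}
print(soft_union(A, B, same_initial))
print(soft_intersection(A, B, same_initial))
```

Elements must be hashable here because the union is built with Python sets.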

Weighted Soft Cardinality
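When each element $$a_{i}$$ carries a weight $$w(a_{i})$$, its contribution to the soft cardinality is scaled by that weight. With $$w(a_{i})=1$$ for all i, the expression reduces to the unweighted approximation:

$$|A|_{sim,w}^{'}\simeq\sum_{i}^{n}\left[w(a_{i})\left[\sum_{j}^{n}sim(a_{i},a_{j})^{p}\right]^{-1}\right]$$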
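The weighted expression can be sketched as follows. The function names, the toy similarity `same_initial` and the weights in `toy_weights` are hypothetical values chosen only for illustration.

```python
def weighted_soft_cardinality(elements, sim, weight, p=1.0):
    """sum_i weight(a_i) / sum_j sim(a_i, a_j)**p"""
    items = list(elements)
    return sum(
        weight(a_i) / sum(sim(a_i, a_j) ** p for a_j in items)
        for a_i in items
    )


def same_initial(x, y):
    """Toy similarity: 1 if identical, 0.5 if same first letter, else 0."""
    if x == y:
        return 1.0
    return 0.5 if x[:1] == y[:1] else 0.0


# Hypothetical element weights.
toy_weights = {"horse": 2.0, "hare": 1.0, "bee": 3.0}

A = ["horse", "hare", "bee"]
weighted = weighted_soft_cardinality(A, same_initial, toy_weights.get)
unweighted = weighted_soft_cardinality(A, same_initial, lambda _: 1.0)
print(weighted, unweighted)
```

With a crisp similarity the result is simply the sum of the element weights.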

Hierarchical Set Comparison
The soft cardinality is a measure used for building similarity functions. In turn, the soft cardinality is based on an inter-element similarity function. This recursive relation can be exploited to build similarity functions for objects with a compositional hierarchical structure, e.g. documents are composed of paragraphs, paragraphs of sentences, and so on. Both similarity scores and element weights propagate from the lowest recursion levels to the top.

Example
Let us assume a word-to-word similarity function such as the Jaro-Winkler measure, a term (word) weighting function such as the inverse document frequency, and the splitter and tokenizer functions required for dividing documents into paragraphs, paragraphs into sentences, and so on. Then the soft cardinality, in combination with a cardinality-based resemblance coefficient such as Dice's index, can be used to build a document similarity function as follows (in this example the subscripts w, s, p and d on the similarity and weighting functions stand for word, sentence, paragraph and document, respectively):

The similarity function between two words $$w_{a}$$ and $$w_{b}$$ is:

$$sim_{w}(w_{a},w_{b})=JaroWinkler(w_{a},w_{b})$$

The weight for a word $$w$$ is:

$$weight_{w}(w)=idf(w)$$

The similarity function between two sentences $$s_{a}$$ and $$s_{b}$$ is:

$$sim_{s}(s_{a},s_{b})=\frac{2\left|s_{a}\cap s_{b}\right|_{sim_{w},weight_{w}}^{'}}{\left|s_{a}\right|_{sim_{w},weight_{w}}^{'}+\left|s_{b}\right|_{sim_{w},weight_{w}}^{'}}$$

The weight of a sentence $$s$$ comes from aggregating the weights of its words with the soft cardinality:

$$weight_{s}(s)=\left|s\right|_{sim_{w},weight_{w}}^{'}$$

The similarity function between two paragraphs $$p_{a}$$ and $$p_{b}$$ is:

$$sim_{p}(p_{a},p_{b})=\frac{2\left|p_{a}\cap p_{b}\right|_{sim_{s},weight_{s}}^{'}}{\left|p_{a}\right|_{sim_{s},weight_{s}}^{'}+\left|p_{b}\right|_{sim_{s},weight_{s}}^{'}}$$

As with sentence weights, paragraph weights come from the soft cardinality at the sentence level:

$$weight_{p}(p)=\left|p\right|_{sim_{s},weight_{s}}^{'}$$

Finally, the similarity between two documents $$d_{a}$$ and $$d_{b}$$ is:

$$sim_{d}(d_{a},d_{b})=\frac{2\left|d_{a}\cap d_{b}\right|_{sim_{p},weight_{p}}^{'}}{\left|d_{a}\right|_{sim_{p},weight_{p}}^{'}+\left|d_{b}\right|_{sim_{p},weight_{p}}^{'}}$$
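The recursion above can be sketched as follows, under several loudly stated assumptions. `difflib.SequenceMatcher.ratio` stands in for Jaro-Winkler, which is not in the Python standard library. A constant weight stands in for idf. The union is approximated on the plain set union of the elements. Clamping the Dice score to [0,1] is a safeguard added for this sketch. The names `make_level` and `parse` are hypothetical.

```python
from difflib import SequenceMatcher


def weighted_soft_cardinality(items, sim, weight, p=1.0):
    """sum_i weight(a_i) / sum_j sim(a_i, a_j)**p"""
    items = list(items)
    return sum(
        weight(x) / sum(sim(x, y) ** p for y in items)
        for x in items
    )


def make_level(sim, weight, p=1.0):
    """Lift an element-level (sim, weight) pair to collections of elements."""

    def card(collection):
        return weighted_soft_cardinality(collection, sim, weight, p)

    def dice(a, b):
        card_a, card_b = card(a), card(b)
        if card_a + card_b == 0.0:
            return 1.0 if a == b else 0.0
        # |A n B|' = |A|' + |B|' - |A u B|'
        inter = card_a + card_b - card(set(a) | set(b))
        score = 2.0 * inter / (card_a + card_b)
        return min(1.0, max(0.0, score))  # safeguard added for this sketch

    return dice, card


def sim_w(a, b):
    # Stand-in for Jaro-Winkler (assumption): difflib's similarity ratio.
    return SequenceMatcher(None, a, b).ratio()


def weight_w(word):
    # Stand-in for idf (assumption): every word weighs the same.
    return 1.0


sim_s, weight_s = make_level(sim_w, weight_w)   # sentences from words
sim_p, weight_p = make_level(sim_s, weight_s)   # paragraphs from sentences
sim_d, _ = make_level(sim_p, weight_p)          # documents from paragraphs


def parse(text):
    """Hypothetical splitter: blank line, full stop and whitespace delimit levels."""
    return frozenset(
        frozenset(
            frozenset(sentence.lower().split())
            for sentence in paragraph.split(".") if sentence.strip()
        )
        for paragraph in text.split("\n\n") if paragraph.strip()
    )


d_a = parse("The horse runs. The zebra eats.\n\nA bee flies.")
d_b = parse("The horse ran. A zebra eats.\n\nA wasp flies.")
print(sim_d(d_a, d_a), sim_d(d_a, d_b))
```

Each level is represented as a `frozenset` so that sentences and paragraphs stay hashable and can themselves be elements of the level above. The recursion recomputes lower-level similarities on every call, so caching would be advisable for real documents.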