Talk:Jaccard index

Why does cosine similarity redirect here? It needs a separate page and the redirect should be removed.
 * Please someone remove capital sigma and find some other letter to qualify for and to replace NOT pixel, — Preceding unsigned comment added by 5.12.231.73 (talk) 03:46, 15 March 2024 (UTC)


 * read that "sigma with no sub" and "stand for pixels"
 * 5.12.231.73 (talk) 03:49, 15 March 2024 (UTC)

Tanimoto coefficient
Dear me, I've finally tracked down a paper actually by Taffee Tanimoto, from 1960, about similarity and distance....

If we assume these are the correct "Tanimoto" functions, then similarity is the same as Jaccard (but expressed over bit vectors) and distance is

$$ T_d(X,Y) = -{\log} _2 ( T_s(X,Y) ) $$

which I haven't seen anywhere in the more modern literature. This is not a distance function, and this is a deliberate choice!
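
For anyone who wants to check this numerically, here is a minimal Python sketch. It assumes T_s is the Jaccard ratio over bit vectors (represented here as Python sets holding the positions of the set bits) and T_d is the formula above. The function names are my own, not Tanimoto's. The three example vectors violate the triangle inequality, which is one way to see that T_d is not a metric.

```python
import math

def tanimoto_similarity(x, y):
    """Shared set bits divided by bits set in either vector."""
    union = x | y
    if not union:
        raise ValueError("undefined when both bit vectors are all zero")
    return len(x & y) / len(union)

def tanimoto_distance(x, y):
    """-log2 of the similarity; treated as infinite when no bits are shared."""
    s = tanimoto_similarity(x, y)
    return math.inf if s == 0 else -math.log2(s)

# Bit vectors given as the sets of positions whose bit is 1.
a, b, c = {0}, {0, 1}, {1}
d_ab = tanimoto_distance(a, b)
d_bc = tanimoto_distance(b, c)
d_ac = tanimoto_distance(a, c)

# a and c share no bits, so d_ac is infinite while d_ab + d_bc is finite.
print(d_ab, d_bc, d_ac)
print("triangle inequality holds:", d_ac <= d_ab + d_bc)
```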

So it would appear that the reading of the vector function as a specialising multiset function is just wrong, but a neat idea... seems to be my invention, not Tanimoto's!

Why the hell did anyone ever express this in terms of vector arithmetic when it's a bitmap operator?????

_____ —Preceding unsigned comment added by RichardThePict (talk • contribs) 22:59, 20 May 2011 (UTC)

I am proposing to replace this entire section with the text below, if nobody objects:

___

The Tanimoto Similarity Coefficient is a generalisation of Jaccard which allows for a similarity to be calculated over multisets.

To calculate Tanimoto Similarity, the multisets must first be viewed as vectors. In some arbitrary order, each element of each set is considered as a vector dimension, and the cardinality of the element within that set is used as the magnitude of that dimension in its representative vector. For example, the multisets {apple, apple, pear} and {banana, pear} may be viewed as the vectors (2,0,1) and (0,1,1) respectively if the elements are considered in alphabetical order.

Cosine Similarity already gives a similarity coefficient over vectors, bounded in [0,1] when all dimensions are positive or zero. However, the Cosine Similarity of the simple sets {apple, pear} and {banana, pear} yields one half, whereas the Jaccard Coefficient of these sets is one third.
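
The vector view and the two values quoted above can be reproduced with a short Python sketch. The helper names (`as_vectors`, `cosine_similarity`, `jaccard`) are illustrative assumptions, not taken from any source.

```python
import math
from collections import Counter

def as_vectors(multiset_a, multiset_b):
    """Map two multisets to count vectors, one dimension per element in sorted order."""
    count_a, count_b = Counter(multiset_a), Counter(multiset_b)
    dims = sorted(set(count_a) | set(count_b))
    return [count_a[d] for d in dims], [count_b[d] for d in dims]

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))

def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)

# The multiset example: dimensions are (apple, banana, pear).
multi_u, multi_v = as_vectors(["apple", "apple", "pear"], ["banana", "pear"])
print(multi_u, multi_v)  # [2, 0, 1] [0, 1, 1]

# The simple-set example: cosine gives one half, Jaccard gives one third.
simple_u, simple_v = as_vectors(["apple", "pear"], ["banana", "pear"])
cos = cosine_similarity(simple_u, simple_v)
jac = jaccard({"apple", "pear"}, {"banana", "pear"})
print(cos, jac)
```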

The purpose of the Tanimoto Coefficient is thus dual: to give a similarity coefficient over multisets, but also to specialise to the same value as the Jaccard Coefficient when only simple sets are considered. The equation identified by Tanimoto (also, anecdotally, by Jaccard 50 years earlier) is:

$$ T(A,B) =\frac{ A \cdot B}{{\vert A\vert}^2 +{ \vert B\vert}^2 - A \cdot B } $$

Notice that the product and magnitude operators are vector, not set, operators. When two sets (not multisets) are viewed as vectors, the dot product of those vectors is equal to the cardinality of their intersection, and the square of each vector magnitude is the cardinality of the set itself. Thus, the formula immediately specialises to the Jaccard Coefficient for simple sets. Tanimoto is bounded within [0,1] if all vector dimensions are positive.
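
A minimal sketch of the formula above, using exact fractions so the specialisation to Jaccard can be checked without rounding. The function name `tanimoto` and the choice to raise an error for two zero vectors are assumptions made for illustration only.

```python
from fractions import Fraction

def tanimoto(a, b):
    """A.B / (|A|^2 + |B|^2 - A.B) for two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        raise ValueError("undefined when both vectors are zero")
    return Fraction(dot, denom)

# Simple sets {apple, pear} and {banana, pear} as 0/1 vectors:
# the result matches the Jaccard value of one third.
simple = tanimoto((1, 0, 1), (0, 1, 1))

# Multisets {apple, apple, pear} and {banana, pear}:
# the repeated element increases |A|^2 and lowers the similarity.
multi = tanimoto((2, 0, 1), (0, 1, 1))

print(simple, multi)  # 1/3 1/6
```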

Tanimoto Distance (1 - T(A,B)) is often quoted as being a proper distance metric. This is not true in general, but it is true if the values in each dimension are restricted to zero or a single constant. Tanimoto Distance is therefore a proper distance metric over weighted sets modelled as vectors, but not over multisets in general.
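
The failure over multisets can be seen with a small counterexample. The sketch below is only a spot check under my own choice of vectors: it takes the multisets {x}, {x, x} and {x, x, x, x} as one-dimensional vectors and shows that the triangle inequality is violated. It does not prove the positive claim about weighted sets.

```python
from fractions import Fraction

def tanimoto_distance(a, b):
    """1 - A.B / (|A|^2 + |B|^2 - A.B), computed exactly."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1 - Fraction(dot, denom)

# One element with multiplicity 1, 2 and 4.
p, q, r = (1,), (2,), (4,)
d_pq = tanimoto_distance(p, q)
d_qr = tanimoto_distance(q, r)
d_pr = tanimoto_distance(p, r)

# The direct distance p -> r exceeds the detour through q.
print(d_pq, d_qr, d_pr)  # 1/3 1/3 9/13
print("triangle inequality holds:", d_pr <= d_pq + d_qr)  # False
```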

Tanimoto is extensively used in chemistry, to give a similarity metric for molecules. Most sources use the term inaccurately, as a synonym for Jaccard.

RichardThePict (talk) 14:22, 9 May 2011 (UTC)

Other sources give the Tanimoto coefficient as

$$ \frac{A \cdot B}{|A| + |B| + A \cdot B} $$

So without the squares. This also makes more sense when compared to the Jaccard distance.

Some other sources are:
 * http://www.ccl.net/chemistry/resources/messages/2008/04/06.002-dir/index.html
 * http://graphics.med.yale.edu:5080/TriposBookshelf/sybyl/selector/selector_theory5.html

What is the correct definition?

Twanvl (talk) 17:35, 11 November 2008 (UTC)

Another source gives this, as in practical use for molecule similarity, as

$$ t(A,B) =\frac{\vert A \cap B\vert}{\vert A\vert + \vert B\vert - \vert A \cap B\vert } $$

which is superficially the same but doesn't appear to use vector arithmetic, so the magnitude function is set cardinality rather than vector magnitude as defined in this article. The scalar product is of course the same as the cardinality of the intersection unless we're talking about multisets. Are we, by the way, talking about multisets?!?! If not, why does the scalar product feature in the article?

Of course this could just be in error, if someone has transcribed a formula without looking at the meaning of the syntax...

Then, in another source, we have:

Tanimoto distance. The distance between two sets is computed in the following way:

$$ D(S_1, S_2) = \frac{n_1 + n_2 - 2 n_{12}}{n_1 + n_2 - n_{12}} $$

where n1 and n2 are the numbers of elements in sets S1 and S2, respectively, and n12 is the number that is in both sets.
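
As a sanity check on the distance quoted above, the following sketch computes it from the counts n1, n2 and n12 and compares it with one minus the set-based coefficient t(A,B). The function names and the example sets are assumptions chosen for illustration.

```python
from fractions import Fraction

def tanimoto_set_distance(s1, s2):
    """(n1 + n2 - 2*n12) / (n1 + n2 - n12), computed exactly from set sizes."""
    n1, n2, n12 = len(s1), len(s2), len(s1 & s2)
    denom = n1 + n2 - n12
    if denom == 0:
        raise ValueError("undefined when both sets are empty")
    return Fraction(n1 + n2 - 2 * n12, denom)

def tanimoto_set_similarity(s1, s2):
    """|A n B| / (|A| + |B| - |A n B|), i.e. the Jaccard ratio."""
    n12 = len(s1 & s2)
    return Fraction(n12, len(s1) + len(s2) - n12)

s1, s2 = {1, 2, 3}, {2, 3, 4, 5}
dist = tanimoto_set_distance(s1, s2)
sim = tanimoto_set_similarity(s1, s2)

# For these sets the distance is exactly one minus the similarity.
print(dist, sim)  # 3/5 2/5
```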

Tanimoto apparently defined something in 1947 in an IBM Technical Journal which is hard or impossible to access. However others have related it to a digression by Jaccard and published by him in or around 1903, and it has been known as the Jaccard-Tanimoto Index, I believe. Sorry, I don't have attribution for any of this info, but would love to come to a conclusion about how this function actually is defined! All the citations/references I've found in the research literature seem to just copy each other, sometimes adding mistakes....

130.159.185.103 (talk) 11:50, 6 May 2011 (UTC)   RC

It's interesting that there is so much rubbish talked about this coefficient.... hopefully this is not because there's a bad Wikipedia entry on the subject!

The point of Tanimoto seems to be that it is a generalisation of Jaccard, i.e. it's a vector metric (that can therefore be used for multiset similarity with a simple isomorphism) which specialises to Jaccard when the multiset cardinalities don't exceed 1. This section really needs to be completely rewritten to reflect this - I might do this in a while if nobody proposes anything else.

Also, the bit about the [-1,1] range is garbage; those values are all positive (I believe) and so the function range is [0,1].

RichardThePict (talk) 13:08, 9 May 2011 (UTC)

Divide by zero for empty sets
What happens when both sets are empty sets? The metric gives a divide-by-zero. Should the metric only work for non-empty sets or is there a specific value (0.0 or 1.0) that should be returned in this case? pgr94 (talk) 07:27, 22 December 2010 (UTC)

Everywhere I look, Jaccard is for non-empty sets. Unfortunately, I don't have a dead tree reference that mentions it. Charles Merriam (talk) 17:54, 4 July 2013 (UTC)
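
In code, the question above comes down to what to do when the union is empty. The sketch below does not settle which value is correct. It leaves that choice to the caller through a hypothetical `empty_value` parameter and raises an error if none is given.

```python
def jaccard(a, b, empty_value=None):
    """Jaccard index of two sets, with an explicit policy for two empty sets."""
    union = a | b
    if not union:
        # 0/0 case: no value is implied by the formula itself.
        if empty_value is None:
            raise ValueError("Jaccard index is undefined for two empty sets")
        return empty_value
    return len(a & b) / len(union)

print(jaccard({1, 2}, {2, 3}))                  # one shared element out of three
print(jaccard(set(), set(), empty_value=1.0))   # caller decides: 1.0
```

A caller that treats two empty sets as identical would pass 1.0, one that treats them as sharing nothing would pass 0.0.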

Mountford's index of similarity
"See also" points to "Mountford's index of similarity" ... this is a dead link — Preceding unsigned comment added by 67.81.4.49 (talk) 11:24, 15 October 2011 (UTC)

Jaccard Similarity Calculation -- mistake?
In the section "Similarity of asymmetric binary attributes", the article states:

The Jaccard similarity coefficient, J, is given as
 * $$J = {M_{11} \over M_{01} + M_{10} + M_{11}}.$$

This would give a result of 0 where A and B both have a value of 0 (and are, therefore, similar). Shouldn't the correct equation be the following?:
 * $$J = {M_{11} + M_{00} \over M_{01} + M_{10} + M_{11}}.$$

— Preceding unsigned comment added by AmanAhuja (talk • contribs) 02:53, 15 March 2012 (UTC)

Update: never mind, I was conflating the Jaccardian with the Hamming Distance. AmanAhuja (talk) 03:23, 15 March 2012 (UTC)
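
For later readers of this thread, a small sketch of the difference. It computes the four counts for two made-up binary vectors and compares the Jaccard formula from the article with the simple matching coefficient, which is the measure that does count 0/0 agreements. The variant proposed above is included only to show that it can exceed 1. The vectors and names are illustrative assumptions.

```python
def match_counts(a, b):
    """Return (M11, M10, M01, M00) for two equal-length 0/1 vectors."""
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    m00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return m11, m10, m01, m00

a = [1, 0, 0, 1, 0]
b = [1, 0, 0, 0, 0]
m11, m10, m01, m00 = match_counts(a, b)

jaccard = m11 / (m01 + m10 + m11)                 # ignores 0/0 positions
smc = (m11 + m00) / (m11 + m10 + m01 + m00)       # counts 0/0 positions as matches
proposed = (m11 + m00) / (m01 + m10 + m11)        # the variant asked about above

print(jaccard, smc, proposed)  # 0.5 0.8 2.0
```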

Tversky coefficient
I think the whole equivalence classification needs even more cleanup. The entry for Tversky_index describes it as a generalization of Tanimoto and Dice. The comment above describes Tanimoto as a generalization of Jaccard, yet the same author says in Talk:Tversky_index that this is actually NOT the case. I must say this is quite confusing. LinguistManiac (talk) 06:56, 22 May 2012 (UTC)

Clarification
A diagram would help introduce this concept to beginners, and perhaps a worked example in the context of e.g. NLP eval Leondz (talk) 11:37, 18 March 2014 (UTC)

Set of sets
The comment being removed in the first section claims that the measure is a metric on sets of sets and gives two citations. Neither citation mentions sets of sets, or formulates the metric in a way that would be applicable to a set of sets. — Preceding unsigned comment added by Amoss (talk • contribs) 10:33, 1 October 2014 (UTC)

Redundant comparison on two pages
The Jaccard index and Simple_matching_coefficient pages each include a comparison with the other. It's probably better to keep the comparison on one page and link to it from the other.

--Qria (talk) 09:44, 13 November 2016 (UTC)

Similarity of asymmetric binary attributes -- mistake?
Under 'Similarity of asymmetric binary attributes' it says:

M11 + M01 + M10 + M00 = 2*n (n is number of binary features of A and B)

But mustn't it be: M11 + M01 + M10 + M00 = n ?
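
A quick way to check the count, assuming each M value counts attribute positions (so every position of the paired vectors falls into exactly one of the four categories). The example vectors are made up.

```python
from collections import Counter

a = [1, 1, 0, 0, 1, 0]
b = [1, 0, 1, 0, 1, 0]
n = len(a)  # number of binary attributes

# Count how often each (a_i, b_i) combination occurs.
pairs = Counter(zip(a, b))
m11, m10, m01, m00 = pairs[(1, 1)], pairs[(1, 0)], pairs[(0, 1)], pairs[(0, 0)]

total = m11 + m10 + m01 + m00
print(m11, m10, m01, m00, total, n)  # 2 1 1 2 6 6
```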

Difference with the Simple matching coefficient (SMC)
Article states:

"Using the SMC would then induce a bias by systematically considering, as more similar, two customers with large identical baskets compared to two customers with identical but smaller baskets, thus making the Jaccard index a better measure of similarity in that context."

However, customers with identical baskets will have a similarity of 1 regardless of the size of their basket for both Jaccard and SMC measures. 89.101.126.62 (talk) 14:14, 18 February 2017 (UTC)


 * I absolutely agree, thx for bringing it up! This section had a few mistakes actually. I rewrote the paragraph, hopefully that should fix this :)7804j (talk) 20:40, 4 March 2017 (UTC)
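
For reference, the point raised in this thread can be checked with a short sketch. It assumes a hypothetical catalogue of 100 items and baskets modelled as sets of item ids. Identical baskets score 1 under both measures whatever their size, while two small baskets with nothing in common still score highly under SMC because of the items both customers did not buy.

```python
def jaccard(a, b):
    """Shared items divided by items in either basket."""
    return len(a & b) / len(a | b)

def smc(a, b, catalogue):
    """Fraction of catalogue items on which the two baskets agree (both in or both out)."""
    matches = sum(1 for item in catalogue if (item in a) == (item in b))
    return matches / len(catalogue)

catalogue = set(range(100))
small = {0, 1}
large = set(range(50))

# Identical baskets: both measures give 1.0 regardless of basket size.
same_small = (jaccard(small, small), smc(small, small, catalogue))
same_large = (jaccard(large, large), smc(large, large, catalogue))

# Two small baskets with no item in common.
other = {2, 3}
disjoint = (jaccard(small, other), smc(small, other, catalogue))

print(same_small, same_large, disjoint)  # (1.0, 1.0) (1.0, 1.0) (0.0, 0.96)
```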