Wikipedia:Reference desk/Archives/Computing/2018 January 24

= January 24 =

Cosine similarity
I have a moderate sized collection (let's say usually no more than a few thousand) of documents that I'd like to compare by cosine similarity. Is there a better way than the obvious O(n**2) method? Both the online (I have N existing documents and when a new one arrives, I want to find similar ones from the existing pile) and offline (I have N documents and want to cluster them, maybe somehow processing them simultaneously) cases are of interest. Thanks. (Note: I'd have to repeatedly solve the above problem many times, so sitting through an O(n**2) calculation a single time doesn't handle it.) 173.228.123.121 (talk) 02:25, 24 January 2018 (UTC)
 * Well... If you want to compute a table that contains all distances (cosine similarities), it will have to be O(n**2) since the table itself contains O(n**2) values. Adding a new element to an existing, already-processed set will necessarily be O(n) if you want to compute all distances with old elements.
 * It looks like you want something else, but it is not clear what. There could be tricks depending on what you want.
 * For instance, in the online case, if you want to know what is the old element that is the farthest away from an incoming element, and the dimensionality is "small" (smaller than log(n) or something), you could precompute the convex hull of the existing documents, since the most-distant element is guaranteed to be part of the convex hull. That precomputation can also help if you want the closest element, but only if the new element has a decent chance to fall outside the convex hull. (If the dimensionality is not "small", all or most elements will be on the convex hull, so it saves nothing.)
 * I will note that the current version of article about cluster analysis claims that O(n**2) is an unbeatable complexity for hierarchical clustering, though I do not know why or whether it applies to any sort of clustering algorithm. Tigraan Click here to contact me 17:29, 24 January 2018 (UTC)
 * Thanks! Yes, the dimensionality will generally be pretty low, so I'll look into that approach.  173.228.123.121 (talk) 04:30, 26 January 2018 (UTC)