Talk:Okapi BM25

"Please note that the above formula for IDF shows potentially major drawbacks when using it for terms appearing in more than half of the corpus documents. These terms' IDF is negative..." No doubt I'm missing something, but I don't see how that happens. Example? &mdash; JLundell  talk   01:47, 10 June 2014 (UTC)
 * $$\ln a$$ is negative if $$a < 1$$. $$\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}=1$$ is equivalent to $$N = 2n(q_i)$$. So the IDF is negative if $$n(q_i) > N/2$$, i.e. if the term appears in more than half of the documents. Ireas ask! 00:32, 31 October 2015 (UTC)
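
A quick numeric check of the sign change may help here. This is only a sketch: the function name `bm25_idf` and the corpus size of 100 are made up for illustration, and the formula is the IDF quoted in this thread, not any particular library's implementation.

```python
import math

def bm25_idf(N, n):
    """IDF as quoted above: ln((N - n + 0.5) / (n + 0.5)).

    N is the number of documents in the corpus, n the number of
    documents containing the term (both hypothetical values here).
    """
    return math.log((N - n + 0.5) / (n + 0.5))

N = 100  # assumed corpus size, for illustration only
for n in (10, 50, 51, 90):
    # Positive below N/2, zero at exactly N/2, negative above N/2.
    print(n, bm25_idf(N, n))
```

With these assumed numbers the IDF is positive for n = 10, exactly zero for n = 50, and negative for n = 51 and n = 90.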

Is BM25, an algorithm developed in the 70s, really "state-of-the-art"? Doesn't "state-of-the-art" mean that it should be representative of the most advanced, highest-performing algorithm in current use? Word2Vec and PCA (LSI) seem like more advanced algorithms that achieve better performance by many measures and are in widespread use. And if the IR performance metric is user utility, the most popular search engines do not use BM25 or merely a TFIDF-enhanced BM25, but rather ensemble approaches. Hobsonlane (talk) 22:20, 2 March 2016 (UTC)

BM25 was not developed in the 70s, and the article doesn't say it was: "It is based on the probabilistic retrieval framework developed in the 1970s and 1980s" (my emphasis). The first reference on the article indicates TREC2 (1993) for BM11 and BM15, and TREC3 (1994) for BM25. Also from the references, BM25F looks to be 2004 and BM25+ 2011. And "state-of-the-art" is also qualified: "BM25, and its newer variants, e.g. BM25F [...] represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval" (again my emphasis). 2406:E006:29EF:1:8E89:A5FF:FECA:57FE (talk) 02:51, 3 March 2016 (UTC)

Yeah, I think the reader would benefit if those emphasised points in your comment were to "pop" in the wording of the intro paragraph, rather than the "state-of-the-art" claim/promotion that caught my eye. Perhaps "state of the art" should be downplayed, deleted, or moved to a later section. Even if we worded it clearly and accurately, something like "Among TFIDF-based algorithms, some variants of BM25 are considered state of the art" would not be all that helpful for the reader just getting to know BM25. More useful for them are the bits about its history (like its pedigree going back to the 70s and 80s) and current use in modern products (like Bing and Google, or whatever other examples of "TFIDF-based state of the art" IR systems you know of that use BM25). Hobsonlane (talk) 16:53, 23 March 2016 (UTC)

Link http://kak.tx0.org/IR/TFxIDF in the References appears to be dead. Archive.org did pick up a snapshot of it: https://web.archive.org/web/20160916025726/http://kak.tx0.org/IR/TFxIDF Would make the change directly, but I don't know Wikipedia editing protocol. 98.218.179.112 (talk) 00:47, 26 June 2017 (UTC) Anonymous

Note that the $$(k_1+1)$$ in the numerator of the ranking function is constant and thus should have no influence on the ranking. It seems that the function is often cited in this form, without any explanation. As is pointed out in one of the papers cited (staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf): "The reason for including it was to make the final formula more compatible with the RSJ weight used on its own. If it is included, then a single occurrence of a term would have the same weight in both schemes." It might be worth pointing this detail out in the article. Maybe it's just me, but I always found the existence of this term slightly confusing. There is an interesting discussion on this topic in the bug tracker of Apache Lucene (https://issues.apache.org/jira/browse/LUCENE-8563). Icannotthinkofanythinguseful (talk) 12:10, 19 November 2020 (UTC)
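
For anyone who wants to check this numerically, here is a small sketch. Everything in it is an assumption made for illustration: the helper names `bm25_term` and `ranking`, the toy documents, the fixed IDF of 1.5, and the common default parameters k1 = 1.2 and b = 0.75. It scores a single query term with and without the $$(k_1+1)$$ factor and compares the resulting orderings.

```python
def bm25_term(tf, doc_len, avgdl, idf, k1=1.2, b=0.75, with_factor=True):
    """Per-term BM25 score; with_factor toggles the (k1 + 1) numerator."""
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    numerator = tf * (k1 + 1) if with_factor else tf
    return idf * numerator / (tf + length_norm)

# Hypothetical documents: name -> (term frequency, document length).
docs = {"d1": (3, 100), "d2": (1, 50), "d3": (5, 400)}
avgdl = sum(length for _, length in docs.values()) / len(docs)

def ranking(with_factor):
    """Document names sorted by descending score for one query term."""
    scores = {
        name: bm25_term(tf, dl, avgdl, idf=1.5, with_factor=with_factor)
        for name, (tf, dl) in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# The factor rescales every score by the same positive constant,
# so the ordering of documents is unchanged.
print(ranking(True) == ranking(False))

# With the factor, a single occurrence in an average-length document
# scores (up to rounding) the bare IDF, as the quoted passage describes.
print(bm25_term(1, avgdl, avgdl, idf=1.5))
```

The same argument carries over to multi-term queries, since every term's contribution is multiplied by the same constant.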