Talk:Semantic Scholar

Classification of citations
Bianca Kramer notes that Semantic Scholar is "showing whether a citation cites methods, results or background". I feel this would be worth mentioning; I will look for a secondary source later. Nemo 07:01, 25 October 2019 (UTC)

"Human-Computer interaction" capitalization
Minor remark: Is there a reason I'm missing for the uppercase "H" and "C" (they are lowercase in the article Human–computer interaction)? Best, --Marsupium (talk) 23:45, 23 December 2021 (UTC)

Misleading text: "Does not search for material behind a paywall"
Current text is: "Semantic Scholar is free to use and unlike similar search engines (i.e. Google Scholar) does not search for material that is behind a paywall.[citation needed]" This implies that it only returns results that are not behind a paywall, when what it actually means is: "it doesn't search within and across the material behind a paywall".

I wish someone (with good English skills!) would make this clear to avoid misunderstandings. Kouroshkoratamadia (talk) 11:53, 27 December 2023 (UTC)


 * Yes, this is clearly wrong. It's easy to verify by just searching for any article that's published by a non-open access journal and seeing if it comes up. However, the current citation does use the "behind a paywall" wording without clarifying what that means, so I think it would require a different clarifying source if it's to be changed? Joshisanonymous (talk) 16:50, 20 April 2024 (UTC)
 * It does not allow anyone to link articles that are not freely available online; it only recognizes articles that are freely available. FrankieItalo (talk) 01:11, 6 May 2024 (UTC)


 * Ping Kouroshkoratamadia and Joshisanonymous. You're correct. The source that is quoted above and cited in the WP article is wrong.
 * Semantic Scholar (SS) is free to use
 * SS searches for and extracts information from freely available online journal articles
 * however, SS ALSO searches for material behind paywalls! In other words, Semantic Scholar can (and does) access many articles that are not published in open access scholarly journals.
 * Descriptions of SS are very misleading! Even the U.S. Department of Commerce Research Library, LITERATURE SEARCH: SEMANTIC SCHOLAR gets it wrong, "[SS] does not search for material that is behind a paywall."


 * I found 3 explanations of how Semantic Scholar (SS) works, regarding paywalls.
 * ONE From the SS FAQ, Content section Q1. Where does Semantic Scholar source papers from?  A1. "Semantic Scholar sources its content via web indexing and from partnerships with scientific journals... You can find a list of our sources by visiting our publisher partners page... We index content from PubMed, arXiv, Springer Nature, and more."
 * Q2. How do I access the full text of a paper?  A2.  "...you will find access options below the abstract of the paper located on the paper detail page... you will see options to View PDF, View Paper, or View via Publisher that will redirect you to a full-text PDF... If the paper is not freely accessible, the publisher website has options to purchase the paper. For more information, see How do I access a PDF using my institutional affiliation?"
 * This is no different from Google Scholar or any other research paper repository that includes subscription-only journals.


 * TWO In A1 above, publisher partners links here: https://www.semanticscholar.org/about/publishers University of Chicago Press is listed. This is how U Chicago Press describes its partnership with SS, emphasis mine: "Articles published in University of Chicago Press journals will now appear in the Semantic Scholar corpus, providing readers with bibliographic information and article summaries. Each article entry links directly to the journal’s webpage, so subscribers can read the full text or download a PDF."
 * So, if the SS citation is for an article in an open access journal, you can read it or download it. All other article citations returned by SS are paywalled.


 * THREE In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), the paper S2ORC: The Semantic Scholar Open Research Corpus (pdf) restricts its full-text corpus to open access papers. On p. 4970, "Papers in SS are derived from numerous sources: obtained directly from publishers... from arXiv or PubMed, or crawled from the open Internet." This means that most of the papers that SS gets come directly from publishers and are not open access. S2ORC consists of all the papers in the Semantic Scholar corpus that are in English, have abstracts and are open access. The SS full corpus is approximately 300M journal articles. Some filtering is done to get to 81.1M papers.
 * See pp. 4972−4973 and Table 3. "Our publisher-provided abstract coverage is 90.4%, or 73.4M papers. Our PDF coverage is 35.6%, or 28.9M papers... we extract bibliography entries for 27.6M of the 28.9M PDFs. We identify 8.1M of the 28.9M PDFs as open access, and we provide full text for all papers in this open access subset. Using these extracted bibliographies, we resolve a total of 380.5M citation links between papers..."
 * Only 10% of the 81 million papers in the curated subset are open access PDFs! --FeralOink (talk) 23:04, 10 July 2024 (UTC)
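 * For anyone who wants to check the arithmetic behind the "10%" figure, here is a minimal sketch in Python. It assumes only the rounded counts quoted above from Table 3 of the S2ORC paper (in millions); the names `TOTAL_PAPERS_M`, `COUNTS_M` and `share_percent` are made up for illustration and do not come from the paper or from any Semantic Scholar tool.

```python
# Rounded counts quoted above from the S2ORC paper, in millions of papers.
# These are the talk-page figures, not values fetched from Semantic Scholar.
TOTAL_PAPERS_M = 81.1
COUNTS_M = {
    "publisher-provided abstracts": 73.4,
    "PDFs": 28.9,
    "open access PDFs": 8.1,
}


def share_percent(count_m, total_m=TOTAL_PAPERS_M):
    """Return count_m as a percentage of total_m, rounded to one decimal."""
    return round(100 * count_m / total_m, 1)


# Share of the 81.1M filtered papers that falls into each category.
shares = {label: share_percent(count) for label, count in COUNTS_M.items()}

# Share of the PDFs (not of all papers) that are identified as open access.
open_access_share_of_pdfs = share_percent(
    COUNTS_M["open access PDFs"], COUNTS_M["PDFs"]
)

for label, pct in shares.items():
    print(f"{label}: {pct}% of {TOTAL_PAPERS_M}M papers")
print(f"open access share of PDFs: {open_access_share_of_pdfs}%")
```

 * With these inputs, the open access share works out to about 10% of the 81.1M papers and about 28% of the 28.9M PDFs. Because the inputs are already rounded, a recomputed percentage can differ from the paper's quoted one in the last decimal place.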

Incorrect analyses
There are serious problems with its classifications. It uses only single-letter first initials and therefore mixes authors across all kinds of fields when names are common. It throws anything in a foreign language together without analysis; it needs to analyze foreign languages as well as English. It does not allow scholars to correct errors; for instance, it lists reviews as articles under the author(s) of the book reviewed, which is totally inappropriate. It also arbitrarily separates sections of one author's works by what it thinks is the subject and will not allow the author concerned to combine the pages. FrankieItalo (talk) 01:15, 6 May 2024 (UTC)


 * SS (S2?) seems kind of useless. I wonder why the Allen Institute for AI released it given the sort of data quality problems you described! They've been working away on SS since 2017 or earlier. I didn't read it, but this paper might address the name mix-ups you mentioned: S2AND: A Benchmark and Evaluation System for Author Name Disambiguation.
 * The problems with non-English names (and probably anything that uses the Cyrillic alphabet) remind me of the OFAC and FinCEN lists that I used to work with. I work/worked in bank risk management, and I could not believe how expensive and error-prone some of the sanctions solutions "services" were! So many false negatives due to the first 3 sentences of what you wrote. Paul Allen and BERT AI should do much better.
 * I noticed in the SS FAQ that it is difficult or impossible for scholars to correct errors of fact in the citation entries for their work, which is ridiculous. --FeralOink (talk) 23:35, 10 July 2024 (UTC)