Wikipedia:Wikipedia Signpost/2023-07-17/Recent research

Wikipedia and open access

 * Reviewed by Nicolas Jullien

From the abstract: "we analyze a large dataset of citations from Wikipedia and model the role of open access in Wikipedia's citation patterns. We find that open-access articles are extensively and increasingly more cited in Wikipedia. What is more, they show a 15% higher likelihood of being cited in Wikipedia when compared to closed-access articles, after controlling for confounding factors. This open-access citation effect is particularly strong for articles with low citation counts, including recently published ones. Our results show that open access plays a key role in the dissemination of scientific knowledge, including by providing Wikipedia editors timely access to novel results."

Why does it matter for the Wikipedia community?
This article is a first draft of an analysis of the relationship between the availability of a scientific article as open access and the fact that it is cited in the English Wikipedia (note: although it speaks of "Wikipedia", the article looks only at the English pages). It is a preprint and has not been peer-reviewed, so its results should be read with caution, especially since I am not sure about the robustness of the model and the results derived from it (see below). It is of course a very important issue, as access to scientific sources is key to the diffusion of scientific knowledge, but also, as the authors mention, because Wikipedia is seen as central to the diffusion of scientific facts (and is sometimes used by scientists to push their ideas).

Review
The results presented in the article (and its abstract) highlight two important issues for Wikipedia that will likely be addressed in a more complete version of the paper:
 * The question of the reliability of the sources used by Wikipedians
 * → The regressions seem to indicate that the reputation of the journal does not matter for being cited in Wikipedia.
 * → Predatory journals are known to be more often open access than classical journals, which means that this result potentially indicates that the phenomenon of open access reduces the seriousness of Wikipedia sources.

The authors say on p. 4 that they provided "each journal with an SJR score, H-index, and other relevant information." Why did they not use this as a control variable? (this echoes a debate on the role of Wikipedia: is it to disseminate verified knowledge, or to serve as a platform for the dissemination of new theories? The authors seem to lean towards the second view: p. 2: "With the rapid development of the Internet, traditional peer review and journal publication can no longer meet the need for the development of new ideas".)


 * The solidity of the paper's conclusions
 * The authors said: "STEM fields, especially biology and medicine, comprise the most prominent scientific topics in Wikipedia [17]." "General science, technology, and biomedical research have relatively higher OA rates."
 * → So it is to be expected that, on average, Open Access articles account for a larger share of citations in Wikipedia than of the entire available research corpus, which could by itself explain the finding that open access articles are cited more.
 * → Why not control for academic discipline in the models?

More problematic (and acknowledged by the authors, so probably in the process of being addressed), the authors said, on p. 7, that they built their model with the assumption that the age of a research article and the number of citations it has both influence the probability of an article being cited in Wikipedia. Of course, for this causal effect to hold, the age and the number of citations must be taken into account at the moment the article is cited in Wikipedia. For example, if some of the citations are made after the citation in Wikipedia, one could argue that the causal effect could be in the other direction. Also, many articles are open access after an embargo period, and are therefore considered open access in the analysis, whereas they may have been cited in Wikipedia when they were under embargo. The authors did not check for this, as acknowledged in the last sentence of the article. Would their result hold if they re-estimated their model using, for example, the first citation in the English Wikipedia, with the age of the article, its open access status, etc. measured at that moment?
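To make the timing concern concrete, here is a minimal sketch of how the covariates could be measured as of the first Wikipedia citation rather than at data collection time. It is only an illustration of the reviewer's point: the function name, its inputs, and the per-article date fields are hypothetical and are not taken from the preprint.

```python
from datetime import date

def covariates_at_first_wikipedia_citation(published, open_access_since,
                                           citation_dates, first_wiki_citation):
    """Measure an article's covariates at the moment Wikipedia first cited it.

    All arguments are hypothetical inputs: `open_access_since` is the date the
    article became freely available (None if it never did), and
    `citation_dates` lists the dates of its scholarly citations.
    """
    # Age of the article when it was first cited in Wikipedia.
    age_days = (first_wiki_citation - published).days
    # Only citations made before the Wikipedia citation can have influenced it.
    prior_citations = sum(1 for d in citation_dates if d < first_wiki_citation)
    # An article still under embargo at that moment was not yet open access.
    was_open_access = (open_access_since is not None
                       and open_access_since <= first_wiki_citation)
    return {
        "age_days": age_days,
        "prior_citations": prior_citations,
        "was_open_access": was_open_access,
    }
```

In this sketch, an article that left its embargo only after being cited in Wikipedia would be counted as closed access, and later scholarly citations would be excluded from its citation count.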

In short
Although this first draft is probably not solid enough to be cited in Wikipedia, it signals important research in progress, and I am sure that the richness of the data and the quality of the team will quickly lead to very interesting insights for the Wikipedia community.

Related earlier coverage

 * "Quantifying Engagement with Citations on Wikipedia" (about a 2020 paper that among other results found that "open access sources [...] are particularly popular" with readers)
 * "English Wikipedia lacking in open access references" (2022)

"Controversies over Historical Revisionism in Wikipedia"

 * Reviewed by Andreas Kolbe

From the abstract:  This study investigates the development of historical revisionism on Wikipedia. The edit history of Wikipedia pages allows us to trace the dynamics of individuals and coordinated groups surrounding controversial topics. This project focuses on Japan, where there has been a recent increase in right-wing discourse and dissemination of different interpretations of historical events.

This brief study, one of the extended abstracts accepted at the Wiki Workshop (10th edition), follows up on reports that some historical pages on the Japanese Wikipedia, particularly those related to World War II and war crimes, have been edited in ways that reflect radical right-wing ideas (see previous Signpost coverage). It sets out to answer three questions:


 * 1) What types of historical topics are most susceptible to historical revisionism?
 * 2) What are the common factors for the historical topics that are subject to revisionism?
 * 3) Are there groups of editors who are seeking to disseminate revisionist narratives?

The study focuses on the level of controversy of historical articles, based on the notion that the introduction of revisionism is likely to lead to edit wars. The authors found that the most controversial historical articles in the Japanese Wikipedia were indeed focused on areas that are of particular interest to revisionists. From the findings:

Articles related to WWII exhibited significantly greater controversy than general historical articles. Among the top 20 most controversial articles, eleven were largely related to Japanese war crimes and right-wing ideology. Over time, the number of contributing editors and the level of controversy increased. Furthermore, editors involved in edit wars were more likely to contribute to a higher number of controversial articles, particularly those related to right-wing ideology. These findings suggest the possible presence of groups of editors seeking to disseminate revisionist narratives.

The paper establishes that articles covering these topic areas in the Japanese Wikipedia are contested and subject to edit wars. However, it does not measure to what extent article content has been compromised. Edit wars could be a sign of mainstream editors pushing back against revisionists, while conversely an absence of edit wars could indicate that a project has been captured (cf. the Croatian Wikipedia). While this little paper is a useful start, further research on the Japanese Wikipedia seems warranted.

See also our earlier coverage of a related paper: "Wikimedia Foundation builds 'Knowledge Integrity Risk Observatory' to enable communities to monitor at-risk Wikipedias"

Wikipedia-based LLM chatbot "outperforms all baselines" regarding factual accuracy

 * Reviewed by Tilman Bayer

This preprint (by three graduate students at Stanford University's computer science department and Monica S. Lam as fourth author) discusses the construction of a Wikipedia-based chatbot:  "We design WikiChat [...] to ground LLMs using Wikipedia to achieve the following objectives. While LLMs tend to hallucinate, our chatbot should be factual. While introducing facts to the conversation, we need to maintain the qualities of LLMs in being relevant, conversational, and engaging." The paper sets out from the observation that  "LLMs cannot speak accurately about events that occurred after their training, which are often topics of great interest to users, and [...] are highly prone to hallucination when talking about less popular (tail) topics. [...] Through many iterations of experimentation, we have crafted a pipeline based on information retrieval that (1) uses LLMs to suggest interesting and relevant facts that are individually verified against Wikipedia, (2) retrieves additional up-to-date information, and (3) composes coherent and engaging time-aware responses. [...] We focus on evaluating important but previously neglected issues such as conversing about recent and tail topics. We find that WikiChat outperforms all baselines in terms of the factual accuracy of its claims, by up to 12.1%, 28.3% and 32.7% on head, recent and tail topics, while matching GPT-3.5 in terms of providing natural, relevant, non-repetitive and informational responses."

The researchers argue that "most chatbots are evaluated only on static crowdsourced benchmarks like Wizard of Wikipedia (Dinan et al., 2019) and Wizard of Internet (Komeili et al., 2022). Even when human evaluation is used, evaluation is conducted only on familiar discussion topics. This leads to an overestimation of the capabilities of chatbots." They call such topics "head topics" ("Examples include Albert Einstein or FC Barcelona"). In contrast, the lesser known "tail topics [are] likely to be present in the pre-training data of LLMs at low frequency. Examples include Thomas Percy Hilditch or Hell's Kitchen Suomi". As a third category, they consider "recent topics" ("topics that happened in 2023, and therefore are absent from the pre-training corpus of LLMs, even though some background information about them could be present. Examples include Spare (memoir) or 2023 Australian Open"). The latter are obtained from a list of most edited Wikipedia articles in early 2023.

Regarding the "core verification problem [...] whether a claim is backed up by the retrieved paragraphs [the researchers] found that there is a significant gap between LLMs (even GPT-4) and human performance [...]. Therefore, we conduct human evaluation via crowdsourcing, to classify each claim as supported, refuted, or [not having] enough information." (This observation may be of interest regarding efforts to use LLMs as tools for Wikipedians to check the integrity of citations on Wikipedia. See also the "WiCE" paper below.)

In contrast, the evaluation for "conversationality" is conducted "with simulated users using LLMs. LLMs are good at simulating users: they have the general familiarity with world knowledge and know how users behave socially. They are free to occasionally hallucinate, make mistakes, and repeat or even contradict themselves, as human users sometimes do."

In the paper's evaluation, WikiChat impressively outperforms the two comparison baselines in all three topic areas (even the well-known "head" topics). It may be worth noting though that the comparison did not include widely used chatbots such as ChatGPT or Bing AI. Instead, the authors chose to compare their chatbot with Atlas (describing it as based on a retrieval-augmented language model that is "state-of-the-art [...] on the KILT benchmark") and GPT-3.5 (while ChatGPT is or has been based on GPT-3.5 too, it involved extensive additional finetuning by humans).

Briefly

 * Compiled by Tilman Bayer



Wikimedia Foundation launches experimental ChatGPT plugin for Wikipedia
As part of an effort "to understand how Wikimedia can become the essential infrastructure of free knowledge in a possible future state where AI transforms knowledge search", on July 13 the Wikimedia Foundation announced a new Wikipedia-based plugin for ChatGPT. (Such third-party plugins are currently available to all subscribers of ChatGPT Plus, OpenAI's paid variant of their chatbot; the Wikipedia plugin's code itself is available as open source.) The Foundation describes it as an experiment designed to answer research questions such as "whether users of AI assistants like ChatGPT are interested in getting summaries of verifiable knowledge from Wikipedia".

The plugin works by first performing a Google site search on Wikipedia to find articles matching the user's query, and then passing the first few paragraphs of each article's text to ChatGPT, together with additional (hidden) instruction prompts on how the assistant should use them to generate an answer for the user (e.g. "In ALL responses, Assistant MUST always link to the Wikipedia articles used").
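The flow described above can be sketched roughly as follows. This is an illustration only, not the plugin's actual code: the function name `build_plugin_prompt`, the prompt layout, and the three-paragraph cutoff are assumptions, and the search step is replaced by a dictionary of article texts that is assumed to have been fetched already.

```python
# Hypothetical sketch of the prompt assembly step; not the plugin's real code.
PLUGIN_INSTRUCTION = (
    "In ALL responses, Assistant MUST always link to the Wikipedia articles used"
)

def build_plugin_prompt(query, articles, max_paragraphs=3):
    """Combine the hidden instruction, article excerpts and the user's query.

    `articles` maps an article URL to its plain text. Finding these articles
    (the site search step) is not shown here.
    """
    parts = [PLUGIN_INSTRUCTION]
    for url, text in articles.items():
        # Keep only the first few non-empty paragraphs of each article.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        excerpt = "\n\n".join(paragraphs[:max_paragraphs])
        parts.append(f"Source: {url}\n{excerpt}")
    parts.append(f"User question: {query}")
    return "\n\n".join(parts)
```

The assembled string stands in for what would be passed to the chatbot; the user would see only the generated answer, not the instruction text.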

Wikimedia Foundation Research report
The Wikimedia Foundation's Research department has published its biannual activity report, covering the work of the department's 10 staff members as well as its contractors and formal collaborators during the first half of 2023.

New per-country pageview dataset
The Wikimedia Foundation announced the public release of "almost 8 years of pageview data, partitioned by country, project, and page", sanitized using differential privacy to protect sensitive information. See the documentation.

Wikimedia Research Showcase
See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
 * Compiled by Tilman Bayer

Prompting ChatGPT to answer according to Wikipedia reduces hallucinations
From the abstract:  "Large Language Models (LLMs) may hallucinate and generate fake information, despite pre-training on factual data. Inspired by the journalistic device of 'according to sources', we propose according-to prompting: directing LLMs to ground responses against previously observed text. To quantify this grounding, we propose a novel evaluation metric (QUIP-Score) that measures the extent to which model-produced answers are directly found in underlying text corpora. We illustrate with experiments on Wikipedia that these prompts improve grounding under our metrics, with the additional benefit of often improving end-task performance." The authors tested various variations of such "grounding prompts" (e.g. "As an expert editor for Wikipedia, I am confident in the following answer." or "I found some results for that on Wikipedia. Here’s a direct quote:"). The best performing prompt was "Respond to this question using only information that can be attributed to Wikipedia".
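As a rough illustration of the two ideas in this abstract, the sketch below prepends the best-performing grounding prompt to a question and computes a simple overlap score between an answer and a reference corpus. The overlap function is a much simplified stand-in and not the paper's definition of QUIP-Score: the use of word-level trigrams, the small in-memory corpus, the placement of the prompt before the question, and all function names are assumptions of this sketch.

```python
GROUNDING_PROMPT = (
    "Respond to this question using only information "
    "that can be attributed to Wikipedia"
)

def add_grounding(question):
    """Attach the grounding prompt to a question (placement is an assumption)."""
    return f"{GROUNDING_PROMPT}. {question}"

def ngrams(text, n):
    """Return the list of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_precision(answer, corpus_texts, n=3):
    """Share of the answer's n-grams that also occur somewhere in the corpus."""
    corpus_ngrams = set()
    for doc in corpus_texts:
        corpus_ngrams.update(ngrams(doc, n))
    answer_ngrams = ngrams(answer, n)
    if not answer_ngrams:
        return 0.0  # answer too short to contain any n-gram
    hits = sum(1 for gram in answer_ngrams if gram in corpus_ngrams)
    return hits / len(answer_ngrams)
```

A higher score means that more of the answer's wording can be found verbatim in the reference text, which is the intuition behind measuring grounding by overlap.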

"Citations as Queries: Source Attribution Using Language Models as Rerankers"
From the abstract:  "This paper explores new methods for locating the sources used to write a text, by fine-tuning a variety of language models to rerank candidate sources. [...] We conduct experiments on two datasets, English Wikipedia and medieval Arabic historical writing, and employ a variety of retrieval and generation based reranking models. [...] We find that semisupervised methods can be nearly as effective as fully supervised methods while avoiding potentially costly span-level annotation of the target and source documents."

"WiCE: Real-World Entailment for Claims in Wikipedia"
From the abstract: "We propose WiCE, a new textual entailment dataset centered around verifying claims in text, built on real-world claims and evidence in Wikipedia with fine-grained annotations. We collect sentences in Wikipedia that cite one or more webpages and annotate whether the content on those pages entails those sentences. Negative examples arise naturally, from slight misinterpretation of text to minor aspects of the sentence that are not attested in the evidence. Our annotations are over sub-sentence units of the hypothesis, decomposed automatically by GPT-3, each of which is labeled with a subset of evidence sentences from the source document. We show that real claims in our dataset involve challenging verification problems, and we benchmark existing approaches on this dataset. In addition, we show that reducing the complexity of claims by decomposing them by GPT-3 can improve entailment models' performance on various domains." The preprint gives the following examples of such an automatic decomposition performed by GPT-3 (using the prompt "Segment the following sentence into individual facts:" accompanied by several instructional examples):

Original Sentence:
 * The main altar houses a 17th-century fresco of figures interacting with the framed 13th century icon of the Madonna (1638), painted by Mario Balassi.

[Sub-claims predicted by GPT-3:]
 * The main altar houses a 17th-century fresco.
 * The fresco is of figures interacting with the framed 13th-century icon of the Madonna.
 * The icon of the Madonna was painted by Mario Balassi in 1638.
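A few-shot prompt of the kind described above could be assembled along the following lines. Only the instruction string is taken from the preprint as quoted here; the prompt layout, the bullet format, and the function names `build_decomposition_prompt` and `parse_facts` are assumptions made for this sketch.

```python
DECOMPOSE_INSTRUCTION = "Segment the following sentence into individual facts:"

def build_decomposition_prompt(sentence, examples):
    """Build a few-shot prompt from (sentence, [facts]) example pairs.

    The layout (instruction, sentence, then one "- fact" line per fact) is a
    hypothetical choice, not the format used in the preprint.
    """
    blocks = []
    for example_sentence, facts in examples:
        fact_lines = "\n".join(f"- {fact}" for fact in facts)
        blocks.append(
            f"{DECOMPOSE_INSTRUCTION}\n\n{example_sentence}\n\n{fact_lines}"
        )
    # End with an open bullet so that the model continues with the first fact.
    blocks.append(f"{DECOMPOSE_INSTRUCTION}\n\n{sentence}\n\n-")
    return "\n\n".join(blocks)

def parse_facts(completion):
    """Split a model completion into a list of sub-claims, one per line."""
    facts = []
    for line in completion.splitlines():
        fact = line.strip().lstrip("-").strip()
        if fact:
            facts.append(fact)
    return facts
```

Each parsed sub-claim can then be checked against the cited source separately, which is the step the abstract reports as improving entailment models' performance.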

"SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages"
From the abstract: "[...] we introduce the SWiPE dataset, which reconstructs the document-level editing process from English Wikipedia (EW) articles to paired Simple Wikipedia (SEW) articles. In contrast to prior work, SWiPE leverages the entire revision history when pairing pages in order to better identify simplification edits. We work with Wikipedia editors to annotate 5,000 EW-SEW document pairs, labeling more than 40,000 edits with proposed 19 categories. To scale our efforts, we propose several models to automatically label edits, achieving an F-1 score of up to 70.6, indicating that this is a tractable but challenging NLU [Natural-language understanding] task."

"Descartes: Generating Short Descriptions of Wikipedia Articles"
From the abstract:  "we introduce the novel task of automatically generating short descriptions for Wikipedia articles and propose Descartes, a multilingual model for tackling it. Descartes integrates three sources of information to generate an article description in a target language: the text of the article in all its language versions, the already-existing descriptions (if any) of the article in other languages, and semantic type information obtained from a knowledge graph. We evaluate a Descartes model trained for handling 25 languages simultaneously, showing that it beats baselines (including a strong translation-based baseline) and performs on par with monolingual models tailored for specific languages. A human evaluation on three languages further shows that the quality of Descartes’s descriptions is largely indistinguishable from that of human-written descriptions; e.g., 91.3% of our English descriptions (vs. 92.1% of human-written descriptions) pass the bar for inclusion in Wikipedia, suggesting that Descartes is ready for production, with the potential to support human editors in filling a major gap in today’s Wikipedia across languages."

"WikiDes: A Wikipedia-based dataset for generating short descriptions from paragraphs"
From the abstract: "In this paper, we introduce WikiDes, a novel dataset to generate short descriptions of Wikipedia articles for the problem of text summarization. The dataset consists of over 80k English samples on 6987 topics. [...] [The autogenerated descriptions are preferred in] human evaluation in over 45.33% [cases] against the gold descriptions. [...] The automatic generation of new descriptions reduces the human efforts in creating them and enriches Wikidata-based knowledge graphs. Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions." From the introduction:  "With the rapid development of Wikipedia and Wikidata in recent years, the editor community has been overloaded with contributing new information adapting to user requirements, and patrolling the massive content daily. Hence, the application of NLP and deep learning is key to solving these problems effectively. In this paper, we propose a summarization approach trained on WikiDes that generates missing descriptions in thousands of Wikidata items, which reduces human efforts and boosts content development faster. The summarizer is responsible for creating descriptions while humans toward a role in patrolling the text quality instead of starting everything from the beginning. Our work can be scalable to multilingualism, which takes a more positive impact on user experiences in searching for articles by short descriptions in many Wikimedia projects." See also the "Descartes" paper (above).

"Can Language Models Identify Wikipedia Articles with Readability and Style Issues?"
From the abstract:  "we investigate using GPT-2, a neural language model, to identify poorly written text in Wikipedia by ranking documents by their perplexity. We evaluated the properties of this ranking using human assessments of text quality, including readability, narrativity and language use. We demonstrate that GPT-2 perplexity scores correlate moderately to strongly with narrativity, but only weakly with reading comprehension scores. Importantly, the model reflects even small improvements to text as would be seen in Wikipedia edits. We conclude by highlighting that Wikipedia's featured articles counter-intuitively contain text with the highest perplexity scores."
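For readers unfamiliar with the measure, perplexity is the exponential of the negative average log-probability that a language model assigns to each token of a text, so text the model finds surprising scores higher. The sketch below ranks documents by that value. It assumes the per-token natural-log probabilities have already been obtained from a model such as GPT-2; the function names and the input format are hypothetical, and this is not the authors' code.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)

def rank_by_perplexity(docs):
    """Return document titles sorted from highest to lowest perplexity.

    `docs` maps a title to the list of token log-probabilities that a
    language model assigned to that document's text.
    """
    return sorted(docs, key=lambda title: perplexity(docs[title]), reverse=True)
```

Under the paper's approach, the documents at the top of such a ranking would be the candidates for readability or style problems.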

"Wikibio: a Semantic Resource for the Intersectional Analysis of Biographical Events"
From the abstract:  "In this paper we [are] presenting a new corpus annotated for biographical event detection. The corpus, which includes 20 Wikipedia biographies, was compared with five existing corpora to train a model for the biographical event detection task. The model was able to detect all mentions of the target-entity in a biography with an F-score of 0.808 and the entity-related events with an F-score of 0.859. Finally, the model was used for performing an analysis of biases about women and non-Western people in Wikipedia biographies."

"Detecting Cross-Lingual Information Gaps in Wikipedia"
From the abstract: "The proposed approach employs Latent Dirichlet Allocation (LDA) to analyze linked entities in a cross-lingual knowledge graph in order to determine topic distributions for Wikipedia articles in 28 languages. The distance between paired articles across language editions is then calculated. The potential applications of the proposed algorithm to detecting sources of information disparity in Wikipedia are discussed [...]" From the paper: "In this PhD project, leveraging the Wikidata Knowledge base, we aim to provide empirical evidence as well as theoretical grounding to address the following questions:
 * RQ1) How can we measure the information gap between different language editions of Wikipedia?
 * RQ2) What are the sources of the cross-lingual information gap in Wikipedia?

[...] The results revealed a correlation between stronger similarities [...] and languages spoken in countries with established historical or geographical connections, such as Russian/Ukrainian, Czech/Polish, and Spanish/Catalan."
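The excerpts quoted here do not say which distance is computed between the topic distributions of paired articles. As one plausible illustration, the sketch below uses the Jensen-Shannon distance, a common symmetric measure for comparing probability distributions; this choice, the function names, and the toy inputs are assumptions rather than the paper's method.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance between two topic distributions.

    `p` and `q` are lists of topic probabilities of equal length, each summing
    to 1. The result ranges from 0 (identical) to 1 (no shared topics).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, divergence))
```

A small distance between, say, the Russian and Ukrainian versions of an article would indicate similar topical coverage, while a large one would point to an information gap.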

"Wikidata: The Making Of"
From the abstract: "In this paper, we try to recount [Wikidata's] remarkable journey, and we review what has been accomplished, what has been given up on, and what is yet left to do for the future."

"Mining the History Sections of Wikipedia Articles on Science and Technology"
From the abstract: "Priority conflicts and the attribution of contributions to important scientific breakthroughs to individuals and groups play an important role in science, its governance, and evaluation. [...] Our objective is to transform Wikipedia into an accessible, traceable primary source for analyzing such debates. In this paper, we introduce Webis-WikiSciTech-23, a new corpus consisting of science and technology Wikipedia articles, focusing on the identification of their history sections. [...] The identification of passages covering the historical development of innovations is achieved by combining heuristics for section heading analysis and classifiers trained on a ground truth of articles with designated history sections."