Wikipedia:Wikipedia Signpost/2022-10-31/Recent research

Think tank publishes report on "Information Warfare and Wikipedia"
The Institute for Strategic Dialogue, a London-based think tank, published a report earlier this month (co-authored with a company called CASM Technology) focusing "on information warfare on Wikipedia about the invasion of Ukraine"; see also this issue's "In the media" (summarizing media coverage of the report) and "Disinformation report" (providing context in the form of various other concrete cases).

As summarized in the abstract: "The report combines a literature review on publicly available research and information around Wikipedia, expert interviews and a case study. For the case study, the English-language Wikipedia page for the Russo-Ukrainian war was chosen, where accounts that edited the page and have subsequently been blocked from editing were examined. Their editing behaviour on other Wikipedia pages was mapped to understand the scale and overlap of contributions. This network mapping has seemed to identify a particular strategy used by bad actors of dividing edits on similar pages across a number of accounts in order to evade detection. Researchers then tested an approach of filtering edits by blocked editors based on whether they add references to state-media affiliated or sponsored sites, and found that a number of edits exhibited narratives consistent with Kremlin-sponsored information warfare. Based on this, researchers were able to identify a number of other Wikipedia pages where blocked editors introduced state-affiliated domains [...]"

The report offers a great overview of Wikipedia's existing mechanisms for dealing with such issues, based on numerous conversations with community members and other experts. However, the literature review indicates that the authors, despite confidently telling Wired magazine "We've never tried to analyze Wikipedia data in that way before", were unfamiliar with a lot of existing academic research (e.g. about finding alternative accounts, aka sockpuppets, of abusive editors); the 39 references cited in the report include only a single peer-reviewed research paper. Likewise, despite the hope that their findings could yield "new tools" (Wired) that would support combating disinformation on Wikipedia, there is no indication that the authors were aware of past and ongoing research-supported product development efforts to build such tools, by the Wikimedia Foundation and others, some of which are outlined below.
On Twitter, the lead author stated that "We're going to be doing more research on information warfare on Wikipedia with a new project kicking off later this month [October]", so perhaps some of these gaps can still be bridged.
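
For readers who want a concrete picture of the filtering step described in the abstract (keeping only edits by blocked editors that add references to state-affiliated domains), here is a minimal sketch. It is not the report's actual code: the domain list, the edit record fields (`user`, `page`, `added_text`) and all function names are assumptions made up for illustration, and the example domains are placeholders rather than real sites.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical placeholder domains; the report's actual list is not reproduced here.
STATE_AFFILIATED_DOMAINS = {"state-media.example", "sponsored-news.example"}

# Rough URL matcher for wikitext; stops at whitespace and common wiki delimiters.
URL_PATTERN = re.compile(r'https?://[^\s|\[\]}<>"]+')


def referenced_hosts(wikitext):
    """Return the host names of all URLs found in a piece of wikitext."""
    hosts = set()
    for url in URL_PATTERN.findall(wikitext):
        host = urlparse(url).hostname or ""
        if host:
            hosts.add(host)
    return hosts


def listed_domain_for(host):
    """Return the listed domain that a host equals or is a subdomain of, else None."""
    for domain in STATE_AFFILIATED_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def flag_edits(edits, blocked_users):
    """Keep edits by blocked users whose added text cites a listed domain."""
    flagged = []
    for edit in edits:
        if edit["user"] not in blocked_users:
            continue
        hits = set()
        for host in referenced_hosts(edit["added_text"]):
            domain = listed_domain_for(host)
            if domain is not None:
                hits.add(domain)
        if hits:
            flagged.append({**edit, "domains": sorted(hits)})
    return flagged


def pages_by_domain(flagged):
    """Map each listed domain to the set of pages where flagged edits cited it."""
    pages = defaultdict(set)
    for edit in flagged:
        for domain in edit["domains"]:
            pages[domain].add(edit["page"])
    return dict(pages)


# Invented sample data, only to exercise the functions above.
blocked_users = {"BlockedA", "BlockedC"}
edits = [
    {"user": "BlockedA", "page": "Page one",
     "added_text": "<ref>https://www.state-media.example/story</ref>"},
    {"user": "BlockedA", "page": "Page two",
     "added_text": "<ref>{{cite web|url=https://news.example.org/item|title=x}}</ref>"},
    {"user": "ActiveB", "page": "Page one",
     "added_text": "<ref>https://state-media.example/other</ref>"},
    {"user": "BlockedC", "page": "Page three",
     "added_text": "[https://en.sponsored-news.example/a report]"},
]

flagged = flag_edits(edits, blocked_users)
for edit in flagged:
    print(edit["user"], edit["page"], edit["domains"])
```

A filter like this only narrows down candidates; as the abstract itself indicates, the researchers still had to read the flagged edits to judge whether they exhibited the narratives in question.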

How existing research efforts could help editors fight disinformation
Exactly two years ago, in the run-up to the 2020 US elections, the Wikimedia Foundation published a blog post noting concerns about a "rising rate and sophistication of disinformation campaigns" on the internet by coordinated actors, about elections and other topics such as the global pandemic or climate change, and providing a summary of how Wikipedia specifically was addressing such threats.

After mentioning the volunteer community's "robust mechanisms and editorial guidelines that have made the site one of the most trusted sources of information online" and announcing an internal anti-disinformation task force at the Foundation (which reportedly still exists, although one former member recently stated they were unaware what its current work areas are) as well as "strengthened capacity building by creating several new positions, including anti-disinformation director and research scientist roles," the post focused on summarizing how "the Foundation's research team, in collaboration with multiple universities around the world, delivered a suite of new research projects that examined how disinformation could manifest on the site. The insights from the research led to the product development of new human-centered machine learning services that enhance the community's oversight of the projects. These algorithms support editors in tasks such as detecting unsourced statements on Wikipedia and identify malicious edits and behavior trends."

With the US mid-term elections imminent and independent researchers apparently being unaware of these research projects at the Foundation (see above), now seems a good time to take a look at how they have developed in the meantime. As "some of the tools used or soon available to be used by editors", the October 2020 post listed the following:


 * An algorithm that identifies unsourced statements or edits that require citation. The algorithm surfaces unverified statements; it helps editors decide if the sentence needs a citation, and, in return, human editors improve the algorithm’s deep learning ability.
 * According to the corresponding project page on Meta-wiki, this project (launched in 2017) is still in progress, but it has already resulted in a paper presented at the 2019 World Wide Web Conference ("Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability"). The project page mentions plans "to work in close contact with the [developers of Citation Hunt, an existing community tool that identifies unsourced statements for editors to fix] and the Wikipedia Library communities. We will pilot a set of a recommendations, powered by the new citation context dataset, to evaluate if our classifiers can help support community efforts to address the problem of unsourced statements."
 * (See also "Improving Wikipedia Verifiability with AI" and "Countering Disinformation by Finding Reliable Sources", below)


 * Algorithms to help community experts to identify accounts that may be linked to suspected sockpuppet accounts.
 * This project resulted in a working prototype by December 2020, but it appears not yet to have been put into production and made available as a tool for the intended audience (checkuser community members). It had been preceded by earlier research efforts at the Foundation as well as various independent academic publications that also tackled this detection problem.


 * A machine learning system to detect inconsistencies across Wikipedia and Wikidata, helping editors to spot contradictory content across different Wikimedia projects.
 * This project (a collaboration with researchers from Korea's KAIST, the Chinese University of Hong Kong, and Taiwan's NCKU) has since been completed, resulting in a dataset that aligned Wikipedia and Wikidata statements using natural language processing (NLP) techniques. The project documentation on Meta-wiki doesn't mention an implementation of the system that would be directly usable by editors.


 * A daily report of articles that have recently received a high volume of traffic from social media platforms. The report helps editors detect trends that may lead to spikes of vandalism on Wikipedia helping them identify and respond faster.
 * This report (English Wikipedia version: User:HostBot/Social media traffic report) was made available to Wikipedia editors as a pilot project in spring 2020, which concluded at the end of 2021. (It implemented a recommendation from a 2019 report about "Patrolling on Wikipedia".) The research project page on Meta-wiki lists several conclusions, including:
 * "the organic traffic coming from external platforms like YouTube and Facebook that link to Wikipedia articles as context for examining the credibility of content is not having a significant deleterious impact on Wikipedia or placing an additional burden on patrollers." (mitigating earlier concerns that had been voiced by the Wikimedia Foundation and others when YouTube introduced these links back in 2018)
 * "early evidence suggests that the Social Media Traffic Report as an intervention has not led to a substantial change in patrolling behavior around these articles"

Furthermore, the 2020 post mentioned the (at that time already widely used) ORES system.

The efforts appear to be part of the WMF Research team's "knowledge integrity" focus, announced in February 2019 in one of four "white papers that outline our plans and priorities for the next 5 years".

Briefly

 * See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
 * The Wikimedia Foundation's Research and Security teams are requesting input from researchers "to help prioritize the release of data that can be useful for your research", such as country-wise pageview and editor numbers.

Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Facebook/Meta research on "Improving Wikipedia Verifiability with AI"
From the abstract: "We develop a neural network based system, called Side [demo available at https://verifier.sideeditor.com/], to identify Wikipedia citations that are unlikely to support their claims, and subsequently recommend better ones from the web. We train this model on existing Wikipedia references, therefore learning from the contributions and combined wisdom of thousands of Wikipedia editors. Using crowd-sourcing, we observe that for the top 10% most likely citations to be tagged as unverifiable by our system, humans prefer our system's suggested alternatives compared to the originally cited reference 70% of the time. To validate the applicability of our system, we built a demo to engage with the English-speaking Wikipedia community and find that Side's first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims according to Side. Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia." See also the research project page on Meta-wiki.

"Countering Disinformation by Finding Reliable Sources: a Citation-Based Approach"
From the abstract: "Given a one sentence claim, the challenge is to automatically find a knowledge source (e.g. a book, a research article, a web page) that could support or refute the claim. We show that this capability could be learnt by observing associations between sentences in English Wikipedia and citations provided for them. Thus, we collect a corpus of over 50 million references to 24 million identified sources with the citation context from Wikipedia, and build search indices using several meaning representation methods."
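
To illustrate the general idea of retrieving a source for a claim from sentence-citation pairs, here is a toy sketch. It is not the paper's method: a plain bag-of-words cosine similarity stands in for the "meaning representation methods" the abstract mentions, and the class name `CitationIndex`, the example sentences and the source labels are all invented for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words vector (lowercased token counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class CitationIndex:
    """Toy index over (sentence, cited source) pairs."""

    def __init__(self):
        self._entries = []  # list of (sentence vector, source label)

    def add(self, sentence, source):
        self._entries.append((vectorize(sentence), source))

    def search(self, claim, top_k=3):
        """Rank sources by their best-matching citing sentence."""
        query = vectorize(claim)
        best = {}
        for vector, source in self._entries:
            score = cosine(query, vector)
            if score > best.get(source, 0.0):
                best[source] = score
        ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
        return ranked[:top_k]


# Invented sentence-citation pairs, only to exercise the index.
index = CitationIndex()
index.add("The river floods the valley every spring.",
          "Hypothetical hydrology handbook")
index.add("The composer wrote nine symphonies.",
          "Hypothetical music biography")
index.add("Spring floods in the valley damaged the bridge.",
          "Hypothetical regional news report")

results = index.search("Floods hit the valley in spring")
for source, score in results:
    print(f"{score:.2f}  {source}")
```

Word overlap of this kind cannot tell whether a source supports or refutes a claim; it only surfaces sources cited near similar wording, which is why the paper works with richer representations.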

New book: A Discursive Perspective on Wikipedia: More than an Encyclopaedia?
From the publisher's description: "This book provides a concise yet comprehensive guide to Wikipedia for researchers and students of linguistics, discourse and communication studies, redressing the gap in research on Wikipedia in these fields and encouraging scholars to explore Wikipedia further as a platform and a medium. Drawing on [Susan] Herring's situational and medium factors [in computer-mediated communication], as well as related developments in (critical) discourse studies, the author studies the online encyclopaedia both theoretically and empirically, examining its origins, production and consumption before turning to a discussion of its societal significance and function(s)."

"Russian Wikipedia vs Great Russian Encyclopedia: (Re)construction of Soviet Music in the Post-Soviet Internet Space"
From the abstract: "The referential texts in the Russian Wikipedia and the Great Russian Encyclopedia [...] were selected as examples for the analysis. A comparative analysis of articles on music and the composers who lived and worked in the USSR (including Sergei Prokofiev, Dmitri Shostakovich, Dmitri Kabalevsky, Tikhon Khrennikov, Boris Asafiev, Isaak Dunaevsky, Georgy Sviridov, Aram Khachaturian, Sofia Gubaidulina and Alfred Schnittke) displayed a number of regularities: emphasizing previously unknown areas of music of that period ("avant-garde music", "repressed music"), replacement or disregard towards the epithet "Soviet" regarding musical phenomena and composers, and the absence of any nostalgia for Soviet musical culture in modern receptions."