Wikipedia:Wikipedia Signpost/2024-01-31/Recent research

A "lack of bureaucratic openness and rules constraining administrator behavior" enabled nationalist takeover of Croatian Wikipedia

 * Reviewed by Bri and Tilman Bayer

A paper titled "Governance Capture in a Self-Governing Community: A Qualitative Comparison of the Serbo-Croatian Wikipedias" (accepted for publication in the CSCW 2024 proceedings) examines the well-known case of the Croatian Wikipedia's hijacking by far-right nationalists (from at least 2011 to 2020), and asks why the similarly situated Serbian, Bosnian and Serbo-Croatian Wikipedias managed to escape this fate.

As summarized in a post by the University of Washington's Center for an Informed Public (an interdisciplinary center involving UW's Information School, School of Law, and Department of Human Centered Design & Engineering), on the Croatian Wikipedia

This has already been documented in detail in a report commissioned by the Wikimedia Foundation (see e.g. prior Signpost coverage: "Croatian Wikipedia: capture and release", Disinformation report, 2021-06-27 and "Wikimedia Foundation builds 'Knowledge Integrity Risk Observatory' to enable communities to monitor at-risk Wikipedias", Recent research, 2022-11-28). As summarized in the present paper, "In part, the [WMF's] report attributed Croatian Wikipedia’s capture to a unique situation in which there were distinct Wikipedia editions for the standardized national variants of a pluricentric language: Bosnian-Croatian-Montenegrin-Serbian (BCMS), sometimes referred to as Serbo-Croatian. This explanation, however, raises the question of why Serbian and Bosnian Wikipedia did not appear to suffer Croatian [Wikipedia's] fate."

To answer this question, the authors focus in particular on the comparison with Serbian Wikipedia (the largest of the four BCMS language Wikipedias; a Montenegrin Wikipedia does not exist currently, whereas the Serbo-Croatian Wikipedia, while catering to all the national variants, was deemed to be a less attractive takeover target due to its smaller audience and lack of "national resonance"). Their findings point to weak policies and norms that allowed capture to happen, especially the lack of policies around blocking, and to the importance of integrity among the community's bureaucrats (users who can grant and remove admin permissions).

The researchers used a grounded theory approach, specifically a "qualitative analysis of interview data with a range of participants in Croatian and Serbian Wikipedia and in the broader Wikipedia community" (15 interviews in total). Based on this,

The authors state that their paper is the first academic work they know of "that has considered how distributed influence operations target, become deeply engaged with, and are facilitated by institutional and organizational arrangements within peer production communities like Wikipedia".

Among the limitations acknowledged in the paper, "none of its authors are fluent BCMS speakers. As a result, interviews were conducted in English." However, they attempted to compensate for this potential loss of relevant interviewees by also examining policy-related talk page discussions using Google Translate.

Perhaps more seriously, while the paper's insights certainly deserve wide attention from everyone concerned with similar issues in the Wikimedia movement, they are based on a single case: the authors note "that Croatian Wikipedia reflects only one potential path." They point to the case of Chinese Wikipedia, where "infiltration concerns" had led the Wikimedia Foundation to ban several admins in 2021 (Signpost coverage), illustrating "government pressure" as an important additional factor that "Future research could extend our framework" with. However, the authors do not mention that the Chinese Wikipedia case also provides important information relevant to factors that their paper did focus on and made conclusions about. For example, the Chinese Wikipedia community decided early on to build a single language project instead of separate ones for national variants of the Chinese language, aided by an (at the time) innovative automatic conversion system. As summarized in a 2009 paper, "Chinese Wikipedia (CW) [...] has accommodated diverse Chinese-speaking contributors, despite the linguistic, regional, and political differences between four regions (Mainland China, Hong Kong/Macau, Taiwan, and Singapore/Malaysia). In the creation of CW, a technological polity was built by localizing Wikipedia’s governance principles, implementing Chinese character conversion, and establishing the “Anti-Regionalism Policy” (避免地域中心) [...an editorial policy that] addresses regional issues beyond those at the technolinguistic level. This policy does not exist in the English Wikipedia. An antidote to the current [2009] Chinese cyber-nationalism, the policy mandates that China-centric, Han-centric, and Chinese-centric statements should be avoided."
One can't help wondering if a similar "anti-regionalism policy" could have been an effective "antidote" against Croatian nationalism, too, and whether using a similar technology-aided conversion between writing systems of Serbo-Croatian early on could have helped maintain Serbo-Croatian Wikipedia as a common locus of collaboration instead of being overtaken by the nationally focused Croatian and Serbian Wikipedias. (Both the Serbian and Serbo-Croatian Wikipedia did eventually adopt automatic conversion systems.) Unfortunately, the present interview study fails to address such questions.

Briefly

 * See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.''
 * Compiled by Tilman Bayer

"Why do you need 400 photographs of 400 different Lockheed Constellations" on Commons?
From the abstract: "We review prior studies of Commons-Based Peer Production (CBPP) identifying four common value dimensions previously noted as present in CBPP: usage value, social value, ideological value, and monetary value. We use this synthetic framework to analyze a dataset of 32 interviews with contributors to Wikimedia Commons and editors of Wikipedia who use Commons resources. Our analysis supports the prior values categories while expanding how some dimensions are expressed by participants. We also highlight four additional value dimensions that were not previously identified in CBPP: cultural heritage value, rarity value, aesthetic value, and administrative value." These 32 interviews are apparently the same as those that already served as the basis of an earlier, related paper by the same authors (cf. our review: "Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration").

"From academic to media capital: To what extent does the scientific reputation of universities translate into Wikipedia attention?"
From the abstract:  "[...] in most cases estimates of scientific reputation are based on composite or weighted indicators and absolute positions in university rankings. In this study, we adopt a more granular approach to assessment of universities' scientific performance using a multidimensional set of indicators from the Leiden Ranking and testing their individual effects on university [English] Wikipedia page views. We distinguish between international and local attention and find a positive association between research performance and Wikipedia attention which holds for regions and linguistic areas. Additional analysis shows that productivity, scientific impact, and international collaboration have a curvilinear effect on universities' Wikipedia attention. This finding suggests that there may be other factors than scientific reputation driving the general public's interest in universities."

"NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages"
Including loan words in a training corpus for natural language processing, a linguistic-computational technique closely interrelated with recent advances in artificial intelligence, can degrade the fidelity of a model that is supposed to represent the native language rather than the language the loan words come from. According to the authors, the Wikipedias in Indonesian languages (there are several) suffer from this defect because of their relatively high fraction of loan words. From a Twitter/X thread by one of the authors of this preprint: "Scraped data such as from Wikipedia is vital for NLP, but how reliable is it in low-resource settings? [...] We explore 2 methods of building a corpus for 12 underrepresented Indonesian languages: by human translation, and by doing free-form paragraph writing given a theme. We then compare their quality vs Wikipedia text. [Compared to] Wikipedia data, both Nusa Translation (NusaT) and Nusa Paragraph (NusaP) are generally more lexically diverse and use fewer loan words. We also realize that apparently some of the Wikipedia pages for low-resource languages are mostly boilerplate. [...] To conclude: - We release NusaT and NusaP, high-quality corpus for 12 underrepresented languages - Underrepresented languages corpus from Wikipedia does not represent the true language distribution [...]"

"Loanword identification based on web resources: A case study on Wikipedia"
From the abstract: "To alleviate the resource scarcity and improve the robustness in loanword identification, the current study proposes a novel loanword identification method based on Wikipedia. In this paper, we first present how to obtain loanword candidate datasets and comparable corpora from Wikipedia. On the basis of these corpora, we develop a pseudo-data generation model for loanword identification tasks. And then we put forward a loanword identification model [...]" From the introduction: "In order to evaluate the performance of our method, we have applied it to different receipt languages (Uyghur, Chinese and English). Experimental results showed that the proposed method achieves the best performance compared with other baseline systems in all domains."

"Time Lag Analysis of Adding Scholarly References to English Wikipedia"
From the abstract:  "... [In] a time-series analysis of adding scholarly references to the English Wikipedia as of October 2021 [...] we detect no tendencies in Wikipedia articles created recently to refer to more fresh references because the time lag between publishing the scholarly articles and adding references of the corresponding paper to Wikipedia articles has remained generally constant over the years. In contrast, tendencies to decrease over time in the time lag between creating Wikipedia articles and adding the first scholarly references are observed. The percentage of cases where scholarly references were added simultaneously as Wikipedia articles are created is found to have increased over the years, particularly since 2007–2008. This trend can be seen as a response to the policy changes of the Wikipedia community at that time ..." See also:
 * excerpt
 * slide deck
 * Our review of an earlier, related paper by the same authors: "The first scholarly references on Wikipedia articles, and the editors who placed them there"

"Wikipedia as a tool for contemporary history of science: A case study on CRISPR"
From the abstract: "Using a mixed-method approach, we qualitatively and quantitatively analyzed the CRISPR article’s text, sections and references, alongside 50 affiliated articles. These, we found, documented the CRISPR field’s maturation from a fundamental scientific discovery to a biotechnological revolution with vast social and cultural implications. We developed automated tools to support such research and demonstrated its applicability to two other scientific fields–coronavirus and circadian clocks."