Wikipedia:Wikipedia Signpost/2023-06-19/Recent research

New articles about currently popular topics are more likely to be hoaxes
A study titled "The role of online attention in the supply of disinformation in Wikipedia" finds that "[...] compared to legitimate articles created on the same day, hoaxes [on English Wikipedia] tend to be more associated with traffic spikes preceding their creation. This is consistent with the idea that the supply of false or misleading information on a topic is driven by the attention it receives. [... a finding that the authors hope] could help promote the integrity of knowledge on Wikipedia."

The authors remark that "little is known about Wikipedia hoaxes" in the research literature, with only one previous paper focusing on this topic (Kumar et al., who among other findings had reported in 2016 that "while most hoaxes are detected quickly and have little impact on Wikipedia, a small number of hoaxes survive long and are well cited across the Web"). In contrast to that earlier study (which had access to the presumably more extensive new pages patrol data about deleted articles), the authors base their analysis on the list of hoaxes curated by the community.

Before getting to their main research question about the relation between online attention to a topic (operationalized using Wikipedia pageviews) and disinformation, the authors analyze how these confirmed hoax articles (in their initial revision) differ in various content features from a cohort of non-hoax articles started on the same day, concluding that "[...] hoaxes tend to have more plain text than legitimate articles and fewer links to external web pages outside of Wikipedia. This means that non-hoax articles, in general, contain more references to links residing outside Wikipedia. Such behavior is expected as a hoax's author would need to put a significant effort into crafting external resources at which the hoax can point." In other words, successful hoaxers might be displaying an anti-FUTON bias.

To quantify the attention paid to the topic area that an article pertains to (even before the article is created), the authors use its "wiki-link neighbors": "The presence of a link between two Wikipedia entries is an indication that they are semantically related. Therefore, traffic to these neighbors gives us a rough measure of the level of online attention to a topic before a new [Wikipedia article] is created."

"To understand the nature of the relationship between the creation of hoaxes and the attention their respective topics receive, we first seek to quantify the relative volume change [in attention] before and after this creation day. Here, a topic is defined as all of the (non-hoax) neighbors linked within the contents of an article i.e., its (non-hoax) out-links."

This "volume change" is calculated as the change in median pageviews among the neighboring articles, using the timespans from 7 days before and 7 days after the creation of an article, to account for "short spikes in attention and weekly changes in traffic" (a somewhat simplistic way of handling this kind of time-series analysis, compared to some earlier research on Wikipedia traffic). The authors limited themselves to an older pageviews dataset that only covers 2007 to 2016, reducing their sample for this part of the analysis to 83 hoaxes. Of those, 75 exhibited a greater attention change than their cohort (of non-hoax articles started on the same day). Despite the relatively small sample size, this finding was judged to be statistically significant (more precisely, the authors find "a bootstrapped 95% confidence interval of (0.1227, 0.1234)" for the difference between hoax and non-hoax articles, far away from zero). In conclusion, this "indicates that the generation of hoaxes in Wikipedia is associated with prior consumption of information, in the form of online attention."
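To make the volume-change comparison more concrete, here is a minimal Python sketch. It is not the authors' code, and the paper's exact formula is not reproduced in this summary. It assumes that the daily attention to a topic is the median pageview count across an article's neighbors, that the 7-day windows are averaged, and that the relative change is (before - after) / before, so that a positive value means more attention before creation. The function names `relative_volume_change` and `bootstrap_mean_ci` are hypothetical, and the percentile bootstrap is one common way to obtain such an interval, not necessarily the one used in the paper.

```python
import random
from statistics import mean, median


def relative_volume_change(neighbor_views, creation_index, window=7):
    """Relative change in topic attention around an article's creation day.

    neighbor_views: list of equal-length daily pageview lists, one per
        (non-hoax) out-link neighbor of the article.
    creation_index: index of the creation day within those lists.
    Assumption: change = (before - after) / before, where each side is the
    mean over `window` days of the per-day median across neighbors.
    """
    n_days = len(neighbor_views[0])
    if creation_index - window < 0 or creation_index + window >= n_days:
        raise ValueError("not enough days around the creation day")
    # Per-day median across all neighbors approximates topic attention.
    daily = [median(series[d] for series in neighbor_views) for d in range(n_days)]
    before = mean(daily[creation_index - window:creation_index])
    after = mean(daily[creation_index + 1:creation_index + 1 + window])
    if before == 0:
        raise ValueError("no traffic before creation; relative change undefined")
    return (before - after) / before


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data (invented for illustration): two neighbors, 15 days, creation on day 7.
views = [
    [100] * 7 + [80] + [50] * 7,
    [200] * 7 + [160] + [100] * 7,
]
change = relative_volume_change(views, creation_index=7)
```

In an analysis along the lines described above, one would compute this change for each hoax and for its same-day cohort of non-hoax articles, take the per-hoax differences, and pass them to `bootstrap_mean_ci`; an interval that excludes zero would correspond to the kind of result the authors report.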

Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
 * Compiled by Andreas Kolbe and Tilman Bayer

"Information literacy in South Korea: similarities and differences between Korean and international students' research trajectories"
From the abstract: Work on students' information literacy and research trajectories is usually based on studies of Western, English-speaking students. South Korea presents an opportunity to investigate an environment where Internet penetration is very high, but local Internet users operate in a different digital ecosystem than in the West, with services such as Google and Wikipedia being less popular. [...] We find that Korean students use Wikipedia but less so than their peers from other countries, despite their recognition that Wikipedia is more reliable and comprehensive than the alternatives. Their preferences are instead affected by their perception of Wikipedia as providing an inferior user experience and less local content than competing, commercial services, which also benefit from better search engine result placement in Naver, the search engine dominating the Korean market.

"A Wiki-Based Dataset of Military Operations with Novel Strategic Technologies (MONSTr)"
From the abstract: This paper introduces a comprehensive dataset on the universe of United States military operations from 1989 to 2021 from a single source: Wikipedia. Using automated extraction techniques on its two structured knowledge databases – Wikidata and DBpedia – we uncover information about individual operations within nearly every post-1989 military intervention described in existing academic datasets. The data we introduce offers unprecedented coverage and granularity that enables analysis of myriad factors associated with when, where, and how the United States employs military force. We describe the data collection process, demonstrate its contents and validity, and discuss its potential applications to existing theories about force employment and strategy in war.

"European Wikipedia platforms, sharing economy and national differences in participation: a case study"
From the abstract: The following exploratory study considers which macro-level factors can lead to the sharing economy being more popular in certain countries and less so in others. An example of commons-based peer production in the form of the level of contributing to Wikipedia in 24 European countries is used as a proxy for participation in the sharing economy. Demographic variables (number of native speakers, a proxy for population) and development ones (Human Development Index-level and internet penetration) are found to be less significant than cultural values (particularly self-expression and secular-rational axes of the Inglehart-Welzel model). Three clusters of countries are identified, with the Scandinavian/Baltic/Protestant countries being roughly five times as productive as the Eastern Europe/Balkans/Orthodox ones.

"Wikipedia, or: the discreet neutralisation of antiracism. A look behind-the-scenes of an encyclopedic giant"
From the abstract (translated):  Journalist and anti-racist activist, known in particular for her work on police crimes, Sihame Assbague has observed the evolution of her own [French] Wikipedia biography. Noticing errors, approximations, sometimes harmless, often intended to harm her, she notes the virtual impossibility of rectifying them — the Wikipedian community asks for sources that are impossible to provide and detractors block her modifications. She then decided to investigate the functioning of the "free encyclopedia", whose watchwords are: freedom, self-management, transparency and neutrality. But are these principles applicable in the same way to the article "Leek" as to the page of an anti-racist campaigner? What does it mean to be "neutral" about "systemic racism" or "anti-white racism"? And who writes these notices? Immerse yourself in the inner workings of the world's largest encyclopedia.

"Between News and History: Identifying Networked Topics of Collective Attention on Wikipedia"
From the abstract: [...] how is information on and attention towards current events integrated into the existing topical structures of Wikipedia? To address this [question] we develop a temporal community detection approach towards topic detection that takes into account both short term dynamics of attention as well as long term article network structures. We apply this method to a dataset of one year of current events on Wikipedia to identify clusters distinct from those that would be found solely from page view time series correlations or static network structure. We are able to resolve the topics that more strongly reflect unfolding current events vs more established knowledge by the relative importance of collective attention dynamics vs link structures. We also offer important developments by identifying and describing the emergent topics on Wikipedia.