Wikipedia:Wikipedia Signpost/2023-08-31/Recent research

"Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration"
The vast majority of academic research about Wikimedia projects continues to focus on Wikipedia and (in recent years) Wikidata. Publications about sister projects such as Wiktionary, Wikinews or Wikibooks exist, but are rare. A paper titled "Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration" is one of the very first social science publications to examine Wikimedia Commons (albeit still in tandem with Wikipedia). That's despite Commons being, as the authors highlight, "the world’s largest online repository of free multimedia files for anyone to contribute and use. To date, there are more than 10.5 million volunteers and over 77 million media files on Commons."

The term "stitching" in the paper's title refers to an existing concept from the field of CSCW (Computer-supported cooperative work). The authors define it as follows:

The paper examines in detail how these three cross-platform processes work in the case of (English) Wikipedia and Wikimedia Commons, considering the role of Commons in hosting images and other media used on Wikipedia (and other Wikimedia projects). It is based on an interview study with 32 Wikimedians working on both projects – from newcomers (<1k edits) to "highly active editors" – five of whom self-identified as women. Among many other examples of such Commons-Wikipedia stitching, they describe e.g. the cropping or retouching of Commons images to make them more suitable for use on Wikipedia, or aligning Commons categories with Wikipedia article names. (These contrast with activities that focus on only one of the projects – such as text editing on Wikipedia, and image uploading, image annotating, metadata tagging and categorizing on Commons. Regarding the latter, the authors observe "a large group of Commons focused editors who categorize images. Categories is 'the primary way to organize and find files on Commons'", quoting one of the interviewees.)

While much of this will come as no surprise to Wikimedians familiar with both projects, the paper's second research question provides food for thought to both the involved volunteer communities and the Wikimedia Foundation (or other actors interested in designing better collaboration features in this area). Here, the authors identify five "barriers that inhibit effective stitching between Wikipedia and Wikimedia Commons", and propose some "design implications" that could mitigate them:

"Barrier 1: Lack of Communication Across Networks"
The paper observes the existence of

The authors (perhaps wisely) don't propose concrete solutions for this issue, but rather list a few "[p]rior studies in CSCW/HCI [human-computer interaction] [that] investigated similar situations in which stakeholders of a design problem were distributed", and suggest that "WMF could explore these approaches to engage editors distributed across platforms in a participatory design process to address their communication needs."

"Barrier 2: Differing Perspectives"
Here, the paper discusses tensions arising from the differing self-perception of each project – Wikipedia as "reference" work vs. Wikimedia Commons as "collection". This manifests e.g. in the question of whether Commons should primarily be seen as a media repository in itself, or as infrastructure for other Wikimedia projects. Specifically, the authors note debates on whether it should aim to host more images of a subject than could conceivably be needed to illustrate pages on other projects: the paper opens with the example of a Wikipedia editor looking to illustrate the article on the Boeing 777 airplane and getting overwhelmed with the search results on Commons: "22,572 images for Boeing 777 with 5,686 categories and multiple pages created by different curators who work to sort images." (As a counterargument illustrating "the difficulties of judging the utility of Commons resources as a function of their use in other WMF platforms", another interviewee mentioned the example of a particular Boeing 777 becoming notable after an accident: "And suddenly, that photograph of that aeroplane we were hosting on Commons appeared in newspapers and all over the place, because it was the only freely available photograph they could find of that exact aircraft.")

As a solution to such issues, the authors (somewhat vaguely) suggest "a process similar to Wikipedia's notability voting. The process would enable editors from both platforms to figure out whether an image warrants significance in any contexts collaboratively, rather than relying on judgement of editors from one platform or the other."

"Barrier 3: Multilingual Resources"
The authors note that Commons is multilingual in theory (with many documentation pages, templates etc. being translatable and available in multiple languages), but in practice mostly "produced and curated by English speakers". In particular, they call out the limitations of the search function:

The paper remarks that the WMF-led "Structured Data on Commons" project (launched in 2017 with a $3 million grant from the Alfred P. Sloan Foundation) aims to improve this by incorporating multilingual information from Wikidata, but that it has "made little progress on Commons because many contributors simply did not know about it or did not care", or "preferred their 'own' [category-based] system over a new structure designed by the foundation". (Here it is worth noting that the study's interviews took place from April 2020 to January 2021, i.e. shortly before the default search interface was switched to "Media search", which is supposed to eventually integrate such structured information. However, as of this time – August 2023 – it retains the same limitations of text-based search.) As a possible way out of this conundrum, the authors suggest that

"Barrier 4: Cross-Platform Vandalism"
This issue mainly refers to the problem that vandals can overwrite an image on Commons to affect articles on Wikipedia, which is difficult for Wikipedia editors to detect using their existing monitoring processes. And on the other side, "Though Wikimedia Commons can track and detect when an image is overwritten, it is hard to evaluate the legitimacy of the overwrite because the context of reuse is unknown."

The authors note that this is partly a technical issue, as cross-platform notifications could be implemented to alert Wikipedians of such incidents. However, they argue that "Even if these notifications existed, these platforms would need to collaborate on addressing the problem. In the general case, resolving barriers will require technical and social collaboration across or between platforms."

"Barrier 5: Differing Policies"
Here, the paper gives two examples. The first one is about copyright:

(Contrary to what the authors appear to imply here, the "Assume Good Faith" policy specifically clarifies that "When dealing with possible copyright violations, good faith means assuming that editors intend to comply with site policy and the law. That is different from assuming they have actually complied with either. Editors have a proactive obligation to document image uploads, etc. [...]")

As a second "misalignment", the authors call out "the differences between Wikipedia and Commons reliance on data sources":

Here, the paper doesn't offer solutions, apart from the already mentioned general proposals to improve cross-platform and cross-network communications.

Other recent publications
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.''

"Evolution of the Coordination of Activities Aimed at Building Knowledge in the Wikipedia Community"
From the abstract:  "This paper aims to characterize the variability in creating new concepts of cooperation in selected language versions of Wikipedia and identify the factors of participating in various forms of cooperation. [...] The research conducted was both qualitative and quantitative. A netnographic approach was used, as well as a statistical analysis of user activity records. Thanks to the netnographic research, the stages of Wikipedia’s evolution were identified. Quantitative research has shown a correlation between the number of activity areas (a user’s affiliation to WikiProjects) and their overall activity (the number of edits made). A change in Wikipedians’ activity style was also observed depending on their seniority on the website." (See also other recent publications co-authored by the same author: "Power Distance and Hierarchization in Organizing Virtual Knowledge Sharing in Wikipedia", "Wikipedia as a Space for Collective and Individualistic Knowledge Sharing")

"Quantifying the scientific revolution" using Wikipedia
From the abstract:  "The Scientific Revolution represents a turning point in the history of humanity. Yet it remains ill-understood, partly because of a lack of quantification. Here, we leverage large datasets of individual biographies (N = 22,943) and present the first estimates of scientific production during the late medieval and early modern period (1300–1850). [...] Finally, we investigate the interplay between economic development and cultural transmission (the so-called ‘Republic of Letters’) using partially observed Markov models imported from population biology. Surprisingly, the role of horizontal transmission (from one country to another) seems to have been marginal. Beyond the case of science, our results suggest that economic development is an important factor in the evolution of aspects of human culture."

From the paper:  "[...] we gathered the Wikipedia pages of all individuals classified as scientists during the early modern period: mathematicians, astronomers, physicists, biologists, naturalists, chemists, botanists, entomologists and zoologists (see Table S1). Then, we estimated the scientific contribution of each of these 6620 individuals through different proxies (the size of the page, the number of translations in other languages and the number of Wikipedia pages containing a link to this page). With such a large dataset, we can go beyond key figures such as Newton and Galileo, and take into account the thousands of little-known individuals who contributed to the rise of science."

"Comparison of metrics for measuring Wikipedia ecology: characteristics of self-consistent metrics for editor scatteredness and article complexity"
From the abstract:  "[...] To measure the quality of the editors and articles on Wikipedia, self-consistent metrics for the network defined by the edit relationship have been introduced previously. This scatteredness–complexity measure can evaluate the editors and articles more sensitively than the local characteristics such as degrees of the network and capture well the editors' activity and the articles' level of complexity. [...] In addition, the distributions of the editor scatteredness and article complexity become smoother when the network is randomized and loses its detailed local structure eliminating the correlation between the editors and articles. When the degree distributions of the editors or articles are changed and become uniform in the randomized network, the distributions of the editor scatteredness or article complexity become flatter, respectively. These results suggest that the scatteredness–complexity measure reflects not only the degree distribution of the editors or articles but also the local network structure."

"Why a Guideline for Spoilers? A comparison between Spoiler Guidelines, related user comments of Wikis, Newssites and Fan Forums"
This master's thesis includes a chapter about discussions on English Wikipedia. From the abstract:  "The aim of this thesis was [...] to understand Spoiler Avoidance (SA) from an Information Avoidance (IA) view, treating it as an example of beneficial IA. [...] a number of Guidelines [about] spoilers were collected from Reddit, Fandom, multiple newssites, Wikipedia and Google. Results were found for multiple levels of abstraction. Firstly, spoiler guidelines exist due to difficulties in defining spoilers, different aims of websites and the different desires for users. Secondly it could be found that SA was not always assumed to be positive, but can be explained through many IA-theories. [...]"