Wikipedia:Wikipedia Signpost/2020-11-01/Recent research

OpenSym 2020
The fifteenth edition of the annual OpenSym conference (originally known as WikiSym) took place as an online event on August 26–27, 2020. Pre-pandemic, it had been expected to be held as a physical event in Madrid, Spain, which is now envisaged to become the location of next year's OpenSym instead. The program included several papers about Wikipedia and Wikidata:

"Exploring Systematic Bias through Article Deletions on Wikipedia from a Behavioral Perspective"

 * Reviewed by OpenSexism

In "Exploring Systematic Bias through Article Deletions on Wikipedia from a Behavioral Perspective", the authors ask "Is content supposedly of more interest to women being actively deleted from Wikipedia?" To answer this question, they identify "a broad set of Wikipedia article pages that may have interest to a given gender" using a set of terms drawn from popular magazines whose declared readership is primarily men or women (terms were identified at two points in time, 2004 and 2014). The identified terms are then matched to Wikipedia articles to determine the most likely audience for each. (The authors include a list of identified terms in an appendix, where one can see whether men or women are associated with things like balsamic vinegar, bagel, bandage, biomedical engineering, dishwashing, and constipation.)

Once the Wikipedia content is matched to a demographic, the authors use Wikipedia's public deletion logs to collect deletion information. Comparing deletion rates, they find "no significant qualitative differences in the rates of AfD ["Articles for deletion (AfD) is where Wikipedians discuss whether an article should be deleted"] or CSD ["The criteria for speedy deletion (CSD) specify the only cases in which administrators have broad consensus to bypass deletion discussion, at their discretion, and immediately delete Wikipedia pages or media"] for articles of supposed interest to women compared to articles of supposed interest to men." Regarding the 2014 terms, they also note that from "our initial list of topics of supposed interest to women, about 31.9% of topics could not be matched to an article using the matching method, and 15.1% of terms of supposed interest to men were in the same condition. These represent potential content that is not currently in Wikipedia."

"To be fair," the authors note in the discussion, "there is more content of possible interest to women that was, likely, never included and therefore it is not possible for it to be deleted." This observation is also echoed in the brief section dedicated to biographies: "That is, biographies that might be of more interest to women are perhaps not being deleted or nominated for deletion simply because they are not there in the first place." The authors conclude by urging that "future work should be done around the more pernicious ways that system bias is reinforced."

Recommended reading: "Wikipedia Has a Bias Problem" by Jackie Koerner. See also the book review in this Signpost issue.



"Who Writes Wikipedia? An Investigation from the Perspective of Ortega and Newton Hypotheses"

 * Reviewed by Tilman Bayer

It has long been debated whether Wikipedia's success rests more on the work of a small core of highly active editors, or the infrequent contributions of a large number of casual editors. (One such discussion took place in 2006 between Wikipedia founder Jimmy Wales and Aaron Swartz, then a Wikimedia Foundation board of trustees candidate, and has been referred to in the context of subsequent research.)

This paper (which alludes to Swartz' 2006 analysis in its title) presents an extensive data analysis aimed at illuminating this question, framed using concepts from the sociology of science: The "Ortega hypothesis" (named after José Ortega y Gasset, author of The Revolt of the Masses), which posits that the progress of science mainly rests on many smaller contributions, versus the "Newton hypothesis" which instead asserts that it's rather a small number of genius scientists who are advancing science (named after Isaac Newton's dictum that he "stood on the shoulders of giants"). To quote the "conclusion" section:  "The study visits the prevalent belief that only top 1% of the users in peer-production communities are sufficient for running the system, as proclaimed by the existing rules such as 1–9–90 rule and Newton Hypothesis. The analysis highlights that in Wikipedia, the masses who interact with the portal very infrequently, are also required in the system for their small but useful pieces of contribution in bringing new pieces of knowledge to the articles. The results endorse the claims of the Ortega hypothesis in Wikipedia and recommend examining and reconsidering system policies made solely based on Newton Hypothesis."

The analysis is based on the revision history of the 100 most edited articles on the English Wikipedia, examined using the KDAP tool (which was co-developed by one of the authors and presented in a separate paper at the conference, see below). The contributions of masses and elites are examined through three research questions:
 * 1) focused on what new content they insert – measured e.g. by the number of words, images, references, or wikilinks (interpreted as "factoids") contributed
 * 2) "the proportion of good content contribution by masses" as opposed to vandalism, based on ORES content scores
 * 3) "activities involving up-keeping of the existing content such as restructuring and formatting of the content", defined as edits that either removed content, changed "positioning of the existing content" without adding new content, or introduced formatting changes such as wikilinks, external links or bolding text

The paper's data analysis is much more detailed and sophisticated than e.g. Swartz' brief 2006 study, but also involves some choices that cast doubt on the interpretation of the results, or at least rely on a definition of "community" that is quite different from those usually used in research and discussion about collaboration on Wikipedia. In particular, "mass" and "elite" are defined per article, using edit count percentiles, rather than via an editor's contributions and experience on Wikipedia overall. The authors briefly acknowledge this limitation: "[...] the analysis considers each Wikipedia article as a standalone community. Therefore, there may be cases where a user is making a large number of edits overall, but very few edits in the article under consideration. This might reflect a need to examine the entire English Wikipedia to judge each user’s overall contribution, thereby, considering the English Wikipedia to be a community ..." (Contrary to what the paper implies, the latter was also the approach used by Swartz in 2006, who defined elite users using their overall edit count on the entire site.) Another open question is how representative the 100 most edited articles are of Wikipedia's entire content of over 6 million articles.
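To illustrate why the choice of community matters, here is a minimal Python sketch that labels "elite" editors once per article and once site-wide. It is not the paper's or KDAP's code: the function names, the toy revision list, and the `top_fraction` cutoff are all assumptions made for this example.

```python
from collections import Counter

def top_editors(edit_counts, top_fraction=0.01):
    """Return the editors in the top `top_fraction` by edit count (at least one)."""
    # Sort by descending edit count; break ties by name for determinism.
    ranked = sorted(edit_counts, key=lambda e: (-edit_counts[e], e))
    cutoff = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:cutoff])

def classify(revisions, top_fraction=0.01):
    """revisions: iterable of (article, editor) pairs, one pair per edit.

    Returns (per_article_elite, sitewide_elite), where per_article_elite
    maps each article to its own elite set.
    """
    per_article = {}
    sitewide = Counter()
    for article, editor in revisions:
        per_article.setdefault(article, Counter())[editor] += 1
        sitewide[editor] += 1
    per_article_elite = {
        article: top_editors(counts, top_fraction)
        for article, counts in per_article.items()
    }
    return per_article_elite, top_editors(sitewide, top_fraction)

# Hypothetical toy data: bob edits article A once but article B eight times.
revisions = (
    [("A", "alice")] * 5 + [("A", "bob")] * 1
    + [("B", "bob")] * 8 + [("B", "carol")] * 2
)
# A large cutoff is used only because the toy data has three editors.
per_article_elite, sitewide_elite = classify(revisions, top_fraction=0.5)
```

In this toy data, bob counts as part of the "mass" on article A even though he is the most active editor overall, which is the kind of case the authors' caveat describes.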

Overlooked in Wikipedia research so far: Edit filters
From the abstract:  "The Wikipedia community [...] has built a sophisticated set of automated, semi-automated, and manual quality assurance mechanisms over the last fifteen years. The scientific community has systematically studied these mechanisms but one mechanism has been overlooked — edit filters. Edit filters are syntactic rules that assess incoming edits, file uploads or account creations. As opposed to many other quality assurance mechanisms, edit filters are effective before a new revision is stored in the online encyclopaedia. In the exploratory study presented, we describe the role of edit filters in Wikipedia’s quality assurance system. We examine how edit filters work, describe how the community governs their creation and maintenance, and look into the tasks these filters take over."
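The idea of a syntactic rule that is applied before a revision is stored can be sketched in a few lines of Python. This is only an illustration of the concept: the rules, the `EditFilter` class, and the `save_revision` function are hypothetical, and the sketch does not use the rule syntax of Wikipedia's actual edit filter system.

```python
import re
from dataclasses import dataclass

@dataclass
class EditFilter:
    name: str
    pattern: re.Pattern
    action: str  # illustrative actions: "warn" or "disallow"

# Hypothetical example rules, not taken from any real filter.
FILTERS = [
    EditFilter("repeated characters", re.compile(r"(.)\1{9,}"), "disallow"),
    EditFilter("shouting", re.compile(r"\b[A-Z]{15,}\b"), "warn"),
]

def check_edit(added_text, filters=FILTERS):
    """Return (filter name, action) for every rule the added text triggers."""
    return [(f.name, f.action) for f in filters if f.pattern.search(added_text)]

def save_revision(store, added_text):
    """Run the filters first; store the revision only if nothing disallows it."""
    hits = check_edit(added_text)
    if any(action == "disallow" for _, action in hits):
        return False, hits
    store.append(added_text)
    return True, hits

store = []
blocked = save_revision(store, "aaaaaaaaaaaa")
warned = save_revision(store, "THISISALLCAPSTEXTHERE")
clean = save_revision(store, "A useful sentence.")
```

The point of the sketch is the ordering: the check runs before `store.append`, so a disallowed edit never becomes a revision, in contrast to mechanisms that revert an edit after it has been saved.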

A platform to analyze "collaborative knowledge building portals" such as Wikipedia
From the abstract and paper: "We describe Knowledge Data Analysis and Processing Platform (KDAP), a programming toolkit that is easy to use and provides high-level operations for analysis of knowledge data. We propose Knowledge Markup Language (Knol-ML), a standard representation format for the data of collaborative knowledge building portals. KDAP can process the massive data of crowdsourced portals like Wikipedia and Stack Overflow efficiently. As a part of this toolkit, a data-dump of various collaborative knowledge building portals is published in Knol-ML format." "Various tools and libraries have been developed to analyze Wikipedia data. Most of these tools extract the data in real-time to answer questions. A common example of such a tool is the web-based Wikipedia API. [...] However, the downside of using a web-based API is that a particular revision has to be requested from the service, transferred over the Internet, and then stored locally in an appropriate format. [...] Apart from web-based services, tools to extract and parse the Wikipedia data dump are available [but are, in the opinion of the authors, either too limited to specific use cases or "restricted to basic data"]." (The tool was used for the "Who Writes Wikipedia? ..." paper also presented at the conference; see the review above.)

Predicting an article's quality rating based on editor collaboration patterns
From the abstract:  "We present a novel model for classifying the quality of Wikipedia articles based on structural properties of a network representation of the article’s revision history. We create revision history networks [...] where nodes correspond to individual editors of an article, and edges join the authors of consecutive revisions. Using descriptive statistics generated from these networks, along with general properties like the number of edits and article size, we predict which of six quality classes (Start, Stub, C-Class, B-Class, Good, Featured) articles belong to, attaining a classification accuracy of 49.35% on a stratified sample of articles."
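A revision history network of the kind described in the abstract can be sketched with the Python standard library. The abstract does not say whether edges are directed or how consecutive edits by the same editor are treated, so the undirected edges, the skipping of self-loops, and the particular statistics below are assumptions of this sketch rather than the authors' method.

```python
from collections import Counter

def revision_network(editors):
    """editors: editor names in revision order.

    Returns a Counter mapping each undirected edge (a frozenset of two
    editors) to the number of times those editors made consecutive revisions.
    """
    edges = Counter()
    for prev, curr in zip(editors, editors[1:]):
        if prev != curr:  # assumption: ignore an editor following themselves
            edges[frozenset((prev, curr))] += 1
    return edges

def network_stats(editors):
    """Compute a few simple descriptive statistics of the network."""
    edges = revision_network(editors)
    n, m = len(set(editors)), len(edges)
    degree = Counter()
    for edge in edges:
        for node in edge:
            degree[node] += 1
    # Density: share of possible editor pairs that are actually connected.
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return {
        "nodes": n,
        "edges": m,
        "density": density,
        "max_degree": max(degree.values(), default=0),
    }

# Hypothetical revision history of one article, oldest revision first.
history = ["alice", "bob", "alice", "carol", "carol", "bob"]
stats = network_stats(history)
```

Statistics like these, computed per article, would then serve as input features for a classifier alongside the number of edits and the article size.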

"Dynamics of Edit War Sequences in Wikipedia"
From the abstract: "we perform a systematic analysis of the conflicts present in 1,208 controversial articles of Wikipedia captured in the form of edit war sequences. We examine various key characteristics of these sequences and further use them to estimate the outcome of the edit wars. The study indicates the possibility of devising automated coordination mechanisms for handling conflicts in collaborative spaces."

"The Wikipedia Diversity Observatory: A Project to Identify and Bridge Content Gaps in Wikipedia"
From the abstract:  "... we present the Wikipedia Diversity Observatory, a project aimed to increase diversity within Wikipedia language editions. The project includes dashboards with visualizations and tools which show the gaps in terms of concepts not represented or not shared across languages. The dashboards are built on datasets generated for each of the more than 300 language editions, with features that label each article according to different categories relevant to overall content diversity." See also earlier coverage: "Wikidata calculates cultural diversity"

Other recent publications
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.''
 * Compiled by Tilman Bayer

"The Falklands/Malvinas war taken to the Wikipedia realm: a multimodal discourse analysis of cross-lingual violations of the Neutral Point of View"
From the abstract:  "Despite the copious amount of literature on neutrality in Wikipedia, little research has yet applied multimodal discourse analysis to tackle cross-lingual violations of the Neutral Point of View (NPOV). Consequently, this study draws on selected visual and textual data from the English and Spanish Wikipedia entries for the Falklands/Malvinas War to prove that the inclusion of certain images and lexemes in particular contexts can be good indicators of NPOV violations. The data set used in the research consisted of the introductory sections, table of contents and images from the two Wikipedia entries and a set of selected comments posted on their talk pages. The findings suggest that specific lexical and visual choices are ideologically motivated and go against the principles advocated by NPOV. This is further attested by the fact that some lexical choices are contested by Wikipedia editors on the talk pages ..."

"Identifying Cultural Differences through Multi-Lingual Wikipedia"
From the abstract: "We present a computational approach to learn cultural models that encode the general opinions and values of cultures from multi-lingual Wikipedia. Specifically, we assume a language is a symbol of a culture and different languages represent different cultures. Our model can automatically identify statements that potentially reflect cultural differences. Experiments on English and Chinese languages show that on a held out set of diverse topics, including marriage, gun control, democracy, etc., our model achieves high correlation with human judgements regarding within-culture values and cultural differences."

"Towards Extending Wikipedia with Bidirectional Links"
From the abstract:  "A WikiLinks system extends the Wikipedia with bidirectional links between fragments of articles. However, there were several attempts to introduce bidirectional fragment-fragment links to the Web, WikiLinks project is the first attempt to bring the new linkage mechanism directly to Wikipedia ..."

"Ripples on the web: Spreading lake information via Wikipedia"
From the article (which is lacking an abstract):  "[...] we argue that Wikipedia is an ideal venue for centralizing and improving the availability of facts about lakes. We give a brief overview of lake information on Wikipedia, how to contribute to it, and our vision for the broader dissemination of lake information. [...] Over 18,000 English Wikipedia articles exist for lakes and over 700 English Wikipedia articles exist describing aquatic processes in lakes. These articles reach a wide audience as they collectively have over 200,000 views per day."

"An encyclopedia for stock markets? Wikipedia searches and stock returns"
From the abstract:  "We present empirical evidence that collective investor behavior can be inferred from large-scale Wikipedia search data for individual-level stocks. [...] we quantify the statistical information flow between daily company-specific Wikipedia searches and stock returns for a sample of 447 stocks from 2008 to 2017. The resulting stock-wise measures on information transmission are then used as a signal within a hypothetical trading strategy."

"The Most Important Laboratory for Social Scientific and Computing Research in History"
A chapter in the "Wikipedia @ 20" book looks at how scholars have studied Wikipedia in the first two decades of its existence, coming to  "... one overarching conclusion: Wikipedia has become part of the mainstream of every social and computational research field we know of. Some areas of study, such as the analysis of human computer interaction, knowledge management, information systems, and online communication, have undergone profound shifts in the past twenty years that have been driven by Wikipedia research." While not a comprehensive literature review per se, the paper provides a bird's eye view, identifying the following main research areas: "Wikipedia as a Source of Data", "The Gender Gap", "Content Quality and Integrity", "Wikipedia and Education", "Viewership", "Organization and Governance", and "Wikipedia in the World".

See also the more extensive review of this and other chapters of the book in this Signpost issue