Wikipedia:Wikipedia Signpost/2023-11-20/Recent research

"Canceling Disputes: How Social Capital Affects the Arbitration of Disputes on Wikipedia"

 * Reviewed by Bri

This provocative paper in Law and Social Inquiry by a socio-legal scholar shows, through research mostly based on interviews with Wikipedia insiders, that the Arbitration Committee functions to cancel disputes, not to arbitrate to a compromise position, nor to reach a negotiated settlement, nor to actively promote truthful content (which one might naïvely have inferred from the name of the Committee).

Some of the arguments used in the paper are both arresting and concerning. This reviewer found the interpretive language, and the often verbatim quotes of people involved in the arbitration process — often deeply involved, including at least one described as a member of the Committee — more compelling than the light data analysis included in the paper. The author interviewed 28 editors: current and former members of the Committee, those who have been involved parties, those who have commented on cases, and those "who have knowledge of the dispute resolution process due to their long-standing involvement with Wikipedia" (not further defined).

The data analysis consisted of a breakdown of sanction severities against edit count (as a proxy for social capital). It found a negative correlation between social capital and severity, by examining edit count against light severity outcomes (admonishment) and heavy severity (up to and including site bans); see figure 2 above. The author presented two potential interpretations: one, the conventional one, that more mature and upstanding editors with deep social capital were more likely to obey norms; the other, that those editors with the social capital were free to disobey norms without severe consequences because of the wiki's empowerment of bad behavior through various means. In essence, this would validate the idea of a "cabal", or that a "too essential to be lost" mentality endows a "wiki aristocracy" capable of creating either true consensus or promoting their "version of the truth", to quote the paper (p. 15). It was the interview-based, non-data-driven part of the research, rather than the data analysis, that attempted to determine which of the two competing interpretations was correct.

The key idea in the paper is that social capital — largely built up and represented by an editor's edit count regardless of their ability to peacefully coexist with other editors — is the most important factor when it comes to arbitration. The committee's purpose is to quash disputes in order for editing to continue, not to reach a "just" outcome in some broader sense. One way the social capital is expressed and brought to bear is essentially in the opening phases of an arbitration case, called preliminary statements. If one reads between the lines of the paper, the outcome is frequently predetermined by these opening phases and all that the committee can do is go along with the crowd. In fact, it is explicitly stated — again based on evidence gathered from insiders — that cases are frequently orchestrated off-wiki precisely in order to stack the deck against the other side.

Sadly for Wikipedians, the author concludes that it is the Machiavellian use of power that holds true on Wikipedia, or in other words, that there is a cabal. One passage that comes across as especially skeptical of this structure is found on p. 17: "an editor compared the Arbitration Committee to 'riot cops' ... [who] can be compared to the 'repressive peacemakers' ... guaranteeing the level of social peace that is necessary for the Wikipedia project to unfold, even to the detriment of fairness." Then the author appears to equate the arbitration process to a trial by ordeal, a feudal concept eschewed by the West in favor of legal proceedings based on due process, further saying that

Summing up on the next page:

In other words, a system that puts the powerful above the law.

15% of datasets for fine-tuning language models use Wikipedia

 * Reviewed by Tilman Bayer

A new preprint titled "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI" presents results from "a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace [...] the data lineage of 44 of the most widely used and adopted text data collections, spanning 1800+ finetuning datasets" that have been published on platforms such as Hugging Face or GitHub. The authors make their resulting annotated dataset of annotated datasets available online, searchable via a "Data Provenance Explorer".

The paper presents various quantitative results based on this dataset. wikipedia.org was found to be the most widely used source domain, occurring in 14.9% (p. 14) or 14.6% (Table 4, p. 13) of the 1800+ datasets. This result illustrates the value Wikipedia provides for AI (although it also means, conversely, that over 85% of those datasets made no use of Wikipedia).

The paper highlights the following example of such a dataset that used Wikipedia: Supervised Dataset Example: SQuAD

Rajpurkar et al. (2016) present a prototypical supervised dataset on reading comprehension. To create the dataset, the authors take paragraph-long excerpts from 539 popular Wikipedia articles and hire crowd-source workers to generate over 100,000 questions whose answers are contained in the excerpt. For example:

Wikipedia Excerpt: In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity.

Worker-generated question: What causes precipitation to fall? Answer: Gravity

Here the authors use Wikipedia text as a basis for their data and their dataset contains 100,000 new question-answer pairs based on these texts.
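As a rough illustration of what such a supervised record can look like, the sketch below encodes the example above as a Python dictionary. The field names (`context`, `question`, `answers`, `text`) are modeled on the layout commonly used for SQuAD-style data, but they are an assumption here rather than something quoted from the paper, and the helper `answer_span` is hypothetical.

```python
# A SQuAD-style record: a Wikipedia excerpt plus a worker-written question
# whose answer is a span of the excerpt. Field names are assumed, not quoted.
record = {
    "context": (
        "In meteorology, precipitation is any product of the condensation "
        "of atmospheric water vapor that falls under gravity."
    ),
    "question": "What causes precipitation to fall?",
    "answers": [{"text": "gravity"}],
}


def answer_span(rec):
    """Return (start, end) character offsets of the first answer in the context.

    Matching is case-insensitive; raises ValueError if the answer is absent.
    """
    context = rec["context"].lower()
    answer = rec["answers"][0]["text"].lower()
    start = context.index(answer)  # str.index raises ValueError when not found
    return start, start + len(answer)


start, end = answer_span(record)
print(record["context"][start:end])
```

The point of the span lookup is that, in this kind of dataset, every answer must be recoverable verbatim from the Wikipedia excerpt, which is why the excerpt travels with each question-answer pair.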

The bulk of the paper is of less interest to Wikimedians specifically, focusing instead on general questions about the sourcing information about these datasets ("we are in the midst of a crisis in dataset provenance") and their licenses (observing e.g. "sharp divides in composition and focus of commercially open vs. closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data"). An extensive "Legal Discussion" section acknowledges that the paper leaves out "several important related questions on the use of copyrighted works to create supervised datasets and on the copyrightability of training datasets." In particular, it does not examine whether the Wikipedia-based datasets satisfy the requirements of Wikipedia's CC BY-SA license. Regarding the use of CC-licensed datasets in AI in general, the authors note: "One of the challenges is that licenses like the Apache and the Creative Commons outline restrictions related to 'derivative' or 'adapted works' but it remains unclear if a trained model should be classified as a derivative work." They also remind readers that "In the U.S., the fair use exception may allow models to be trained on protected works," although "the application of fair use in the context is still evolving and several of these issues are currently being litigated".

(The datasets examined in the paper are to be distinguished from the much larger unlabeled text corpuses used for the initial unsupervised training of large language models (LLMs). There, Wikipedia is also known to have been used, alongside other sources such as Common Crawl, e.g. for the GPT-3 family that formed the basis of ChatGPT.)

Wikipedia biggest "loser" in recent Google Search update
A blog post by Search Engine Optimization firm Amsive (recommended as "extensive (and fascinating) research" in a recent The Verge feature about the SEO industry) analyzes the impact of an August 2023 "core update" by Google Search. The post explains that "Google [...] announced a new signal in its December updates to the Search Quality Rater guidelines: “E” for experience. The “E” is a new member of the E-A-T family, now called E-E-A-T, and stands for experience, expertise, authoritativeness, and trustworthiness. According to Google, the amount of E-E-A-T required for a page or site to be considered high-quality depends on the nature of the content and the extent to which it can cause harm to users. [...] Search Quality Raters have been working off this new version of the Quality Guidelines to review the quality of Google’s results and evaluate E-E-A-T for 9 months now, giving Google plenty of time to update its algorithms with the feedback provided by quality raters." The analysis of Google's August update focuses on "the list of the top 1,000 winners and losers in both absolute and percentage terms, using Sistrix Visibility Index scores using the Google.com U.S. index." (Sistrix' index, which is generally not freely available, is calculated based on search results for one million keywords, weighted by search volume and estimated click probability, and aggregated by domain.)
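The description of the Sistrix index can be made concrete with a toy calculation. The following is a minimal sketch, assuming a simple formula (search volume times an estimated click probability for the ranking position, summed per domain). Sistrix's actual formula, keyword set and click model are not reproduced here, and all keywords, volumes and click probabilities below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical click probabilities by ranking position (assumed values).
CLICK_PROBABILITY = {1: 0.30, 2: 0.15, 3: 0.10}

# Hypothetical search results: keyword -> (monthly search volume, ranked domains).
RESULTS = {
    "precipitation": (1000, ["wikipedia.org", "britannica.com", "wiktionary.org"]),
    "kimchi": (500, ["britannica.com", "wikipedia.org"]),
}


def visibility_by_domain(results, click_probability):
    """Sum volume * click probability over every keyword, grouped by domain."""
    scores = defaultdict(float)
    for volume, domains in results.values():
        for position, domain in enumerate(domains, start=1):
            # Positions without a known click probability contribute nothing.
            scores[domain] += volume * click_probability.get(position, 0.0)
    return dict(scores)


scores = visibility_by_domain(RESULTS, CLICK_PROBABILITY)
for domain, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{domain}: {score:.1f}")
```

Under these assumptions, a domain's score rises when it ranks higher, when it ranks for more keywords, or when the keywords it ranks for are searched more often, so a single aggregate number can move for several different reasons.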

wikipedia.org tops the "Absolute Losers" list for Google's August 2023 update, with a larger score decrease than youtube.com (#2) and amazon.com (#3). Still, in relative terms, Wikipedia's score change of -6.75% doesn't even make the "Percent Losers" list of the 250 sites with the biggest percentage declines. And in better news for Wikimedians, wiktionary.org ranked #3 on the "Absolute Winners" list (right before britannica.com at #4). wikivoyage.org also gained, reaching #38 on the same list (with an index increase of 37.38% in relative terms). What's more, in Amsive's similar analysis of Google's preceding March 2023 core update, which had been "highly anticipated given the significant changes affecting organic search" in the preceding months (of which the E-E-A-T announcement was just one), wikipedia.org had conversely topped the "Absolute Winners" list, with a 10.16% relative increase. Then again, back then wiktionary.org topped the March 2023 update's "Absolute Losers" list ahead of urbandictionary.com (#2) and thefreedictionary.com (#3), although both had a larger relative decrease than Wiktionary's -22.66%. This may indicate that such changes are merely temporary fluctuations over the long timeline of Google Search. (And indeed Google has since conducted two further "core updates" for October and November 2023, which Amsive does not appear to have analyzed yet.) Still, these results illustrate that Wikipedia's prominence in search engine results is by no means guaranteed or static.
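The difference between the "absolute" and "percent" lists comes down to simple arithmetic: a domain with a very large index can lose the most points while its relative decline stays modest. The sketch below uses made-up index scores (not Sistrix data) and hypothetical domain names to show how the two rankings can disagree. The large site's numbers are chosen so that its relative change works out to -6.75%.

```python
# Hypothetical visibility scores before and after an update (invented numbers).
BEFORE = {"bigsite.example": 2000.0, "midsite.example": 100.0, "smallsite.example": 10.0}
AFTER = {"bigsite.example": 1865.0, "midsite.example": 80.0, "smallsite.example": 5.0}


def changes(before, after):
    """Return {domain: (absolute change, percentage change)}."""
    return {
        domain: (after[domain] - old, (after[domain] - old) / old * 100.0)
        for domain, old in before.items()
    }


deltas = changes(BEFORE, AFTER)
# Most negative first: the "losers" ordering for each measure.
absolute_losers = sorted(deltas, key=lambda d: deltas[d][0])
percent_losers = sorted(deltas, key=lambda d: deltas[d][1])

print("Absolute losers:", absolute_losers)
print("Percent losers: ", percent_losers)
```

In this toy data the largest site heads the absolute list and comes last on the percentage list, which mirrors how a site can top "Absolute Losers" without appearing among the biggest percentage declines.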

Briefly

 * See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
 * Until December 15, the Wikimedia Foundation is inviting applications for its Research Fund grants of up to $50k, "particularly encourag[ing] research studies on medium to small size languages and communities, as well as in low resourced languages and projects." See also our coverage of previous rounds.

Other recent publications
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.''
 * Compiled by Ca and Tilman Bayer

"Evaluation of Accuracy and Adequacy of Kimchi Information in Major Foreign Online Encyclopedias"
From the abstract: "In this study, we analyzed the content and quality of kimchi information in major foreign online encyclopedias, such as Baidu Baike, Encyclopædia Britannica, Citizendium, and Wikipedia. Our results revealed that the kimchi information provided by these encyclopedias was often inaccurate or inadequate, despite kimchi being a fundamental part of Korean cuisine. The most common inaccuracies were related to the definition and origins of kimchi and its ingredients and preparation methods."

"Speech Wikimedia: A 77 Language Multilingual Speech Dataset"
Abstract: "The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models."

"WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections"
From the "Conclusion" section: "We created WIKITABLET, a dataset that contains Wikipedia article sections and their corresponding tabular data and various metadata. WIKITABLET contains millions of instances covering a broad range of topics and kinds of generation tasks. Our manual evaluation showed that humans are unable to differentiate the [original Wikipedia text] and model generations [by transformer models that the authors trained specifically for this task]. However, qualitative analysis showed that our models sometimes struggle with coherence and factuality, suggesting several directions for future work." The authors of this 2021 paper note that they "did not experiment with pretrained models [such as the GPT series] because they typically use the entirety of Wikipedia, which would presumably overlap with our test set."

"Using natural language generation to bootstrap missing Wikipedia articles: A human-centric perspective"
From the abstract: "Recent advances in machine learning [this sentence appears to have been written in 2020] have made it possible to train NLG [natural language generation] systems that seek to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with Wikipedia readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in 14 under-resourced Wikipedia language versions, which displays structured data from the Wikidata knowledge base on empty Wikipedia pages. We train a neural network to generate an introductory sentence from the Wikidata triples shown by the ArticlePlaceholder, and explore how Wikipedia users engage with it. The evaluation, which includes an automatic, a judgement-based, and a task-based component, shows that the summary sentences score well in terms of perceived fluency and appropriateness for Wikipedia, and can help editors bootstrap new articles." The paper, published in 2022, does not yet mention the related Abstract Wikipedia project.

"XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages"
From the abstract: "Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only where English reference articles are summarized to generate English Wikipedia pages. But, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective in solving this problem. Hence, in this work, we propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary." The paper's "Related work" section provides a useful literature overview, noting e.g. that "Automated generation of Wikipedia text has been a problem of interest for the past 5–6 years. Initial efforts in the fact-to-text (F2T) line of work focused on generating short text, typically the first sentence of Wikipedia pages using structured fact tuples. [...] Seq-2-seq neural methods [including various LSTM architectures and efforts based on pretrained transformers] have been popularly used for F2T. [...] Besides generating short Wikipedia text, there have also been efforts to generate Wikipedia articles by summarizing long sequences. [...] For all of these datasets, the generated text is either the full Wikipedia article or text for a specific section." The authors note that most of these efforts have been English-only.

See also our 2018(!) coverage of various fact-to-text efforts, going back to 2016: "Readers prefer summaries written by a neural network over those by Wikipedians 40% of the time — but it still suffers from hallucinations"