Wikipedia:Wikipedia Signpost/2024-07-04/Recent research

Report by conservative think-tank presents ample quantitative evidence for "mild to moderate" "left-leaning bias" on Wikipedia
A paper titled "Is Wikipedia Politically Biased?" answers that question with a qualified yes: "[...] this report measures the sentiment and emotion with which political terms are used in [English] Wikipedia articles, finding that Wikipedia entries are more likely to attach negative sentiment to terms associated with a right-leaning political orientation than to left-leaning terms. Moreover, terms that suggest a right-wing political stance are more frequently connected with emotions of anger and disgust than those that suggest a left-wing stance. Conversely, terms associated with left-leaning ideology are more frequently linked with the emotion of joy than are right-leaning terms.

Our findings suggest that Wikipedia is not entirely living up to its neutral point of view policy, which aims to ensure that content is presented in an unbiased and balanced manner."

The author (David Rozado, an associate professor at Otago Polytechnic) has published ample peer-reviewed research on related matters before, some of which was featured e.g. in The Guardian and The New York Times. In contrast, the present report is not peer-reviewed and was not posted in an academic venue, unlike most of the research we usually cover here. Rather, it was published (and possibly commissioned) by the Manhattan Institute, a conservative US think tank, which presumably found its results not too objectionable. (Also, some – broken – URLs in the PDF suggest that Manhattan Institute staff members were involved in the writing of the paper.) Still, the report shows an effort to adhere to various standards of academic research publications, including some fairly detailed descriptions of the methods and data used. It is worth taking more seriously than, for example, another recent report that alleged a different form of political bias on Wikipedia, which had likewise been commissioned by an advocacy organization and authored by an academic researcher, but was met with severe criticism by the Wikimedia Foundation (who called it out for "unsubstantiated claims of bias") and volunteer editors (see prior Signpost coverage).

That isn't to say that there can't be some questions about the validity of Rozado's results, and in particular about how to interpret them. But let's first go through the paper's methods and data sources in more detail.

Determining the sentiment and emotion in Wikipedia's coverage
The report's main results regarding Wikipedia are obtained as follows:  "We first gather a set of target terms (N=1,628) with political connotations (e.g., names of recent U.S. presidents, U.S. congressmembers, U.S. Supreme Court justices, or prime ministers of Western countries) from external sources. We then identify all mentions in English-language Wikipedia articles of those terms.

We then extract the paragraphs in which those terms occur to provide the context in which the target terms are used and feed a random sample of those text snippets to an LLM (OpenAI’s gpt-3.5-turbo), which annotates the sentiment/emotion with which the target term is used in the snippet. To our knowledge, this is the first analysis of political bias in Wikipedia content using modern LLMs for annotation of sentiment/emotion." The sentiment classification rates the mention of a term as negative, neutral or positive. (For the purpose of forming averages this is converted into a quantitative scale from -1 to +1.) See the end of this review for some concrete examples from the paper's published dataset.
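The conversion from labels to average scores described above can be sketched as follows. This is a minimal illustration and not the report's actual code: the mapping of negative/neutral/positive to -1/0/+1 is an assumption consistent with the stated scale, and the function name and sample annotations are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Assumed mapping onto the -1..+1 scale (the report only states the range).
SENTIMENT_SCORE = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}


def average_sentiment(annotations):
    """Return the mean numeric sentiment per term.

    `annotations` is an iterable of (term, label) pairs,
    one pair per annotated text snippet.
    """
    scores = defaultdict(list)
    for term, label in annotations:
        scores[term].append(SENTIMENT_SCORE[label])
    return {term: mean(values) for term, values in scores.items()}


# Hypothetical annotations, for illustration only (not taken from the dataset).
sample = [
    ("Term A", "positive"),
    ("Term A", "neutral"),
    ("Term B", "negative"),
    ("Term B", "negative"),
    ("Term B", "neutral"),
    ("Term B", "positive"),
]
averages = average_sentiment(sample)
```

Averaging per term first, and only then comparing groups of terms, keeps frequently mentioned terms from dominating a group's score; whether the report weights terms this way is not something this sketch can confirm.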

The emotion classification uses "Ekman’s six basic emotions (anger, disgust, fear, joy, sadness, and surprise) plus neutral."

The annotation method used appears to be an effort to avoid the shortcomings of popular existing sentiment analysis techniques, which often only rate the overall emotional stance of a given text without determining whether it actually applies to a specific entity mentioned in it (or in some cases even fail to handle negations, e.g. by classifying "I am not happy" as a positive emotion). Rozado justifies the "decision to use automated annotation" (which presumably yielded considerable cost savings, also by resorting to OpenAI's older GPT 3.5 model rather than the more powerful but more expensive GPT-4 API released in March 2023) citing "recent evidence showing how top-of-the-rank LLMs outperform crowd workers for text-annotation tasks such as stance detection." This is indeed becoming a more widely used choice for text classification. But Rozado appears to have skipped the usual step of evaluating the accuracy of this automated method (and possibly improving the prompts it used) against a gold standard sample from (human) expert raters.
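To make concrete what such an evaluation step usually involves, here is a minimal sketch that compares automated labels against human gold-standard labels using plain accuracy and Cohen's kappa (agreement corrected for chance). The function name and both label lists are hypothetical; nothing here comes from the report or its dataset.

```python
from collections import Counter


def agreement(gold, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        gold_counts[label] * pred_counts[label] for label in gold_counts
    ) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used a single identical label
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical labels, for illustration only.
human = ["negative", "neutral", "neutral", "positive", "neutral", "negative"]
model = ["negative", "neutral", "positive", "positive", "neutral", "neutral"]
accuracy, kappa = agreement(human, model)
```

Kappa is reported alongside accuracy because a classifier that labels almost everything "neutral" can reach high accuracy on a mostly neutral corpus while agreeing with human raters little more than chance would.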

Selecting topics to examine for bias
As for the selection of terms whose Wikipedia coverage to annotate with this classifier, Rozado does a lot of due diligence to avoid cherry-picking: "To reduce the degrees of freedom of our analysis, we mostly use external sources of terms [including Wikipedia itself, e.g. its list of members of the 11th US Congress] to conceptualize a political category into left- and right-leaning terms, as well as to choose the set of terms to include in each category." This addresses an important source of researcher bias.

Overall, the study arrives at 12 different groups of such terms:
 * 8 of these refer to people (e.g. US presidents, US senators, UK members of parliament, US journalists).
 * Two are about organizations (US think tanks and media organizations).
 * The other two groups contain "Terms that describe political orientation", i.e. expressions that carry a left-leaning or right-leaning meaning themselves:
   * 18 "political leanings" (where "Rightists" receives the lowest average sentiment and "Left winger" the highest), and
   * 21 "extreme political ideologies" (where "Ultraconservative" scores lowest and "radical-left" has the highest – but still slightly negative – average sentiment)

What is "left-leaning" and "right-leaning"?
As discussed, Rozado's methods for generating these lists of people and organizations seem reasonably transparent and objective. It gets a bit murkier when it comes to splitting them into "left-leaning" and "right-leaning", where the chosen methods remain unclear and/or questionable in some cases. Of course there is a natural choice available for US Congress members, where the confines of the US two-party system mean that the left-right spectrum can be easily mapped to Democrats vs. Republicans (disregarding a small number of independents or libertarians).

In other cases, Rozado was able to use external data about political leanings, e.g. "a list of politically aligned U.S.-based journalists" from Politico. There may be questions about construct validity here (e.g. it classifies Glenn Greenwald or Andrew Sullivan as "journalists with the left"), but at least this data is transparent and determined by a source not invested in the present paper's findings.

But for example the list of UK MPs used contains politicians from 14 different parties (plus independents). Even if one were to confine the left vs. right labels to the two largest groups in the UK House of Commons (Tories vs. Labour and Co-operative Party, which appears to have been the author's choice judging from Figure 5), the presence of a substantial number of parliamentarians from other parties to the left or right of those would make the validity of this binary score more questionable than in the US case. Rozado appears to acknowledge a related potential issue in a side remark when trying to offer an explanation for one of the paper's negative results (no bias) in this case: "The disparity of sentiment associations in Wikipedia articles between U.S. Congressmembers and U.K. MPs based on their political affiliation may be due in part to the higher level of polarization in the U.S. compared to the U.K."

This kind of question becomes even more complicated for the "Leaders of Western Countries" list (where Tony Abbott scored the most negative average sentiment, and José Luis Rodríguez Zapatero and Scott Morrison appear to be in a tie for the most positive average sentiment). Most of these countries do not have a two-party system either. Sure, their leaders usually (like in the UK case) hail from one of the two largest parties, one of which is more to the left and the other more to the right. But it certainly seems to matter for the purpose of Rozado's research question whether that major party is more moderate (center-left or center-right, with other parties between it and the far left or far right) or more radical (i.e. extending all the way to the far-left or far-right spectrum of elected politicians).

What's more, the analysis for this last group compares political orientations across multiple countries. Which brings us to a problem that Wikipedia's Jimmy Wales had already pointed to back in 2006 in response to a conservative US blogger who had argued that there was "a liberal bias in many hot-button topic entries" on English Wikipedia:  "The Wikipedia community is very diverse, from liberal to conservative to libertarian and beyond. If averages mattered, and due to the nature of the wiki software (no voting) they almost certainly don't, I would say that the Wikipedia community is slightly more liberal than the U.S. population on average, because we are global and the international community of English speakers is slightly more liberal than the U.S. population. ... The idea that neutrality can only be achieved if we have some exact demographic matchup to [the] United States of America is preposterous." We already discussed this issue in our earlier reviews of a notable series of papers by Greenstein and Zhu (see e.g.: "Language analysis finds Wikipedia's political bias moving from left to right", 2012), which had relied on a US-centric method of defining left-leaning and right-leaning (namely, a corpus derived from the US Congressional Record). Those studies form a large part of what Rozado cites as "[a] substantial body of literature [that]—albeit with some exceptions—has highlighted a perceived bias in Wikipedia content in favor of left-leaning perspectives." (The cited exception is a paper that had found "a small to medium size coverage bias against [members of parliament] from the center-left parties in Germany and in France", and identified patterns of "partisan contributions" as a plausible cause.) Similarly, 8 out of the 10 groups of people and organizations analyzed in Rozado's study are from the US (the two exceptions being the aforementioned lists of UK MPs and leaders of Western countries).

In other words, one potential reason for the disparities found by Rozado might simply be that he is measuring an international encyclopedia with a (largely) national yardstick of fairness. This shouldn't let us dismiss his findings too easily. But it is a bit disappointing that this possibility is nowhere addressed in the paper, even though Rozado diligently discusses some other potential limitations of the results. E.g. he notes that "some research has suggested that conservatives themselves are more prone to negative emotions and more sensitive to threats than liberals", but points out that the general validity of those research results remains doubtful.

Another limitation is that a simple binary left vs. right classification might be hiding factors that can shed further light on bias findings. Even in the US with its two-party system, political scientists and analysts have long moved to less simplistic measures of political orientations. A widely used one is the NOMINATE method, which assigns members of the US Congress continuous scores based on their detailed voting record, one of which corresponds to the left-right spectrum as traditionally understood. One finding based on that measure that seems relevant in the context of the present study is the (widely discussed but itself controversial) asymmetric polarization thesis, which argues that "Polarization among U.S. legislators is asymmetric, as it has primarily been driven by a substantial rightward shift among congressional Republicans since the 1970s, alongside a much smaller leftward shift among congressional Democrats" (as summarized in the linked Wikipedia article). If, for example, higher polarization was associated with negative sentiments, this could be a potential explanation for Rozado's results. Again, this has to remain speculative, but it seems another notable omission in the paper's discussion of limitations.

What does "bias" mean here?
A fundamental problem of this study, which, to be fair, it shares with much fairness and bias research (in particular on Wikipedia's gender gap, where many studies similarly focus on binary comparisons that are likely to successfully appeal to an intuitive sense of fairness) consists of justifying its answers to the following two basic questions:


 * 1) What would be a perfectly fair baseline, a result that makes us confident to call Wikipedia unbiased?
 * 2) If there are deviations from that baseline (often labeled disparities, gaps or biases), what are the reasons for that – can we confidently assume they were caused by Wikipedia itself (e.g. demographic imbalances in Wikipedia's editorship), or are they more plausibly attributed to external factors?

Regarding 1 (defining a baseline of unbiasedness), Rozado simply assumes that this should imply statistically indistinguishable levels of average sentiment between left and right-leaning terms. However, as cautioned by one leading scholar on quantitative measures of bias, "the 'one true fairness definition' is a wild goose chase" – there are often multiple different definitions available that can all be justified on ethical grounds, and are often contradictory. Above, we already alluded to two potentially diverging notions of political unbiasedness for Wikipedia (using an international instead of US metric for left vs right leaning, and taking into account polarization levels for politicians).
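One way to operationalize "statistically indistinguishable levels of average sentiment" is a two-sided permutation test on the difference in group means, sketched below. This is a swapped-in technique chosen for illustration; it is not necessarily the test used in the report, and the function name, parameters and sample values are all hypothetical.

```python
import random
from statistics import mean


def permutation_test(left, right, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in mean sentiment.

    Returns (observed_difference, p_value), where the difference is
    mean(left) - mean(right).
    """
    rng = random.Random(seed)
    observed = mean(left) - mean(right)
    pooled = list(left) + list(right)
    n_left = len(left)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_left]) - mean(pooled[n_left:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical per-term average sentiments, for illustration only.
left_terms = [0.10, 0.05, 0.20, 0.15, 0.00, 0.12]
right_terms = [0.02, -0.05, 0.08, 0.00, -0.10, 0.04]
difference, p_value = permutation_test(left_terms, right_terms)
```

Note that such a test only addresses whether a gap exists under this particular symmetric baseline; it says nothing about whether that baseline is the right notion of fairness, which is the point at issue here.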

But yet another question, highly relevant for Wikipedians interested in addressing the potential problems reported in this paper, is how much its definition lines up with Wikipedia's own definition of neutrality. Rozado clearly thinks that it does: "Wikipedia’s neutral point of view (NPOV) policy aims for articles in Wikipedia to be written in an impartial and unbiased tone. Our results suggest that Wikipedia’s NPOV policy is not achieving its stated goal of political-viewpoint neutrality in Wikipedia articles." WP:NPOV indeed calls for avoiding subjective language and expressing judgments and opinions in Wikipedia's own voice, and Rozado's findings about the presence of non-neutral sentiments and emotions in Wikipedia articles are of some concern in that regard. However, that is not the core definition of NPOV. Rather, it refers to "representing fairly, proportionately, and, as far as possible, without editorial bias, all the significant views that have been published by reliable sources on a topic." What if the coverage of the terms examined by Rozado (politicians, etc.) in those reliable sources, in their aggregate, were also biased in the sense of Rozado's definition? US progressives might be inclined to invoke the snarky dictum "reality has a liberal bias" by comedian Stephen Colbert. Of course, conservatives might object that Wikipedia's definition of reliable sources (having "a reputation for fact-checking and accuracy") is itself biased, or applied in a biased way by Wikipedians. For some of these conservatives (at least those that are not also conservative feminists) it may be instructive to compare examinations of Wikipedia's gender gaps, which frequently focus on specific groups of notable people like in Rozado's study. And like him, they often implicitly assume a baseline of unbiasedness that implies perfect symmetry in Wikipedia's coverage – i.e. the absence of gaps or disparities.
Wikipedians often object that this is in tension with the aforementioned requirement to reflect coverage in reliable sources. For example, Wikipedia's list of Fields medalists (the "Nobel prize of Mathematics") is 97% male – not because of Wikipedia editors' biases against women, but because of a severe gender imbalance in the field of mathematics that is only changing slowly, i.e. factors outside Wikipedia's influence.

All this brings us to question 2. above (causality). While Rozado uses carefully couched language in this regard ("suggests" etc, e.g. "These trends constitute suggestive evidence of political bias embedded in Wikipedia articles"), such qualifications are unsurprisingly absent in much of the media coverage of this study (see also this issue's In the media). For example, the conservative magazine The American Spectator titled its article about the paper "Now We've Got Proof that Wikipedia is Biased."

Commendably, the paper is accompanied by a published dataset, consisting of the analyzed Wikipedia text snippets together with the mentioned term and the sentiment or emotion identified by the automated annotation. For illustration, below are the sentiment ratings for mentions of the Yankee Institute for Public Policy (the last term in the dataset, as a non-cherry-picked example), with the term bolded:

Briefly

 * Wiki Workshop 2024: The eleventh annual Wiki Workshop (formerly part of The Web Conference), organized by the Wikimedia Foundation's research department, took place as an online event on June 20. Extended abstracts (non-archival), video presentations, and notes from the event are now available.

Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

How English Wikipedia mediates East Asian historical disputes with Habermasian communicative rationality
From the abstract:  "We compare the portrayals of Balhae, an ancient kingdom with contested contexts between [South Korea and China]. By comparing Chinese, Korean, and English Wikipedia entries on Balhae, we identify differences in narrative construction and framing. Employing Habermas’s typology of human action, we scrutinize related talk pages on English Wikipedia to examine the strategic actions multinational contributors employ to shape historical representation. This exploration reveals the dual role of online platforms in both amplifying and mediating historical disputes. While Wikipedia’s policies promote rational discourse, our findings indicate that contributors often vacillate between strategic and communicative actions. Nonetheless, the resulting article approximates Habermasian ideals of communicative rationality." From the paper:  "The English Wikipedia presents Balhae as a multi-ethnic kingdom, refraining from emphasizing the dominance of a single tribe. In comparison to the two aforementioned excerpts [from Chinese and Korean Wikipedia], the lead section of the English Wikipedia concentrates more on factual aspects of history, thus excluding descriptions that might entail divergent interpretations. In other words, this account of Balhae has thus far proven acceptable to a majority of Wikipedians from diverse backgrounds. [...] Compared to other language versions, the English Wikipedia forthrightly acknowledges the potential disputes regarding Balhae's origin, ethnic makeup, and territorial boundaries, paving the way for an open and transparent exploration of these contested historical subjects. The separate 'Balhae controversies' entry is dedicated to unpacking the contentious issues. In essence, the English article adopts a more encyclopedic tone, aligning closely with Wikipedia's mission of providing information without imposing a certain perspective." (See also excerpts)

Facebook/Meta's "No Language Left Behind" translation model used on Wikipedia
From the abstract of this publication by a large group of researchers (most of them affiliated with Meta AI): "Focusing on improving the translation qualities of a relatively small group of high-resource languages comes at the expense of directing research attention to low-resource languages, exacerbating digital inequities in the long run. To break this pattern, here we introduce No Language Left Behind—a single massively multilingual model that leverages transfer learning across languages. [...] Compared with the previous state-of-the-art models, our model achieves an average of 44% improvement in translation quality as measured by BLEU. By demonstrating how to scale NMT [neural machine translation] to 200 languages and making all contributions in this effort freely available for non-commercial use, our work lays important groundwork for the development of a universal translation system."

 "Four months after the launch of NLLB-200 [in 2022], Wikimedia reported that our model was the third most used machine translation engine used by Wikipedia editors (accounting for 3.8% of all published translations) (https://web.archive.org/web/20221107181300/https://nbviewer.org/github/wikimedia-research/machine-translation-service-analysis-2022/blob/main/mt_service_comparison_Sept2022_update.ipynb). Compared with other machine translation services and across all languages, articles translated with NLLB-200 has the lowest percentage of deletion (0.13%) and highest percentage of translation modification kept under 10%."

"Which Nigerian-Pidgin does Generative AI speak?" – only the BBC's, not Wikipedia's
From the abstract:  "Naija is the Nigerian-Pidgin spoken by approx. 120M speakers in Nigeria [...]. Although it has mainly been a spoken language until recently, there are currently two written genres (BBC and Wikipedia) in Naija. Through statistical analyses and Machine Translation experiments, we prove that these two genres do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and Generative AI operates only based on Naija written in the BBC genre. In other words, Naija written in Wikipedia genre is not represented in Generative AI." The paper's findings are consistent with an analysis by the Wikimedia Foundation's research department that compared the number of Wikipedia articles to the number of speakers for the top 20 most-spoken languages, where Naija stood out as one of the most underrepresented.

"[A] surprising tension between Wikipedia's principle of safeguarding against self-promotion and the scholarly norm of 'due credit'"
From the abstract: "Although Wikipedia offers guidelines for determining when a scientist qualifies for their own article, it currently lacks guidance regarding whether a scientist should be acknowledged in articles related to the innovation processes to which they have contributed. To explore how Wikipedia addresses this issue of scientific "micro-notability", we introduce a digital method called Name Edit Analysis, enabling us to quantitatively and qualitatively trace mentions of scientists within Wikipedia's articles. We study two CRISPR-related Wikipedia articles and find dynamic negotiations of micro-notability as well as a surprising tension between Wikipedia’s principle of safeguarding against self-promotion and the scholarly norm of “due credit.” To reconcile this tension, we propose that Wikipedians and scientists collaborate to establish specific micro-notability guidelines that acknowledge scientific contributions while preventing excessive self-promotion." See also coverage of a different paper that likewise analyzed Wikipedia's coverage of CRISPR: "Wikipedia as a tool for contemporary history of science: A case study on CRISPR"

"How article category in Wikipedia determines the heterogeneity of its editors"
From the abstract: " [...] the quality of Wikipedia articles rises with the number of editors per article as well as a greater diversity among them. Here, we address a not yet documented potential threat to those preconditions: self-selection of Wikipedia editors to articles. Specifically, we expected articles with a clear-cut link to a specific country (e.g., about its highest mountain, "national" article category) to attract a larger proportion of editors of that nationality when compared to articles without any specific link to that country (e.g., "gravity", "universal" article category), whereas articles with a link to several countries (e.g., "United Nations", "international" article category) should fall in between. Across several language versions, hundreds of different articles, and hundreds of thousands of editors, we find the expected effect [...]"

"What do they make us see:" The "cultural bias" of GLAMs is worse on Wikidata
From the abstract: "Large cultural heritage datasets from museum collections tend to be biased and demonstrate omissions that result from a series of decisions at various stages of the collection construction. The purpose of this study is to apply a set of ethical criteria to compare the level of bias of six online databases produced by two major art museums, identifying the most biased and the least biased databases. [...] For most variables the online system database is more balanced and ethical than the API dataset and Wikidata item collection of the two museums."