Wikipedia:Wikipedia Signpost/2024-03-02/Recent research

Images on Wikipedia "amplify gender bias" compared to article text

 * Reviewed by Bri and Tilman Bayer

A Nature paper titled "Online Images Amplify Gender Bias" studies "gender associations of 3,495 social categories (such as 'nurse' or 'banker') in more than one million images from Google, [English] Wikipedia and Internet Movie Database (IMDb), and in billions of words from these platforms". As summarized by Neuroscience News: "This pioneering study indicates that online images not only display a stronger bias towards men but also leave a more lasting psychological impact compared to text, with effects still notable after three days."

This was a two-part research paper in which the authors:
 * examined text and images from the Internet for gender bias
 * examined the responses of experimental subjects who were exposed to text and images from the Internet

While the paper's main analyses focus on Google, the authors replicated their findings with text and image data from Wikipedia and IMDb.

Gender bias in text and images
For the first part, images were retrieved from Google Search results for 3,495 social categories drawn from WordNet, a canonical database of categories in the English language. These categories include occupations—such as doctor, lawyer and carpenter—and generic social roles, such as neighbour, friend and colleague. Faces extracted from these images (using the OpenCV library) were tagged with gender by workers recruited via Amazon Mechanical Turk. The reliability of tagging was validated against the self-identified gender from a "canonical set" of celebrity portraits culled from IMDb and Wikipedia.

For the replication analysis with English Wikipedia (relegated mainly to the paper's supplement), an analogous set of images was derived using another existing Wikipedia image dataset, whose text descriptions yielded matches for 1,523 of the 3,495 WordNet-derived social categories ("For example, we retrieve the Wikipedia article with the title ‘Physician’ for the social category physician: https://en.wikipedia.org/wiki/Physician").

To measure gender bias in a corpus of text from e.g. Google News, the authors use word embeddings (a computational natural language processing technique) trained on that corpus. Specifically, their method (adapted from a 2019 paper) assigns a number to each category (e.g. doctor, lawyer or carpenter) that "captures the extent to which [the word for this category] co-occurs with textual references to either women or men [in the corpus]. This method allows us to position each category along a −1 (female) to 1 (male) axis, such that categories closer to −1 are more commonly associated with women and those closer to 1 are more commonly associated with men [in the corpus]. [...] The category ‘aunt’, for instance, falls close to −1 along this scale, whereas the category ‘uncle’ falls close to 1 along this scale." The authors interpret any deviation of this "gender association" value from 0 as evidence of "gender bias" for a particular category. Figure 1 in the paper illustrates this in the case of Google News for a list of occupations. There, the three categories with the largest male bias appear to be "football player", "philosopher", and "mechanic", and the three categories with the largest female bias are "cosmetologist", "ballet dancer", and "hairstylist". In the figure, the category closest to being unbiased (0) in the Google News text was "programmer". Overall though, "texts from Google News exhibit [only] a relatively weak bias towards male representation", with an average score of 0.03.
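The idea of placing a category on a −1 (female) to 1 (male) axis can be illustrated with a minimal sketch. It is not the authors' exact procedure: it assumes the axis is the normalised difference between the mean "male" and mean "female" pole vectors, and that a category's score is the cosine of its vector with that axis. The three-dimensional vectors and the pole word lists are hypothetical toy values invented for illustration; real embeddings would come from a model trained on the corpus.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def gender_association(word_vec, female_vecs, male_vecs):
    """Cosine of a word vector with a female(-1)-to-male(+1) axis.

    Assumption of this sketch: the axis is the normalised difference
    between the mean male pole and the mean female pole.
    """
    male_pole = mean_vec([unit(v) for v in male_vecs])
    female_pole = mean_vec([unit(v) for v in female_vecs])
    axis = unit([m - f for m, f in zip(male_pole, female_pole)])
    return sum(a * b for a, b in zip(unit(word_vec), axis))

# Hypothetical toy embeddings (not taken from any real model).
toy = {
    "woman": [-1.0, 0.2, 0.1],
    "she": [-0.9, 0.1, 0.3],
    "man": [1.0, 0.2, 0.1],
    "he": [0.9, 0.1, 0.3],
    "aunt": [-0.8, 0.3, 0.2],
    "uncle": [0.8, 0.3, 0.2],
    "programmer": [0.0, 0.9, 0.4],
}

female = [toy["woman"], toy["she"]]
male = [toy["man"], toy["he"]]
scores = {
    word: gender_association(toy[word], female, male)
    for word in ("aunt", "uncle", "programmer")
}
for word, score in scores.items():
    print(f"{word}: {score:+.2f}")
```

With these toy vectors, "aunt" lands on the negative side of the axis, "uncle" on the positive side, and "programmer" at zero, mirroring the pattern described above.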

In the case of Wikipedia text, this gender association of a particular WordNet category was determined using a pre-trained word embedding model of Wikipedia available in Python’s gensim package, which was built using the GloVe method to analyze a 2014 corpus of 5.6 billion words from Wikipedia. Somewhat concerningly, this description by the authors is inconsistent with the gensim documentation, which states that this 5.6 billion token corpus was not based on Wikipedia alone, but on "Wikipedia 2014 + Gigaword". According to the original GloVe paper, "Gigaword 5 [...] has 4.3 billion tokens", meaning that it would form a much bigger part of that corpus than Wikipedia. (The GloVe authors also observed that Wikipedia's entries are updated to assimilate new knowledge, whereas Gigaword is a fixed news repository with outdated and possibly incorrect information; the corpus contains newswire text dating back to 1994.)

In other words, the Nature study's conclusions about Wikipedia text might not be valid. Assuming they are though, they might seem vaguely reassuring for Wikipedians (and perhaps somewhat in contrast with earlier research about textual gender bias on Wikipedia): Using several different variants of the model (with different word embedding dimensions), respectively, 57% (50D), 59% (100D), 57.6% (200D), and 54% (300D) of categories [are] male-skewed, with an average strength of gender association below 0.06 (recall that the authors describe the corresponding value of 0.03 for Google News as a relatively weak bias). The story is different for images, though: "images over Wikipedia are significantly skewed toward male representation. 80% of categories are male-skewed according to images over Wikipedia (p < 0.0001, proportion test, n = 495, two-tailed). [...] Including all 1,244 categories in our analysis continues to show a strong bias toward male representation in Wikipedia images (with 68% of faces being male, p < 0.00001). [...] Wikipedia content can appear to be neutral in its gender associations if one focuses only on text, whereas examining Wikipedia images from the same articles can reveal a different reality, with evidence of a strong bias toward male representation and a stronger bias toward more salient gender associations in general."
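To give a sense of how a figure like "80% of categories are male-skewed (n = 495)" relates to the quoted p-value, here is a minimal sketch of a two-tailed, one-sample proportion test against a null proportion of 50%, using the normal approximation. This is an assumption about the kind of test meant; the exact variant the authors ran is not stated here. The count of 396 is simply 80% of 495, derived for illustration.

```python
import math

def two_tailed_proportion_test(successes, n, p0=0.5):
    """One-sample z-test for a proportion (normal approximation).

    Returns the z statistic and the two-tailed p-value for the null
    hypothesis that the true proportion equals p0.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative input: 80% of 495 categories, i.e. 396 male-skewed ones.
male_skewed = round(0.80 * 495)
z, p = two_tailed_proportion_test(male_skewed, 495)
print(f"z = {z:.2f}, p < 0.0001: {p < 0.0001}")
```

Under these assumptions the observed proportion lies more than 13 standard errors above 50%, so the p-value falls far below the 0.0001 threshold quoted above.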

Impact of image vs. text search on users' gender bias
For the second part (which did not involve Wikipedia directly), the researchers ... conducted a nationally representative, preregistered experiment that shows that googling for images rather than textual descriptions of occupations amplifies gender bias in participants’ beliefs. To measure participants' gender bias after they had completed the googling task, an implicit association test (IAT) methodology was used, which supposedly reveals unconscious bias in a timed sorting task. In the researchers' words, "the participant will be fast at sorting in a manner that is consistent with one's latent associations, which is expected to lead to greater cognitive fluency [lower measured sorting times] in one's intuitive reactions." Specifically, the IAT variant used was designed to detect the implicit bias towards associating women with liberal arts and men with science. The test measured how long participants took to associate a particular word or image (e.g. "Girl", "Engineering", "Grandpa", "Fashion") with either the male/female or science/liberal arts categories.
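How sorting times turn into a bias score can be sketched as follows. This is a simplified version of the commonly used IAT "D" measure, in which the difference in mean latency between stereotype-incompatible and stereotype-compatible blocks is divided by the standard deviation of all trials. It is an assumption that the study scored the test along these lines; the function name and the latencies (in milliseconds) are hypothetical.

```python
import statistics

def iat_d_score(compatible_ms, incompatible_ms):
    """Simplified IAT D-score.

    Positive values mean the participant sorted faster in the
    stereotype-compatible block (e.g. men+science, women+liberal arts)
    than in the incompatible block.
    """
    mean_difference = statistics.mean(incompatible_ms) - statistics.mean(compatible_ms)
    # Standard deviation over all trials from both blocks together.
    pooled_sd = statistics.stdev(list(compatible_ms) + list(incompatible_ms))
    return mean_difference / pooled_sd

# Hypothetical response latencies for one participant, in milliseconds.
compatible = [600, 650, 700, 620]
incompatible = [800, 850, 900, 780]
d = iat_d_score(compatible, incompatible)
print(f"D = {d:.2f}")
```

Dividing by the standard deviation makes scores comparable between participants who are generally fast or slow at the sorting task.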

The labeling of text descriptions was performed by other humans recruited via Amazon Mechanical Turk. Both the test subjects and the labelers were adults from the United States, and the test subjects were screened to be representative of the U.S. population, including a nearly 50/50 male/female split (none self-identified as other than those two categories). The experiment focused on a sample of 22 occupations, e.g. immunologist, harpist, hygienist, and intelligence analyst.

Prior to the IAT, some test subjects were given a task involving text about occupations, and some were given a task involving images. The task was either to use Google search to retrieve images of representative individuals in the occupation, or to use Google search to retrieve a textual description of the occupation. A control group performed an unrelated Google search. Before the IAT was performed, the test subjects were required to indicate on a sliding scale, for each of the occupations, "which gender do you most expect to belong to this category?" The test was performed again a few days later with the same test subjects.

On the second test, subjects exposed to images in the first test had a stronger IAT score for bias than those exposed to text.

The experimental part of the study depends partly on IAT and partly on self-assessment to detect priming, and there are concerns about the replicability of the priming effect and about the validity and reliability of the IAT. Some of the concerns are described at. It seemed that the authors recognized this ("We acknowledge important continuing debate about the reliability of the IAT"), and in their own study found that "the distribution of participants' implicit bias scores [arrived at with IAT] was less stable across our preregistered studies than the distribution of participants' explicit bias scores", and discounted the implicit bias scores somewhat.

The conclusion drawn by the researchers, based partly but not entirely on the different IAT scores of experimental subjects, was that of the paper title: "images amplify gender bias", both explicitly, as determined by the subjects' assignments of occupation to gender on a sliding scale, and implicitly, as determined by reaction times measured in the IAT.

Takeaways
The paper opens with the (rather thinly referenced) observation that "Each year, people spend less time reading and more time viewing images". Combined with the finding that searching for occupation images on Google amplified participants' gender biases, this forms an "alarming" trend according to the study's lead author (Douglas Guilbeault of UC Berkeley's Haas School of Business), as quoted by AFP on "the potential consequences this can have on reinforcing stereotypes that are harmful, mostly to women, but also to men".

The researchers also determined, apart from experimental subjects, that the Internet – represented singularly by Google News – exhibits a strong gender bias. It was unclear to this reviewer how much of the reported Internet bias is really "Google selection bias". Based on these findings, the authors go on to speculate that "gender biases in multimodal AI may stem in part from the fact that they are trained on public images from platforms such as Google and Wikipedia, which are rife with gender bias according to our measures".

Briefly

 * See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
 * Submissions are open until April 22, 2024 for Wiki Workshop 2024, to take place on June 20, 2024. The virtual event will be the eleventh in this annual series (formerly part of The Web Conference), and is organized by the Wikimedia Foundation's research team with other collaborators. The call for contributions asks for 2-page extended abstracts which will be "non-archival, meaning we welcome ongoing, completed, and already published work."

Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
 * Compiled by Tilman Bayer

"Gender stereotypes embedded in natural language [of Wikipedia articles] are stronger in more economically developed and individualistic countries"
From the abstract: "[...] measuring stereotypes is difficult, particularly in a cross-cultural context. Word embeddings are a recent useful tool in natural language processing permitting to measure the collective gender stereotypes embedded in a society. [...] We considered stereotypes associating men with career and women with family as well as those associating men with math or science and women with arts or liberal arts. Relying on two different sources (Wikipedia and Common Crawl), we found that these gender stereotypes are all significantly more pronounced in the text corpora of more economically developed and more individualistic countries. [...] our analysis sheds light on the “gender equality paradox,” i.e. on the fact that gender imbalances in a large number of domains are paradoxically stronger in more developed/gender equal/individualistic countries." To determine "the relative contribution of residents from each country to each language [version of Wikipedia]", the author (a researcher at CNRS) used the Wikimedia Foundation's "WiViVi" dataset, which provides the percentage of pageviews per country for a given language Wikipedia. This data is somewhat outdated (last updated in 2018). Also, for the goal of measuring contribution (rather than consumption), the separate Geoeditors dataset might have been worth considering; it provides the number of editors per country, although with somewhat controversial privacy redactions.

"Poor attention: The wealth and regional gaps in event attention and coverage on Wikipedia"
From the abstract: "for many people around the world, [Wikipedia] serves as an essential news source for major events such as elections or disasters. Although Wikipedia covers many such events, some events are underrepresented and lack attention, despite their newsworthiness predicted from news value theory. In this paper, we analyze 17 490 event articles in four Wikipedia language editions and examine how the economic status and geographic region of the event location affects the attention [page views] and coverage [edits] it receives. We find that major Wikipedia language editions have a skewed focus, with more attention given to events in the world’s more economically developed countries and less attention to events in less affluent regions. However, other factors, such as the number of deaths in a disaster, are also associated with the attention an event receives." Relatedly, a 2016 paper titled "Dynamics and biases of online attention: the case of aircraft crashes" had found "that the attention given by Wikipedia editors to pre-Wikipedia aircraft incidents and accidents depends on the region of the airline for both English and Spanish editions. North American airline companies receive more prompt coverage in English Wikipedia. We also observe that the attention given by Wikipedia visitors is influenced by the airline region but only for events with a high number of deaths. Finally we show that the rate and time span of the decay of attention is independent of the number of deaths and a fast decay within about a week seems to be universal."

A new corpus of Wikipedia passages about events, paired with potential sources
From the abstract: "[...] we present FAMuS, a new corpus of Wikipedia passages that report on some event, paired with underlying, genre-diverse (non-Wikipedia) source articles for the same event. Events and (cross-sentence) arguments in both report and source are annotated against FrameNet, providing broad coverage of different event types. We present results on two key event understanding tasks enabled by FAMuS: source validation -- determining whether a document is a valid source for a target report event -- and cross-document argument extraction -- full-document argument extraction for a target event from both its report and the correct source article."

"Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities"
From the abstract of this preprint by a group of authors from Google Research and Georgia Institute of Technology:  "... we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model need to link an image onto a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study on state-of-the-art pre-trained models reveals large headroom in generalizing to the massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning."

"Understanding Structured Knowledge Production: A Case Study of Wikidata’s Representation Injustice"
From the paper: "... through a case study of comparing human [Wikidata] items of two countries, Vietnam and Germany, we propose several reasons that might lead to the existing biases in the knowledge contribution process. [...] We chose Germany and Vietnam as subjects based on three primary considerations. Firstly, both nations have comparable population sizes. Secondly, the editors who speak the predominant languages of each country maintain their distinct Wiki communities on Wikidata. [...] The first analysis we did was comparing different components of Wikidata pages between pages in two countries. The components we are comparing are labels, descriptions, claims, and sitelinks. For a single Wikidata page, label is the name that this item is known by, while description is a short sentence or phrase that also serves disambiguate purpose. [...] In the dataset we collected, there are 290,750 people who have citizenship of Germany, and there are only 4,744 people who have citizenship of Vietnam. [...] German pages on average had 13 more labels, 5 more descriptions and 7 more claims compared to Vietnamese pages. While surprisingly, Vietnamese pages had slightly more sitelinks, the difference according to effect size was negligible. The second analysis focused on the edit history of Wikidata items. [...] we quantified the attention metric into five features: Number of total edits, number of human edits, number of bot edits, and number of distinct bot and human edits. [...] in all the five features the [difference in means between the German and Vietnamese Wikidata human pages] is significant and in terms of bot activity and total activity, the effect size is beyond medium threshold (0.5)."

"The Politics of Memory: An Extended Case Study of the Memory of Crisis on Wikipedia"
From the abstract: "... an extended case study is developed on the (re)construction of a major pollution event (the [1952] Great Smog of London). Critical discourse analysis of intertextuality (connections between texts through hyperlinking and other shared patterning) is utilised to move from a focus on micro level practices to macro and meta level findings on the ordering of Wikipedia and its interactions with other institutions. Findings evidence a layered, self-referencing formation across texts, favouring the interests of established institutions and providing limited opportunity for marginalised groups to interact with sustained (re)constructions of the Great Smog. Comparison to a previous study of the constructed memory of a crisis (the London Bombings 2005) reveals dynamics across Wikipedia that lead to an emphasis on connecting (re)constructions to institutional traditions rather than the potential usefulness of such (re)construction for those at higher risk of negative outcomes arising from repeated crises."