Wikipedia:Wikipedia Signpost/2020-09-27/Recent research

"Uneven Coverage of Natural Disasters in Wikipedia: The Case of Floods"
A paper with this title, presented earlier this year at the "International Conference on Information Systems for Crisis Response and Management" (ISCRAM 2020), adds to the growing literature on Wikipedia's content biases, finding that while the English Wikipedia "offers good coverage of disasters, particularly those having a large number of fatalities [...] the coverage of floods in Wikipedia is skewed towards rich, English-speaking countries, in particular the US and Canada."

Any bias analysis of this kind is faced with the problem of identifying an unbiased "ground truth" that Wikipedia's coverage can be compared to. The researchers approach this diligently, resorting to "three of the most comprehensive databases documenting floods that are commonly used by the hydrology science for reference": Floodlist, which is funded by the EU's Copernicus Programme, the "Emergency Events Database" (EM-DAT), and the University of Colorado's Dartmouth Flood Observatory (DFO). Focusing on a timespan extending from 2016 to 2019, and following an elaborate process involving e.g. defining search criteria for each source and deduplicating the results, they arrived at a consolidated dataset consisting of 1102 flood events, of which only 249 were present in all three databases. The authors asked experts to identify possible reasons for these discrepancies (or biases) between the sources, e.g. the fact that Floodlist includes landslides resulting from heavy rain events that do not meet the definitions of the other two sources. They concluded that these explanations justified relying on events that were covered in at least two of the three sources, resulting in a dataset consisting of 458 floods.
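The "at least two of the three sources" criterion amounts to a simple set operation. A minimal sketch of that step (with invented event identifiers, not the paper's actual records) might look like:

```python
# Hedged sketch of the consolidation step described above: after deduplication,
# keep only flood events reported by at least two of the three databases.
# Event IDs below are invented for illustration.

def consolidate(floodlist, emdat, dfo, min_sources=2):
    """Return the events present in at least `min_sources` of the three sets."""
    sources = [set(floodlist), set(emdat), set(dfo)]
    all_events = set().union(*sources)
    return {e for e in all_events if sum(e in s for s in sources) >= min_sources}

floodlist = {"flood-A", "flood-B", "flood-C"}
emdat = {"flood-B", "flood-C", "flood-D"}
dfo = {"flood-C", "flood-E"}

print(sorted(consolidate(floodlist, emdat, dfo)))  # ['flood-B', 'flood-C']
```

In this toy example, only "flood-C" would have been in the 249-event three-source intersection, while the two-source threshold also admits "flood-B".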

The comparison dataset representing Wikipedia's coverage was constructed using keyword searches to find individual sentences mentioning flood events (rather than entire articles, which one might identify more easily using e.g. Category:Floods).

The analysis of the data focuses on the "hit rate" per country, defined as the percentage of floods from the ground truth dataset that have at least one corresponding item in the Wikipedia dataset. The United States was both the country with the highest number of floods in the ground truth dataset (36, followed by Indonesia with 25 and the Philippines with 17), and the country with by far the highest hit rate (86.11%) among the countries with the highest number of floods. Aggregated by continent, North America likewise had the highest Wikipedia coverage (49.06%), and South America the lowest (10.53%). Interestingly, Europe did not fare very well, with a hit rate of 21.18%, slightly below that of Africa (21.88%) and way behind Asia (which had 37.63% of its floods covered on English Wikipedia).
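For readers who want the metric pinned down, the per-country hit rate is simply the share of ground-truth floods with at least one matching Wikipedia mention. A minimal sketch, using made-up data and hypothetical event IDs:

```python
# A minimal sketch of the per-country "hit rate" metric: the percentage of
# ground-truth floods with at least one matching Wikipedia mention.
# Data and event IDs below are made up for illustration.
from collections import defaultdict

def hit_rate(ground_truth, wikipedia_events):
    """ground_truth: iterable of (country, event_id) pairs;
    wikipedia_events: set of event IDs found mentioned in Wikipedia."""
    totals, hits = defaultdict(int), defaultdict(int)
    for country, event in ground_truth:
        totals[country] += 1
        if event in wikipedia_events:
            hits[country] += 1
    return {c: 100 * hits[c] / totals[c] for c in totals}

ground_truth = [("US", 1), ("US", 2), ("ID", 3), ("ID", 4)]
print(hit_rate(ground_truth, {1, 2, 3}))  # {'US': 100.0, 'ID': 50.0}
```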

To identify possible causes of the differing hit rates by country, the authors "analyzed several socio-economic variables to see whether they correlate with floods coverage. These variables are GDP per capita, GNI per capita, country, continent, date, fatalities, number of English speakers and vulnerability index." This analysis consists of presenting various tables and graphs with the hit rate plotted over four to six buckets of the independent variable (e.g. Low income / lower middle income / upper middle income / high income), eschewing more sophisticated statistical methods. They find some evidence for a bias toward higher-income countries, although the trend is not entirely consistent (e.g. in a different classification into six instead of four income levels, the second-lowest level "Lower middle income" had a higher hit rate than the three above it). They also find evidence that countries with a higher ratio of English speakers have better coverage, although "The language can be only a partial explanation because for floods in Australia the hit-rate is half and lower than other non-English-speaking countries" (similarly, the UK only ranked 16th in Wikipedia coverage among the top 20 countries with at least five floods in the ground truth data).
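The bucketed comparison can be sketched as grouping countries by income level and averaging the hit rate per bucket; all countries, labels, and figures below are invented for illustration, not the paper's data:

```python
# Illustrative sketch of the bucketed analysis: mean hit rate per income group.
# Countries, bucket labels, and rates are invented, not taken from the paper.
from collections import defaultdict

rows = [  # (country, income bucket, hit rate in %)
    ("US", "high income", 86.1),
    ("CA", "high income", 60.0),
    ("IN", "lower middle income", 40.0),
    ("PH", "lower middle income", 35.0),
    ("NE", "low income", 10.0),
]

rates_by_bucket = defaultdict(list)
for _, bucket, rate in rows:
    rates_by_bucket[bucket].append(rate)

mean_by_bucket = {b: sum(r) / len(r) for b, r in rates_by_bucket.items()}
for bucket, mean in mean_by_bucket.items():
    print(f"{bucket}: {mean:.1f}%")
```

Such a per-bucket mean is easy to read off a bar chart, but as the review notes, it stops well short of regression or significance testing.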

Still, the paper's overall conclusion is that "Wikipedia’s coverage is biased towards some countries, particularly those that are more industrialized and have large English-speaking populations, and against some countries, particularly low-income countries which also happen to be among the most vulnerable".

Unfortunately the researchers fail to acknowledge their own glaring bias in this research, namely the decision to focus exclusively on the English Wikipedia in a paper that repeatedly wrings its hands over language disparities. To be sure, this bias has long been identified as an issue affecting a large part of Wikipedia research, and there are practical reasons for confining such an analysis to a language that researchers are fluent in. But since the authors clearly seem to frame such biases as a bad thing (at one point referring to them as "flaws" of Wikipedia), it is worth asking whether and why they think that the authors of reference works like Wikipedia should not focus their labor on those natural disasters that are more likely to affect their readers. While the study's confinement to only one of Wikipedia's hundreds of languages is mentioned in the "Limitations and future work" section, it is again framed just as an open question about Wikipedia's shortcomings ("understand how an editor’s language affects the coverage bias"), rather than as an acknowledgment of the paper's own.

Briefly

 * See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.''

"Automated Adversarial Manipulation of Wikipedia Articles" using Markov chains
From the abstract and paper:  "The WikipediaBot is a self-contained mechanism with modules for generating credentials for Wikipedia editors, bypassing login protections, and a production of contextually-relevant adversarial edits for target Wikipedia articles that evade conventional detection. The contextually-relevant adversarial edits are generated using an adversarial Markov chain that incorporates a linguistic manipulation attack known as MIM or malware-induced misperceptions. [...]

To show how the WikipediaBot could be used to harm discourse, we analyzed a scenario where a hypothetical adversary aims to reduce mentions of Uyghurs on the Uyghurs Wikipedia page [e.g. by replacing "the ongoing repression of the Uyghurs" with "the ongoing repression of the Manchus", and other edits suggested by the MIM engine]. ... we contacted the Wikipedia security team with the details and the inner workings of WikipediaBot prior to writing this publication as part of the responsible disclosure requirement. The exposure of the WikipediaBot system architecture allows for consideration of other types of detection, prevention, and defenses then the one proposed in this paper [which was "to add a more robust CAPTCHA system to prevent edits to individual pages"]. We only tested the WikipediaBot on a local, isolated testbed, and never used it to make any adversarial manipulation on the live Wikipedia platform."

"From web to SMS: A text summarization of Wikipedia pages with character limitation"
From the abstract:  "Due to the limitation of the number of characters, a Wikipedia page cannot always be sent through SMS. This work raises the issue of text summarization with character limitation. To solve this issue, two extractive approaches have been combined: LSA and TextRank algorithms. [...] The evaluation showed the relevance of the approach for pages of at most 2000 characters. The system has been tested using the SMS simulator of RapidSMS without a GSM gateway to simulate the deployment in a real environment." (Compare also previous efforts to make Wikipedia accessible via text messaging)

"RuBQ: A Russian Dataset for Question Answering over Wikidata"
From the abstract:  "The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of question-answer pairs from online quizzes. The data underwent automatic filtering, crowd-assisted entity linking, automatic generation of SPARQL queries, and their subsequent in-house verification."

"Topological Data Analysis on Simple English Wikipedia Articles"
From the abstract and paper:  "We present three statistical approaches for comparing geometric data using two-parameter persistent homology [a tool from topological data analysis], and we demonstrate the applicability of these approaches on high-dimensional point-cloud data obtained from Simple English Wikipedia articles. [...] The data in this project was produced by applying a Word2Vec algorithm to the text of articles in Simple English Wikipedia, [converting] each of 120,526 articles into a 200-dimension vector, such that articles with similar content produce vectors that are close together. The data also gives a popularity score for each article, indicating how frequently the article is accessed in Simple English Wikipedia. Abstractly, our data is a point cloud of 120,526 points in $$\mathbb{R}^{200}$$, with a real-valued function on each point ..."

Dataset provides "interesting negative information" for Wikidata
From the abstract:  "Rooted in a long tradition in knowledge representation, all popular KBs [knowledge bases] only store positive information, but abstain from taking any stance towards statements not contained in them. In this paper, we make the case for explicitly stating interesting statements which are not true. [...] We introduce two approaches towards automatically compiling negative statements. [...] Experimental results show that both approaches hold promising and complementary potential. Along with this paper, we publish the first datasets on interesting negative information, containing over 1.4M statements for 130K popular Wikidata entities." See also Video and slides, OpenReview page, dataset

Amazon Alexa researchers measure "social bias" on Wikidata
From the abstract:  "We present the first study on social bias in knowledge graph embeddings, and propose a new metric suitable for measuring such bias. We conduct experiments on Wikidata and Freebase, and show that, as with word embeddings, harmful social biases related to professions are encoded in the embeddings with respect to gender, religion, ethnicity and nationality. For example, graph embeddings encode the information that men are more likely to be bankers, and women more likely to be homekeepers." The paper also contains lists of the top male and female professions in Wikidata (relative to female and male, respectively), evaluated by two different metrics. For the first metric (TransE embeddings), the male list is led by baritone, military commander, banker, racing driver and engineer. The top five entries on the corresponding female professions list are nun, feminist, soprano, suffragette, and mezzo-soprano.

"Wikipedia and Westminster: Quality and Dynamics of Wikipedia Pages about UK Politicians"
From the abstract:  "First, we analyze spatio-temporal patterns of readers' and editors' engagement with MPs' Wikipedia pages, finding huge peaks of attention during election times, related to signs of engagement on other social media (e.g. Twitter). Second, we quantify editors' polarisation and find that most editors specialize in a specific party and choose specific news outlets as references. Finally we observe that the average citation quality is pretty high, with statements on 'Early life and career' missing citations most often (18%)."