Wikipedia:Wikipedia Signpost/2017-01-17/Recent research



Getting more female editors may not increase the ratio of articles about women

 * Reviewed by Reem Al-Kashif

A bachelor's degree thesis by Feli Nicolaes finds that, contrary to the general perception, male and female editors do not tend to edit biographical articles on people of their own gender.

Previous research suggested that one solution to the lack of Wikipedia's biographies of women could be to increase the number of female editors. This was based on the assumption that women would prefer to edit women's biographies, and men would prefer to edit men's biographies. Nicolaes refers to this as homophily in her thesis, "Gender bias on Wikipedia: an analysis of the affiliation network". However, homophily has so far neither been formally investigated nor proved to exist in Wikipedia. Nicolaes analyzes this using datasets from her research group at the University of Amsterdam, of English Wikipedia editors and the pages they edit. She tracks the editing behavior of both self-identified male and female editors on Wikipedia. Contrary to the mainstream assumption, homophily was not found. In other words, female users' edits are not focused on female biography pages. In fact, Nicolaes finds “inverted homophily” when considering female users who edit a single biographical article more than 200 times: they are more likely to direct this amount of attention to biography articles about men than male editors are.

This brings to mind an initiative to increase content about women—be it biography articles or other content related to women—that has been live since December 2015 in the Arabic Wikipedia. The initiative is in a form of contest where male and female editors try to achieve as much as they can from their self-set goals. Over the four rounds of the contest, only one woman reached the top three in two rounds. So, if the goal is to add more content about women, bringing more women might not be useful. However, Nicolaes also argues that the study should be replicated on larger datasets to validate the results. It remains to be seen whether the same editor behaviour exists in other language editions. Another limitation of the study is its apparent reliance on the gender information that editors publicly state in their user preferences—a method that is widely used but may be susceptible to biases (discussed in more detail in this review).

Theorizing the foundations of the gender gap

 * Reviewed by Aaron Shaw

In a forthcoming paper, "'Anyone can edit' not everyone does: Wikipedia and the gender gap", Heather Ford and Judy Wajcman use some of the theoretical tools of feminist science and technology studies (STS) to describe underpinnings of the Wikipedia gender gap. The authors argue that three aspects of Wikipedia's infrastructure define it as a particularly masculine or male-dominated project:
 * (1) the epistemological foundations of what constitutes valid encyclopedic knowledge,
 * (2) Wikipedia's software infrastructure, and
 * (3) Wikipedia's policy infrastructure.

The authors argue that each of these arenas represents a space where male activity and masculine norms of truth, scientific fact, legitimacy, and freedom define boundaries of legitimate contribution and action. Accordingly, these boundaries of legitimate contribution and action systematically exclude or devalue perspectives and contributions that could overcome the lack of female participation or perspectives in the Wikipedia projects. The result, according to Ford and Wajcman, is that Wikipedia has created a novel and powerful form of knowledge-production expertise on a foundation that reproduces existing gender hierarchies and inequalities.

How old and new astronomy papers are being cited

 * Reviewed by Piotr Konieczny

The author analyzes Wikipedia's citations to academic peer reviewed articles, finding that "older papers from before 2008 are increasingly less likely to be cited". The authors attempt to use Wikipedia citations as a proxy for public interest in astronomy, though the analysis makes no comparison to other research about public interest in sciences. The article notes that citations to articles from 2008 are most common, and it represents the peak of citations, with fewer and fewer citations for years since 2008. The analysis is also limited due to the cut-off date (1996), "because Scopus indexing of journals changes in this year". The author concludes that the observed citation pattern is likely "consistent with a moderate tendency towards obsolescence in public interest in research", as papers become obsolete and newer ones are more likely to be cited; older papers are cited for timeless, uncontroversial facts, and newer for newer findings. They also note that the late 2000s, i.e. the years around 2008, may represent when most of Wikipedia's content in astronomy was created, though this is not backed up by much besides speculation. Overall, it is an interesting question, but one that does not provide any surprising insights.

Wikipedia is not a suitable source for election predictions

 * Reviewed by Piotr Konieczny

The topic of this conference paper, "Election prediction based on Wikipedia pageviews", is certainly timely. The authors look at which of Wikipedia's articles related to the US presidential election registered high popularity, and then ask whether elections can be predicted based "on the number of views the spiking pages have and on the correlation between these pages and the presidential nominees or their political program". They provide an online visualization showing some "Wikipedia topics that have spiked before, during or after [an] election event."

The authors limit themselves (reasonably) to the English and Spanish Wikipedias. They do a good job of presenting their methods, and outlining problems with gathering data on popularity of articles—something that would be much easier if Wikipedia articles and databases were more friendly when it comes to information about their popularity. Within the limitations described in the paper, the authors conclude that Wikipedia articles about politicians are used mostly after, not before or during debates or other events such as primaries or elections, which suggests that they are not used for fact checking but instead as an information source after the event. "Wikipedia is not, in fact, a reliable polling source", write the authors, based on (this could be clarified further) the fact that people check Wikipedia after the events, not before them, hence making Wikipedia's pageviews problematic for prediction.

"Black Lives Matter in Wikipedia: Collaboration and collective memory around online social movements"

 * Reviewed by Piotr Konieczny

In this paper, the researchers look at the relation between the Black Lives Matter (BLM) social movement and its coverage in Wikipedia, asking the following research questions: They aim to contribute to academic discourse on social movements and claim to describe "knowledge production and collective memory in a social computing system as the movement and related events are happening." They conclude that Wikipedia is a neutral platform, but does indirectly support (or hinder) the movement (or its opponents) by virtue of increased visibility, in the same vein as coverage by the media would. The quality of the movement's history and documentation on Wikipedia is judged to be of higher value, accessibility, and quality than snapshots on social media platforms like Twitter. Wikipedia also provides space for interested editors to work on articles indirectly related to BLM, further increasing the visibility of related topics, as interested editors move beyond direct BLM articles to other aspects. Examples include historical articles about events preceding BLM that would probably not be written/expanded on in Wikipedia if not for the rise of the BLM movement. The authors conclude that social movement activists can use Wikipedia to document their activities without compromising Wikipedia's neutrality or other policies: "Without breaking with community norms like NPOV, Wikipedia became a site of collective memory documenting mourning practices as well as tracing how memories were encoded and re-interpreted." This is a valuable argument that draws interesting connections between Wikipedia and social movements, particularly considering that some (like this reviewer) consider Wikipedia itself to be a social movement.
 * 1) "How has Wikipedia editing activity and the coverage of BLM movement events changed over time?"
 * 2) "How have Wikipedians collaborated across articles about events and the BLM movement?" and
 * 3) "How are events on Wikipedia re-appraised following new events?"

Conferences and events
The third annual Wiki Workshop will take place on April 4 as part of the WWW2017 conference in Perth, Australia. The workshop serves as a platform for Wikimedia researchers to get together on an annual basis and share their research with each other (see also our overview of the papers from the 2016 edition). All Wikimedia researchers are encouraged to submit papers for the workshop and attend it. More details at the call for papers.

See the research events page on Meta-wiki for other upcoming conferences and events, including submission deadlines.

Other recent publications
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.''
 * "Facilitating the use of Wikidata in Wikimedia projects with a user-centered design approach" From the abstract: "In its current form, [data from Wikidata] is not used to its full potential [on other Wikimedia projects] for a multitude of reasons, as user acceptance is low and the process of data integration is unintuitive and complicated for users. This thesis aims to develop a concept using user-centered design to facilitate the editing of Wikidata data from Wikipedia. With the involvement of the Wikimedia community, a system is designed which integrates with pre-existing work flows."
 * "A corpus of Wikipedia discussions: over the years, with topic, power and gender labels" From the abstract: "... we present a large corpus of Wikipedia Talk page discussions that are collected from a broad range of topics, containing discussions that happened over a period of 15 years. The dataset contains 166,322 discussion threads, across 1236 articles/topics that span 15 different topic categories or domains. The dataset also captures whether the post is made by an registered user or not, and whether he/she was an administrator at the time of making the post. It also captures the Wikipedia age of editors in terms of number of months spent as an editor, as well as their gender."
 * "Wikipedia and the politics of openness" Two reviews of the 2014 book with this title, in the journal Information, Communication & Society and in Contemporary Sociology: A Journal of Reviews , with the latter summarizing the book as follows: "Tkacz's text has three main empirical chapters. The first sorts out the 'politics of openness,' by which he means how collaboration emerges and forms in an open-ended context. The second empirical contribution is about the possibility that the framing of social interaction might, by itself, be enough to create order and encourage productivity in an environment like Wikipedia. ... The third empirical contribution is that project exit has an extremely important role in maintaining the stability of Wikipedia. As people develop projects, they create parallel, break-off versions of a project [forks]."
 * "Derivation of 'is a' taxonomy from Wikipedia category graph"
 * "'En Wikipedia no se escribe jugando': Identidad y motivación en 10 wikipedistas regiomontanos." From the English abstract: "This study qualitatively analyses the contributions in the talk pages of the Spanish Wikipedia by the ten most-active registered users in Monterrey, Mexico. Using virtual ethnography ... this research finds that these self-styled 'wikipedistas' assume the site's collective identity when interacting with anonymous users, and that their main motivations for ongoing participation are not related to the repository of knowledge in itself, but to their group dynamics and inter-personal relationships within the community."
 * "Schreiben in der Wikipedia" ("Writing in Wikipedia") From the book (translated): "From the perspective of Wikipedia research, it can be observed that Wikipedia must not be regarded as a community medium ['gemeinschaftliches Medium'] per se, but that it reflects a conglomerate of individual and community writing processes, which in turn both influence the text genesis, with differing scopes. This chronological development is laid open here for the first time in case of some exemplary article texts, and subsequently, specific properties of each article topic are related to creation of the article that is based on it."
 * "Beyond the Book: linking books to Wikipedia" From the abstract: "The book translation market is a topic of interest in literary studies, but the reasons why a book is selected for translation are not well understood. The Beyond the Book project investigates whether web resources like Wikipedia can be used to establish the level of cultural bias. This work describes the eScience tools used to estimate the cultural appeal of a book: semantic linking is used to identify key words in the text of the book, and afterwards the revision information from corresponding Wikipedia articles is examined to identify countries that generated a more than average amount of contributions to those articles. ... We assume a lack of contributions from a country may indicate a gap in the knowledge of readers from that country. We assume that a book dealing with that concept could be more exotic and therefore more appealing for certain readers ... An indication of the 'level of exoticness' thus could help a reader/publisher to decide to read/translate the book or not. Experimental results are presented for four selected books from a set of 564 books written in Dutch or translated into Dutch, assessing their potential appeal for a Canadian audience."
 * "A multilingual approach to discover cross-language links in Wikipedia" From the abstract: "... given a Wikipedia article (the source) EurekaCL uses the multilingual and semantic features of BabelNet 2.0 in order to efficiently identify a set of candidate articles in a target language that are likely to cover the same topic as the source. The Wikipedia graph structure is then exploited both to prune and to rank the candidates. Our evaluation carried out on 42,000 pairs of articles in eight language versions of Wikipedia shows that our candidate selection and pruning procedures allow an effective selection of candidates which significantly helps the determination of the correct article in the target language version."
 * "Analyzing organizational routines in online knowledge collaborations: a case for sequence analysis in CSCW" From the abstract: "Research into socio-technical systems like Wikipedia has overlooked important structural patterns in the coordination of distributed work. This paper argues for a conceptual reorientation towards sequences as a fundamental unit of analysis for understanding work routines in online knowledge collaboration. Using a data set of 37,515 revisions from 16,616 unique editors to 96 Wikipedia articles as a case study, we analyze the prevalence and significance of different sequences of editing patterns." See also slides and a separate review by Aaron Halfaker ("This is a weird paper. It isn't actually a study. It's more like a methods position paper.")
 * "Wikipedia: medium and model of collaborative public diplomacy" From the abstract: "Taking a case-study approach, the article posits that Wikipedia holds a dual relevance for public diplomacy 2.0: first as a medium; and second, as a model for public diplomacy's evolving process. Exploring Wikipedia's folksonomy, crowd-sourced through intense and organic collaboration, provides insights into the potential of collective agency and symbolic advocacy."
 * "Enabling fine-grained RDF data completeness assessment" From the abstract: "The idea of the paper is to have completeness information over RDF data sources and use it for checking query completeness. In particular, [for Wikidata,] an indexing technique was developed to allow to scale completeness reasoning to Wikidata-scale data sources. The applicability of the framework was verified using Wikidata and COOL-WD, a completeness tool for Wikidata, was developed. The tool is available at http://cool-wd.inf.unibz.it/ "
 * "Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO" From the abstract: "In recent years, several noteworthy large, cross-domain and openly available knowledge graphs (KGs) have been created. These include DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Although extensively in use, these KGs have not been subject to an in-depth comparison so far. In this survey, we provide data quality criteria according to which KGs can be analyzed and analyze and compare the above mentioned KGs." From the paper: "... Wikidata covers all relations of the gold standard, even though it contains considerably less relations [than Freebase] (1,874 vs. 70,802). The Wikidata methodology to let users propose new relations, to discuss about their coverage and reach, and finally to approve or disapprove the relations, seems to be appropriate."Lab mouse mg 3244.jpg'' had all its genes imported into Wikidata]]
 * "Wikidata as a semantic framework for the Gene Wiki initiative" From the abstract: "... we imported all human and mouse genes, and all human and mouse proteins into Wikidata. In total, 59 721 human genes and 73 355 mouse genes have been imported from NCBI and 27 306 human proteins and 16 728 mouse proteins have been imported from the Swissprot subset of UniProt. … The first use case for these data is to populate Wikipedia Gene Wiki infoboxes directly from Wikidata with the data integrated above. This enables immediate updates of the Gene Wiki infoboxes as soon as the data in Wikidata are modified. … Apart from the Gene Wiki infobox use case, a SPARQL endpoint and exporting functionality to several standard formats (e.g. JSON, XML) enable use of the data by scientists."
 * "Connecting every bit of knowledge: The structure of Wikipedia's First Link Network" From the abstract: "By following the first link in each article, we algorithmically construct a directed network of all 4.7 million articles: Wikipedia's First Link Network. … By traversing every path, we measure the accumulation of first links, path lengths, groups of path-connected articles, and cycles. … we find scale-free distributions describe path length, accumulation, and influence. Far from dispersed, first links disproportionately accumulate at a few articles---flowing from specific to general and culminating around fundamental notions such as Community, State, and Science. Philosophy directs more paths than any other article by two orders of magnitude. We also observe a gravitation towards topical articles such as Health Care and Fossil Fuel." (See also media coverage: "All Wikipedia Roads Lead to Philosophy, but Some of Them Go Through Southeast Europe First" and Getting to Philosophy)