Wikipedia:Wikipedia Signpost/2020-01-27/Recent research

Wikipedia as a learning resource (for programmers)

 * Reviewed by Isaac Johnson

"Understanding Wikipedia as a Resource for Opportunistic Learning of Computing Concepts" by Martin P. Robillard and Christoph Treude of McGill University and University of Adelaide and published in SIGCSE 2020 examines the utility of Wikipedia articles about computing concepts for novice programmers. The researchers recruit 18 students with varying computer science backgrounds to read Wikipedia articles about computing concepts that are new to them. The authors use a sample of four Wikipedia articles that appear frequently in Stack Overflow posts (Dependency injection; Endianness; Levenshtein distance; Regular expression). Side note: in a sample of 44 million posts on Stack Overflow that the authors process, 360 thousand (0.8%) have a Wikipedia link, pointing to 40 thousand different Wikipedia articles in aggregate. They indicate that this rate of linking to Wikipedia is similar on the Reddit subreddit r/programming as well. The participants are instructed to use a think-aloud method where they talk through what they are doing and thinking as they try to learn about the concept. The authors then analyzed the transcripts from these interviews to determine what themes were consistent across the students.

The researchers identified the following challenges to learning from Wikipedia:
 * Concept Confusion: when vocabulary or notation has a different meaning in other contexts, readers who think they already know what they are reading can be misled.
 * Need for Examples: explanations are not always enough. Examples are often desired.
 * New Terminology: encountering too many unfamiliar terms can be frustrating for readers.
 * Trivia Clutter: peripheral information that is not core to learning a concept can make it hard to find the most useful information, especially for readers who are not native English speakers.
 * Unfamiliar Notation: mathematical notation and code in articles are generally not explained, which can confuse readers who are not familiar with them.
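The "Need for Examples" finding is easy to appreciate with a concept like Levenshtein distance, another of the study's sampled articles: a short worked example often conveys more than a formal recurrence. A minimal sketch of the standard dynamic-programming formulation in Python (the function and its structure are illustrative, not taken from the paper):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between a[:i-1] and b[:j];
    # each pass of the outer loop computes the next row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Three edits (substitute k→s, substitute e→i, insert g) turn "kitten" into "sitting"; seeing that trace arguably does more for a novice than the recurrence relation alone.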

While the authors conclude that linking to more structured learning resources from Stack Overflow and related forums might be beneficial, this research also prompts reflection on how Wikipedia itself might become a more effective learning resource. For instance, page previews are a clear improvement for readers who are not familiar with the concepts mentioned in an article. The other challenges point to the value of surfacing examples in articles, not relying solely on mathematical notation to explain a concept, and having a clear lede paragraph. Two other thoughts about this research:
 * The authors describe how computer programmers often end up at Wikipedia by way of Stack Overflow posts that link to Wikipedia as a means of better understanding concepts mentioned in an answer. The ability of these communities to build on Wikipedia is a lovely example of beneficial re-use, and it has been examined more widely in work by Vincent et al. (see this past write-up).
 * As machine translation is explored as a means of supporting content creation (e.g., via the content translation tool) or of providing access to knowledge in other languages (e.g., automatic translations in search), it is useful to understand what articles are particularly difficult for novices to learn from, such as the computing concepts studied in this research. This is content that likely becomes even more confusing if imperfect machine translation leads to odd sentence structure or word choice. Perhaps it should be prioritized for cleanup by native speakers.

Briefly

 * A new public dataset on active editors by country has been released by the Wikimedia Foundation. (See meta:Research:Data for other data sources available to researchers.)

Other recent publications
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.''
 * Compiled by Tilman Bayer and Miriam Redi

"A systematic literature review on Wikidata"
From the abstract:  "Despite Wikidata’s potential and the notable rise in research activity, the field is still in the early stages of study. Most research is published in conferences, highlighting such immaturity, and provides little empirical evidence of real use cases. Only a few disciplines currently benefit from Wikidata’s applications and do so with a significant gap between research and practice. Studies are dominated by European researchers, mirroring Wikidata’s content distribution and limiting its Worldwide applications." See also earlier coverage of a related paper by Piscopo et al.: "First literature survey of Wikidata quality research", and the following preprint

"Wikidata from a Research Perspective -- A Systematic Mapping Study of Wikidata"
From the abstract:  "[We conduct] a systematic mapping study in order to identify the current topical coverage of existing research [on Wikidata] as well as the white spots which need further investigation. [...] 67 peer-reviewed research from journals and conference proceedings were selected, and classified into meaningful categories. We describe this data set descriptively by showing the publication frequency, the publication venue and the origin of the authors and reveal current research focuses. These especially include aspects concerning data quality, including questions related to language coverage and data integrity. These results indicate a number of future research directions, such as, multilingualism and overcoming language gaps, the impact of plurality on the quality of Wikidata's data, Wikidata's potential in various disciplines, and usability of user interface."

"Extracting Literal Assertions for DBpedia from Wikipedia Abstracts"
From the abstract:  "[W]e present an approach for extracting numerical and date literal values from Wikipedia abstracts [apparently a reference to the lead section of Wikipedia articles or their first paragraph]. We show that our approach can add 643k additional literal values to DBpedia at a precision of about 95%."

"Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph"
From the abstract and accompanying note:  "A major challenge for bringing Wikidata to its full potential was to provide reliable and powerful services for data sharing and query, and the WMF has chosen to rely on semantic technologies for this purpose. A live SPARQL endpoint, regular RDF dumps, and linked data APIs are now forming the backbone of many uses of Wikidata. We describe this influential use case and its underlying infrastructure [...] The data used in this publication is available in the form of anonymised Wikidata SPARQL query logs. [....] The paper has won the Best Paper Award in the In-Use Track of ISWC 2018."

"Who Models the World?: Collaborative Ontology Creation and User Roles in Wikidata"
From the abstract:  " we study the relationship between different Wikidata user roles and the quality of the Wikidata ontology. [...] Our analysis shows that the Wikidata ontology has uneven breadth and depth. We identified two user roles: contributors and leaders. The second category is positively associated to ontology depth, with no significant effect on other features. Further work should investigate other dimensions to define user profiles and their influence on the knowledge graph." See also comments about the paper by Daniel Mietchen, and earlier coverage of a related paper: "Participation patterns on Wikidata"

"The Evolution of Power and Standard Wikidata Editors: Comparing Editing Behavior over Time to Predict Lifespan and Volume of Edits"
From the abstract:  "We ... study the evolution that [Wikidata] editors with different levels of engagement exhibit in their editing behaviour over time. We measure an editor’s engagement in terms of (i) the volume of edits provided by the editor and (ii) their lifespan (i.e. the length of time for which an editor is present at Wikidata). The large-scale longitudinal data analysis that we perform covers Wikidata edits over almost 4 years. We monitor evolution in a session-by-session- and monthly-basis, observing the way the participation, the volume and the diversity of edits done by Wikidata editors change. Using the findings in our exploratory analysis, we define and implement prediction models that use the multiple evolution indicators."

"Following the footsteps of giants: Modeling the mobility of historically notable individuals using Wikipedia"
This study found that the places to which historically notable people migrated are limited to a few locations, depending on discipline and opportunities.

"GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies"
This preprint presents a tool for extracting multilingual and gender-balanced parallel corpora at sentence level, with document and gender information.

"On the Relation of Edit Behavior, Link Structure, and Article Quality on Wikipedia"
This study found that on Wikipedia, controversial and high-quality articles differ from others, according to metrics quantifying editing and linking behavior.

"Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering"
This preprint presents a new graph-based recurrent retrieval approach to answer multi-hop open-domain questions through the Wikipedia link graph.

"Collectively biased representations of the past: Ingroup Bias in Wikipedia articles about intergroup conflicts"
From the abstract:  "we compared articles about the same intergroup conflicts (e.g., the Falklands War) in the corresponding language versions of Wikipedia (e.g., the Spanish and English Wikipedia articles about the Falklands War). Study 1 featured a content coding of translated Wikipedia articles by trained raters, which showed that articles systematically presented the ingroup in a more favourable way (e.g., Argentina in the Spanish article and the United Kingdom in the English article) and, in reverse, the outgroup as more immoral and more responsible for the conflict. These findings were replicated and extended in Study 2, which was limited to the lead sections of articles but included considerably more conflicts and many participants instead of a few trained coders. This procedure [identified] a stronger ingroup bias for (1) more recent conflicts and (2) conflicts in which the proportion of ingroup members among the top editors was larger. Finally, a third study ruled out that these effects were driven by translations or the raters’ own nationality. Therefore, this paper is the first to demonstrate ingroup bias in Wikipedia – a finding that is of practical as well as theoretical relevance as we outline in the discussion."

"People tend to do more when collaborating with more people" on Wikipedia
From the abstract:  "[We study] two planetary-scale collaborative systems: GitHub and Wikipedia. By analyzing the activity of over 2 million users on these platforms, we discover that the interplay between group size and productivity exhibits complex, previously-unobserved dynamics: the productivity of smaller groups scales super-linearly with group size, but saturates at larger sizes. This effect is not an artifact of the heterogeneity of productivity: the relation between group size and productivity holds at the individual level. People tend to do more when collaborating with more people. We propose a generative model of individual productivity that captures the non-linearity in collaboration effort. The proposed model is able to explain and predict group work dynamics in GitHub and Wikipedia by capturing their maximally informative behavioral features."