Wikipedia:Wikipedia Signpost/2016-01-27/Recent research

Burstiness in Wikipedia editing

 * Reviewed by Brian Keegan

Wikipedia pages are edited with widely varying intensity: stubs may have only a dozen or fewer revisions, while controversial topics might have more than 10,000. Nor is this editing activity evenly spaced over time: some revisions occur in very quick succession, while others persist for weeks or months before another change is made. Many social and technical systems exhibit such "bursty" behavior, with periods of intensive activity separated by long periods of inactivity. In a pre-print submitted to arXiv, a team of physicists at the Université de Namur in Belgium and the University of Coimbra in Portugal examines this phenomenon of "burstiness" in editing activity on the English Wikipedia.

The authors use a database dump containing the revision history, up to January 2010, of 4.6 million English Wikipedia pages. After filtering out bots, edits from unregistered accounts, and pages and editors with fewer than 2000 revisions, the paper applies previously defined measures of burstiness and cyclicality to these editing patterns. The burstiness and memory measured for editors' revisions fall outside the limits found in prior work on human dynamics, suggesting that different mechanisms are at work in Wikipedia editing than in, for example, mobile phone communication.
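
As a rough illustration of what such measures look like, the sketch below computes one widely used pair of definitions from the human-dynamics literature (the burstiness parameter and memory coefficient of Goh and Barabási). That these are the exact measures the paper adopts is an assumption, and the function names and toy timestamps are hypothetical.

```python
from statistics import mean, pstdev


def inter_event_times(timestamps):
    """Gaps between consecutive events, after sorting the timestamps."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


def burstiness(gaps):
    """B = (sigma - mu) / (sigma + mu).

    B is -1 for perfectly regular events, near 0 for a Poisson process,
    and approaches +1 for extremely bursty sequences.
    """
    mu, sigma = mean(gaps), pstdev(gaps)
    return (sigma - mu) / (sigma + mu)


def memory(gaps):
    """Pearson correlation between each gap and the gap that follows it.

    Undefined (division by zero) if either series of gaps is constant.
    """
    first, second = gaps[:-1], gaps[1:]
    m1, m2 = mean(first), mean(second)
    s1, s2 = pstdev(first), pstdev(second)
    cov = mean([(a - m1) * (b - m2) for a, b in zip(first, second)])
    return cov / (s1 * s2)


# Toy edit timestamps (arbitrary time units), for illustration only.
regular_edits = [0, 10, 20, 30, 40]
bursty_edits = [0, 1, 2, 3, 100, 101, 102, 103, 200]

print(burstiness(inter_event_times(regular_edits)))
print(burstiness(inter_event_times(bursty_edits)))
print(memory(inter_event_times(bursty_edits)))
```

The evenly spaced edits give the minimum burstiness of -1, while the clustered edits give a positive value.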

Using a fast Fourier transform, the paper finds that the 100 most active editors have signals at a 24-hour period (and associated harmonics), indicating that they follow a circadian pattern of daily revising, along with differences by day of week and hour of day. However, the 100 most-revised pages lack a similar peak in the power spectrum: there is no characteristic hourly, daily, weekly, etc. revision pattern. Despite these circadian patterns, editors' revision histories still show bursty patterns with long-tailed inter-event times across different time windows.
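
A minimal sketch of this kind of spectral check, on synthetic data. It uses a direct discrete Fourier transform from the standard library rather than an FFT routine (an FFT computes the same coefficients, only faster). The function names and the synthetic hourly edit counts are illustrative assumptions, not the paper's code or data.

```python
import cmath
import math


def power_spectrum(counts):
    """Power |X_k|^2 for k = 0..N//2 of the mean-removed series (direct DFT)."""
    n = len(counts)
    avg = sum(counts) / n
    centered = [c - avg for c in counts]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def dominant_period(counts):
    """Period, in samples, of the strongest nonzero-frequency component."""
    spectrum = power_spectrum(counts)
    k = max(range(1, len(spectrum)), key=lambda i: spectrum[i])
    return len(counts) / k


# Synthetic hourly edit counts over 14 days, peaking at 20:00 each day.
hours = 24 * 14
edits = [5 + 4 * math.cos(2 * math.pi * (h - 20) / 24) for h in range(hours)]

print(dominant_period(edits))
```

For this synthetic editor, the strongest component has a period of 24 samples, that is, 24 hours. A series with no daily rhythm would show no comparable peak at that position.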

The paper concludes by arguing, "before performing an action, we must overcome a “barrier”, acting as a cost, which depends, among many other things, on the time of day. However, once that “barrier” has been crossed, the time taken by that activity no longer depends on the time of day at which we decided to perform it. ... It could be related to some sort of queuing process, but we prefer to see it as due to resource allocation (attention, time, energy), which exhibits a broad distribution: shorter activities are more likely to be executed next than the longer ones."

Emerging trends based on Wikipedia traffic data and contextual networks

 * Reviewed by Brian Keegan

Google Trends is widely used in academic research to model the relationship between information seeking and other social and behavioral phenomena. However, Wikipedia pageview data can provide a superior, if underused, alternative: it has attracted some attention in public health and economic modeling, but not to the same extent as Google Trends. The authors cite the relative openness of Wikipedia pageview data, its semantic disambiguation, and its absolute counts of activity, in contrast to Google Trends' closed API, the semantic ambiguity of keywords, and relative query share data. On the other hand, Trends data (at a weekly level) goes back to 2004, while pageview data (at an hourly level) is only available from 2008.
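
To make the contrast between absolute counts and relative data concrete, here is a small sketch with made-up numbers. Rescaling each series so that its peak equals 100 is used as a simplified stand-in for relative, Trends-style data. The function name and both series are hypothetical.

```python
def rescale_to_peak(series):
    """Rescale a series so that its maximum value becomes 100."""
    peak = max(series)
    return [100 * value / peak for value in series]


# Hypothetical daily pageview counts for two articles.
popular_article = [1000, 2000, 4000]
niche_article = [10, 20, 40]

print(rescale_to_peak(popular_article))
print(rescale_to_peak(niche_article))
```

Both series rescale to the same values, so the hundredfold difference in audience, which absolute pageview counts preserve, is no longer visible.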

In a peer-reviewed paper published by PLoS ONE, a team of physicists performs a variety of time series analyses to evaluate changes in attention around the "big data" topic of Hadoop. Defining two key constructs, relevance and representation, based on interlanguage links as well as hyperlinks to and from other concepts, they examine changes in these features over time. In particular, changes in the articles' content and attention occurred in concert with the release of new versions and the adoption of the technology by new firms.

The time series analyses (and the terms used to refer to them) will be difficult for non-statisticians to follow, but the paper makes several promising contributions. First, it provides a number of good critiques of research relying exclusively on Google Trends data (outlined above). Second, it provides some methods for incorporating behavioral data from strongly related topics and examining changes in them over time in a principled manner. Third, the paper examines behavior across multiple language editions rather than focusing solely on the English Wikipedia. The paper points to ways in which Wikipedia is an important information source for tracking the publication and recognition of new topics.

"Hidden revolution of human priorities: An analysis of biographical data from Wikipedia"

 * Reviewed by Piotr Konieczny

This paper data-mines Wikipedia's biographies, focusing on individuals' longevity, profession and cause of death. The authors are not the first to observe that the majority of Wikipedia biographies are about sportspeople (half of them soccer players), followed by artists and politicians. But they do make some interesting historical observations: for example, that sport rises only in the 20th century (particularly from the 1990s), and that politics surpassed religion in the 13th century and remained ahead until it was itself surpassed by sport. The authors divide the biographies into public (politicians, businessmen, religion) and private (artists and sportspeople) and note that only in the last few decades has the second group started to significantly outnumber the first; they conclude that this represents a major shift in societal values, which they refer to as a "hidden revolution in human priorities". It is an interesting argument, though the paper unfortunately omits any discussion of some important topics, such as the possible bias introduced by Wikipedia's notability policies.

"Women through the glass-ceiling: gender asymmetries in Wikipedia"

 * Reviewed by Piotr Konieczny

This paper looks into gender inequalities in Wikipedia articles, presenting a computational method for assessing gender bias in Wikipedia along several dimensions. It touches on a number of interesting questions, such as whether the same rules are used to determine whether women and men are notable, whether there is linguistic bias, and whether articles about men and women have similar structural properties (e.g., similar meta-data and network properties in the hyperlink network).

The authors conclude that notability guidelines seem to be more strictly enforced for women than for men; that linguistic bias exists (e.g., one of the four words most strongly associated with female biographies is "husband", whereas such family-oriented words are much less likely to be found in biographies of male subjects); and that, as the majority of biographies are about men and articles about men tend to link more to men than to women, the visibility of female biographies is lowered (for example, in search engines like Google). They suggest that the Wikipedia community should consider lowering notability requirements for women (controversial) and adding gender-neutral language requirements to the Manual of Style (a much more sensible proposal).

Wikipedia influences medical decisionmaking in acute and critical care

 * Reviewed by Tilman Bayer

A survey of 372 anesthetists and critical care providers in Austria and Australia found that "In order to get a fast overview about a medical problem, physicians would prefer Google (32%) over Wikipedia (19%), UpToDate (18%), or PubMed (17%). 39% would, at least sometimes, base their medical decisions on non peer-reviewed resources. Wikipedia is used often or sometimes by 77% of the interns, 74% of residents, and 65% of consultants to get a fast overview of a medical problem. Consulting Wikipedia or Google first in order to get more information about the pathophysiology, drug dosage, or diagnostic options in a rare medical condition was the choice of 66%, 10% or 34%, respectively." (A 2012 literature review found that "Wikipedia is widely used as a reference tool" among clinicians.)

Other recent publications
A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.

Papers about medical content on Wikipedia and its usage

 * "How do Twitter, Wikipedia, and Harrison's Principles of Medicine describe heart attacks?" From the abstract: "For heart attacks, the chapters from Harrison's had higher Jaccard similarity to Wikipedia than Braunwald's or Twitter. For palpitations, no pair of sources had a higher Jaccard (token) similarity than any other pair. For no source was the Jaccard (token) similarity attributable to semantic similarity. This suggests that technical and popular sources of medical information focus on different aspects of medicine, rather than one describing a simplified version of the other."
 * "Information-seeking behaviour for epilepsy: an infodemiological study of searches for Wikipedia articles" From the abstract: "Fears and worries about epileptic seizures, their impact on driving and employment, and news about celebrities with epilepsy might be major determinants in searching Wikipedia for information."
 * "Wikipedia and neurological disorders" From the abstract: "We determined the highest search volume peaks to identify possible relation with online news headlines. No relation between incidence or prevalence of neurological disorders and the search volume for the related articles was found. Seven out of 10 neurological conditions showed relations in search volume peaks and news headlines. Six of these seven peaks were related to news about famous people suffering from neurological disorders, especially those from showbusiness. Identification of discrepancies between disease burden and health seeking behavior on Wikipedia is useful in the planning of public health campaigns. Celebrities who publicly announce their neurological diagnosis might effectively promote awareness programs, increase public knowledge and reduce stigma related to diagnoses of neurological disorders."
 * "Medical student preferences for self-directed study resources in gross anatomy" From the abstract: "To gain insight into preclinical versus clinical medical students' preferences for SDS resources for learning gross anatomy, [...] students were surveyed at two Australian medical schools, one undergraduate-entry and the other graduate-entry. Lecture/tutorial/practical notes were ranked first by 33% of 156 respondents (mean rank ± SD, 2.48 ± 1.38), textbooks by 26% (2.62 ± 1.35), atlases 20% (2.80 ± 1.44), videos 10% (4.34 ± 1.68), software 5% (4.78 ± 1.50), and websites 4% (4.24 ± 1.34). Among CAL resources, Wikipedia was ranked highest."
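
For readers unfamiliar with the Jaccard (token) similarity mentioned in the first abstract above, a minimal sketch follows. The tokenization rule (lowercased runs of letters, digits and apostrophes) and the two example sentences are assumptions made for illustration and are not taken from that paper.

```python
import re


def tokens(text):
    """Set of lowercased word tokens in a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    a, b = tokens(text_a), tokens(text_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty texts
    return len(a & b) / len(a | b)


source_1 = "chest pain and shortness of breath"
source_2 = "chest pain radiating to the arm"

print(jaccard(source_1, source_2))
```

The two sentences share 2 of their 10 distinct tokens, giving a similarity of 0.2. Because the measure only counts identical tokens, two texts can score low while still describing the same thing in different words.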

Papers analyzing community processes and policies

 * "Transparency, control, and content generation on Wikipedia: editorial strategies and technical affordances" From the abstract: "Even though the process of social production that undergirds Wikipedia is rife with conflict, power struggles, revert wars, content transactions, and coordination efforts, not to mention vandalism, the article pages on Wikipedia shun information gauges that highlight the social nature of the contributions. Rather, they are characterized by a “less is more” ideology of design, which aims to maximize readability and to encourage future contributions. ... Closer investigation reveals that the deceivingly simple nature of the interface is in fact a method to attract new collaborators and to establish content credibility. As Wikipedia has matured, its public notoriety demands a new approach to the manner in which Wikipedia reflects the rather complex process of authorship on its content pages. This chapter discusses a number of visualizations designed to support this goal, and discusses why they have not as yet been adopted into the Wikipedia interface."
 * "Policies for the production of content in Wikipedia, the free encyclopedia" From the abstract: "It is a case study with qualitative approach that had Laurence Bardin's content analysis as theoretical and methodological reference."
 * "Validity claims of information in face of authority of the argument on Wikipedia" From the abstract: "proposes to approach the claims of validity made by Jürgen Habermas in the face of the authority of the better argument. It points out that Wikipedia is built as an emancipatory discourse according to Habermas' argumentative discourse considering the process of discursive validation of information."
 * "Wikipedia and history: a worthwhile partnership in the digital era?"
 * "Is Wikipedia really neutral? A sentiment perspective study of war-related Wikipedia articles since 1945" From the abstract: "The results obtained so far show that reasons such as people’s feelings of involvement and empathy can lead to sentiment expression differences across multilingual Wikipedia on war-related topics; the more people contribute to an article on a war-related topic, the more extreme sentiment the article will express; different cultures also focus on different concepts about the same war and present different sentiments towards them."
 * "The heart work of Wikipedia: gendered, emotional labor in the world's largest online encyclopedia" (CHI 2015 Best Papers award, slides)
 * "Knowledge quality of collaborative editing in Wikipedia: an integrative perspective of social capital and team conflict" From the abstract: "Despite the abundant researches on Wikipedia, to the best of our knowledge, no one has considered the integration of social capital and conflict. [...] our study proposes the nonlinear relationship between task conflict and knowledge quality instead of linear relationships in prior studies. We also postulate the moderating effect of task complexity. [...] This paper aims at proposing a theoretical model to examine the effect of social capital and conflict, meanwhile taking the task complexity into account."

Papers about visualizing or mining Wikipedia content

 * "Visualizing Wikipedia article and user networks: extracting knowledge structures using NodeXL"
 * "Utilising Wikipedia for text mining applications" From the abstract: "Wikipedia ... has proven to be one of the most valuable resources in dealing with various problems in the domain of text mining. However, previous Wikipedia-based research efforts have not taken both Wikipedia categories and Wikipedia articles together as a source of information. This thesis serves as a first step in eliminating this gap and throughout the contributions made in this thesis, we have shown the effectiveness of Wikipedia category-article structure for various text mining tasks. ... First, we show the effectiveness of exploiting Wikipedia for two classification tasks i.e., 1- classifying the tweets being relevant/irrelevant to an entity or brand, 2- classifying the tweets into different topical dimensions such as tweets related with workplace, innovation, etc. To do so, we define the notion of relatedness between the text in tweet and the information embedded within the Wikipedia category-article structure."
 * "Integrated parallel sentence and fragment extraction from comparable corpora: a case study on Chinese-Japanese Wikipedia" From the abstract: "A case study on the Chinese–Japanese Wikipedia indicates that our proposed methods outperform previously proposed methods, and the parallel data extracted by our system significantly improves SMT [statistical machine translation] performance."
 * "How structure shapes dynamics: knowledge development in Wikipedia – a network multilevel modeling approach" From the abstract: "The data consists of the articles in two adjacent knowledge domains: psychology and education. We analyze the development of networks of knowledge consisting of interlinked articles at seven snapshots from 2006 to 2012 with an interval of one year between them. Longitudinal data on the topological position of each article in the networks is used to model the appearance of new knowledge over time. [...] Using multilevel modeling as well as eigenvector and betweenness measures, we explain the significance of pivotal articles that are either central within one of the knowledge domains or boundary-crossing between the two domains at a given point in time for the future development of new knowledge in the knowledge base." (cf. earlier paper coauthored by the same researchers: "Knowledge Construction in Wikipedia: A Systemic-Constructivist Analysis")