Wikipedia:Wikipedia Signpost/2016-05-02/Recent research



"Who did what: editor role identification in Wikipedia"

 * Reviewed by Guillaume Paumier

"Who did what: editor role identification in Wikipedia" is the title of an upcoming paper to be presented at the International Conference on Web and Social Media (ICWSM) in Cologne, Germany. The work presented in the paper is a collaboration between researchers from Carnegie Mellon University and the Wikimedia Foundation. The authors' goal was to analyze edits from the English Wikipedia to identify roles played by editors and to examine how those roles affected the quality of articles.

Identifying roles of participants in online communities helps researchers and practitioners better understand the social dynamics that lead to healthy, thriving communities. This line of research started in the 2000s, focused on Usenet groups, before expanding to wiki communities like Wikipedia.

The paper covers three stages of work:
 * determining edit categories, to describe the types of changes made by editors;
 * modeling editor roles represented by those categories;
 * measuring the quality of a set of articles over time, and determining if its evolution is linked to the roles played by their editors.

For the first stage, the authors built on previous publications that aimed at classifying Wikipedia edits, in particular the work by Daxenberger et al. Classifying edits usually starts by separating them by namespace. A more granular approach considers not just the namespace, but the content of the change. This was the method chosen here for edits in the main namespace, with the possibility of assigning a revision to multiple categories: for example, a single revision can entail both "grammar" and "template insertion" changes. Those categories were operationalized using an ensemble method classifier based on the content and metadata of the edit.
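The multi-label idea can be illustrated with a small sketch. It is not the authors' classifier: the revisions, category labels and the `category_profile` helper are hypothetical, and the labels are simply assumed to be given. It only shows how one revision can count towards several categories, and how edits outside the main namespace are recorded by namespace alone.

```python
from collections import Counter

# Hypothetical, already-labelled revisions: (editor, namespace, edit categories).
# In the paper the categories come from a trained classifier; here they are given.
revisions = [
    ("Alice", "main", {"grammar", "template insertion"}),
    ("Alice", "main", {"grammar"}),
    ("Bob", "main", {"reference modification"}),
    ("Bob", "talk", set()),
]


def category_profile(revisions):
    """Count, per editor, how often each edit category (or namespace) occurs."""
    profiles = {}
    for editor, namespace, categories in revisions:
        counts = profiles.setdefault(editor, Counter())
        if namespace == "main":
            # A single revision may contribute to several categories at once.
            counts.update(categories)
        else:
            # Outside the main namespace only the namespace itself is recorded.
            counts[f"namespace:{namespace}"] += 1
    return profiles


profiles = category_profile(revisions)
print(profiles["Alice"])
```

Per-editor counts of this kind are the sort of input a role model can then work from.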

Then, the authors derived roles based on patterns that emerged from the classes of edits, using the latent Dirichlet allocation method (LDA). This method is traditionally used in natural language processing to identify topics making up a document. Here, the authors used the method to identify roles making up a user, positing that a user is a mixture of roles in the same way that a document is a mixture of topics. In addition to edits, they trained the LDA model using other information such as reverts, and edits in other namespaces.
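The analogy between documents and editors can be made concrete with a toy LDA fit. The sketch below is a minimal collapsed Gibbs sampler written for illustration only: the data, the hyperparameters (`alpha`, `beta`), the number of roles and the function name `fit_role_mixtures` are all assumptions, and nothing here reproduces the authors' implementation or settings. Each editor is treated as a "document" whose "words" are edit-category labels.

```python
import random
from collections import Counter


def fit_role_mixtures(editors, n_roles=2, alpha=0.1, beta=0.1, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    editors maps an editor name to a list of edit-category tokens.
    Returns, for each editor, a list of role weights that sums to 1.
    """
    rng = random.Random(seed)
    vocab_size = len({t for tokens in editors.values() for t in tokens})
    editor_role = {e: [0] * n_roles for e in editors}   # tokens per role, per editor
    role_token = [Counter() for _ in range(n_roles)]    # token counts per role
    role_total = [0] * n_roles                          # total tokens per role
    assignment = {}

    # Start from a random role for every token.
    for editor, tokens in editors.items():
        assignment[editor] = []
        for token in tokens:
            role = rng.randrange(n_roles)
            assignment[editor].append(role)
            editor_role[editor][role] += 1
            role_token[role][token] += 1
            role_total[role] += 1

    for _ in range(n_iter):
        for editor, tokens in editors.items():
            for i, token in enumerate(tokens):
                # Remove the token's current assignment from the counts.
                role = assignment[editor][i]
                editor_role[editor][role] -= 1
                role_token[role][token] -= 1
                role_total[role] -= 1
                # Resample its role from the conditional distribution.
                weights = [
                    (editor_role[editor][r] + alpha)
                    * (role_token[r][token] + beta)
                    / (role_total[r] + vocab_size * beta)
                    for r in range(n_roles)
                ]
                role = rng.choices(range(n_roles), weights=weights)[0]
                assignment[editor][i] = role
                editor_role[editor][role] += 1
                role_token[role][token] += 1
                role_total[role] += 1

    # Smoothed estimate of each editor's mixture over roles.
    return {
        editor: [
            (editor_role[editor][r] + alpha) / (len(tokens) + n_roles * alpha)
            for r in range(n_roles)
        ]
        for editor, tokens in editors.items()
    }


# Hypothetical editors described by the categories of their edits.
editors = {
    "Alice": ["grammar"] * 8 + ["template insertion"] * 2,
    "Bob": ["revert"] * 9 + ["grammar"],
    "Carol": ["grammar"] * 5 + ["revert"] * 5,
}
mixtures = fit_role_mixtures(editors)
for name, weights in mixtures.items():
    print(name, [round(w, 2) for w in weights])
```

The output is a weight per role for each editor rather than a single label, which is the "mixture of roles" view described above. The roles themselves are unnamed; attaching a name to each one is a separate, interpretive step.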

They ended up with eight roles: social networker, fact checker, substantive expert, copy editor, wiki gnome, vandal fighter, fact updater, and Wikipedian. They found that most editors play between one and three of those roles. To validate the roles, they attempted to predict edit categories based on the editors' roles, with mixed results.

Last, the authors examined whether the roles of editors were correlated with the evolution of the quality of a set of articles. They measured article quality twice, six months apart, using an existing model that assigns a score in Wikipedia's qualitative assessment scale based on the article's measurable characteristics.

They found some correlation between the difference in quality and the roles involved, taking into account control variables like the starting quality score. Their results suggest that "the activities of different types of editors are needed at different stages of article development". For example, "as articles increase in quality, the substantive content added by substantive experts is needed less" but "the cleanup activities by Wiki Gnomes become more important".
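One simple way to see what "taking into account control variables" means is a partial correlation: remove from both the quality change and the role activity whatever a control (here, starting quality) explains, then correlate what is left. This is a sketch of that general idea only, not the statistical model used in the paper. The function names and the numbers are hypothetical.

```python
from statistics import mean


def residuals(y, x):
    """Residuals of a simple least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def partial_correlation(outcome, predictor, control):
    """Correlation of outcome and predictor after removing the control's linear effect."""
    return pearson(residuals(outcome, control), residuals(predictor, control))


# Hypothetical per-article data: starting quality score, share of activity by
# one role, and the change in quality score six months later.
start_quality = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
role_activity = [0.1, 0.3, 0.2, 0.6, 0.5, 0.9]
quality_change = [0.9, 0.8, 0.5, 0.7, 0.3, 0.6]

print(round(partial_correlation(quality_change, role_activity, start_quality), 3))
```

With a single control this matches what a regression with two predictors would attribute to the role variable; with several controls a multiple regression would be the more usual tool.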

One limitation acknowledged by the authors is that their detailed edit classification was only performed on edits made in the main namespace (Wikipedia articles). For other edits, they only considered the namespace itself. Some of those other namespaces host very varied activities, and applying the same level of detail to them would presumably yield a richer, and possibly more accurate, taxonomy of roles.

Some choices in the role nomenclature are a little surprising. For example, it seems odd to have one role simply called "Wikipedians", or "reference modification" being a behavior representative of "social networkers". Translating patterns of data (structural signatures) into words (roles) is a difficult exercise, and often a weak link in such analyses.

In conclusion, the article is a welcome contribution to the field of Wikipedia research, in particular of editor roles on Wikipedia. Many previous role identification efforts have used a simplified approach where editors were reduced to their main role. In contrast, here the authors went further and considered editors as a mixture of roles, which is expected to provide a more accurate representation of human behavior.

Since the authors mention task recommendation as a possible application of their work, it would be particularly interesting to examine how the role composition of a user evolves over time. There may be patterns in the evolution of users' roles during their life cycle as editors. Uncovering such patterns could lead to more relevant task recommendations, and help guide editors along their contribution journey.

"Verifying social network models of Wikipedia knowledge community"

 * Reviewed by Brian C. Keegan

This paper was published in the Information Sciences journal and was co-authored by researchers from several Polish universities. The paper's central research question is "are the popular assumptions about the social interpretations of networks created from the edit history valid?" The paper evaluates four different methods for constructing complex networks from Wikipedia data and compares these constructs with survey results about Polish Wikipedians' self-reported relationships. While there is a strong correspondence between all the different network types, networks derived from Wikipedians' talk pages map most clearly onto Wikipedians' feelings of acquaintanceship.

The paper examines four kinds of relationships: co-edits to article and user talk pages (acquaintanceship), co-edits in the vicinity of other users' text (trust), reverts of editors' revisions (conflict), and co-edits to articles in the same category (shared interest). Crucially, the paper extends prior research using these network constructs by conducting a respondent-driven survey of Wikipedians to ask them to name other Wikipedians they consider to be acquaintances, trusted, conflict-prone, or having the same interest. The survey respondents tended to be more experienced than typical users and so responses were re-weighted based on population frequency.
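As an illustration of the first kind of network, the sketch below builds a weighted co-edit graph from a list of talk page edits: two editors are linked once for every page they have both edited. The edit log and the `co_edit_network` helper are hypothetical, and the paper's actual construction (for example its weighting or time windows) may well differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical edit log: (talk page, editor).
talk_edits = [
    ("Talk:A", "Alice"), ("Talk:A", "Bob"), ("Talk:A", "Alice"),
    ("Talk:B", "Bob"), ("Talk:B", "Carol"), ("Talk:B", "Alice"),
]


def co_edit_network(edits):
    """Return edge weights: number of pages that each pair of editors both edited."""
    editors_by_page = {}
    for page, editor in edits:
        editors_by_page.setdefault(page, set()).add(editor)

    edges = Counter()
    for editors in editors_by_page.values():
        # Sorting gives every undirected pair a single canonical key.
        for pair in combinations(sorted(editors), 2):
            edges[pair] += 1
    return edges


network = co_edit_network(talk_edits)
for pair, weight in sorted(network.items()):
    print(pair, weight)
```

The revert and shared-category networks could be sketched the same way by changing what counts as a shared event between two editors.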

The paper goes on to use a variety of machine learning methods to evaluate the strength of the relationship between different behavioral features and the self-reported relationships. First, the authors find that naive constructions of these networks from behavioral data only end up predicting one kind of relationship (discussion/acquaintanceship). Using more complex sets of temporal features, such as days since last edit and category similarity, to account for biases in self-reporting yields only marginal improvements in model performance. The authors conclude by suggesting that the correspondence between relationships imputed from observed Wikipedia data and the relationships reported by Wikipedians themselves is weak.
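A much cruder check than the machine learning evaluation described above, but one that conveys the same question, is to compare the set of edges inferred from behavior with the set of edges named in the survey. The sketch below computes precision and recall for two such edge lists. The function name and the data are hypothetical, and this is not the evaluation procedure used in the paper.

```python
def edge_agreement(behavioral_edges, reported_edges):
    """Precision and recall of behavior-derived edges against self-reported ones.

    Edges are treated as undirected, so ("Alice", "Bob") equals ("Bob", "Alice").
    """
    behavioral = {frozenset(edge) for edge in behavioral_edges}
    reported = {frozenset(edge) for edge in reported_edges}
    overlap = behavioral & reported
    precision = len(overlap) / len(behavioral) if behavioral else 0.0
    recall = len(overlap) / len(reported) if reported else 0.0
    return precision, recall


# Hypothetical edges: inferred from co-edits versus named by survey respondents.
inferred = [("Alice", "Bob"), ("Bob", "Carol"), ("Alice", "Carol")]
surveyed = [("Bob", "Alice"), ("Carol", "Dave")]

precision, recall = edge_agreement(inferred, surveyed)
print(round(precision, 2), round(recall, 2))
```

Low values on either measure would point in the direction of the authors' conclusion: ties visible in the edit history need not be ties that editors themselves recognize.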

The survey methods employed in this paper to generate the ground-truth networks can be criticized for the lack of randomness in the population or of generalizability across other wiki communities. Similarly, there are well-known limits on informant accuracy, compounded by the often impersonal nature of the editing interface and process. Nevertheless, this research suggests that researchers combining behavioral data and social network methods may be making faulty assumptions about how strongly the observed relationships are actually perceived by the Wikipedians themselves.

"Tracking interactions across business news, social media, and stock fluctuations"

 * Reviewed by Brian C. Keegan

This study from researchers at the University of Helsinki examines cross-correlations between Wikipedia pageviews, news media mentions, and company stock prices. It extends prior work that developed a trading strategy based on Wikipedia pageviews to assess stock market moves, by extracting entities about companies, products, and dates from news media mentions and matching them to Wikipedia entries. An exploratory case study demonstrates that there are some correlations across these three indices and that the strongest cross-correlations are observed without a time lag and for the same company. However, in a subsequent case study involving 11 large companies, the strongest cross-correlations were for The Home Depot and Netflix. That news mentions, Wikipedia pageviews, and stock performance are correlated is neither theoretically nor empirically surprising, but the paper's work on identifying entities and mapping them to Wikipedia articles could have some potential. Research like this, which compares correlations across dozens of entities and time series, is subject to the multiple comparisons problem, and there is also a large body of methods in mathematical finance that could be used to extend these findings further.
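For readers unfamiliar with cross-correlation at a time lag, the following sketch correlates one daily series with a shifted copy of another. The series, the variable names and the `lagged_correlation` helper are hypothetical. The toy data is deliberately built with a one-day delay to exercise the function and says nothing about the paper's findings.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def lagged_correlation(x, y, lag):
    """Correlate x[t] with y[t + lag]; a positive lag means x leads y.

    Assumes both series have the same length and daily alignment.
    """
    if lag >= 0:
        xs, ys = x[:len(x) - lag], y[lag:]
    else:
        xs, ys = x[-lag:], y[:len(y) + lag]
    return pearson(xs, ys)


# Hypothetical daily series; stock_series repeats pageviews one day later.
pageviews = [1, 3, 2, 5, 4, 6, 8, 7]
stock_series = [0, 1, 3, 2, 5, 4, 6, 8]

correlations = {lag: lagged_correlation(pageviews, stock_series, lag)
                for lag in range(-2, 3)}
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

Scanning many lags and many company pairs in this way produces a large number of correlation estimates, which is exactly where the multiple comparisons concern arises.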

New event calendar
A calendar of events (mostly research conferences) relevant to Wikimedia-related research has recently been set up on Meta-wiki. Notable entries for this month include CHI 2016 and ICWSM-16.

"Detection of text-based advertising and promotion in Wikipedia by deep learning method"

 * Reviewed by Tilman Bayer

This conference paper presents a method to automatically detect promotional content in Wikipedia. It appears to aim at articles, but the actual method focuses on user pages.

The authors highlight the fact that their method is purely text-based, whereas "[c]urrently most researches about spamming in Wikipedia are focusing on editing behavior and making use of user’s edit history to do feature-based judging." (See, however, our earlier coverage of a related paper that reported success using stylometric, i.e. text-based features: "Legendary, acclaimed, world-class text analysis method finds you promotional Wikipedia articles really easily")

The researchers explain that a "traditional bag-of-words document vector representation" (counting only word frequencies) is insufficient. Instead, they "employ a deep learning method to obtain a word vector for each word and then apply a sliding window on each document to gradually gain the document vector." The classifier was trained on a dataset of user pages speedily deleted under criterion "G11. Unambiguous advertising or promotion", compared to user pages of administrators which were assumed to be advertising-free. In tests (which apart from Wikipedia user pages also included a dataset of web page ads drawn from other sites) it "produced better performance than the bag-of-words model in both precision and recall measurements."
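The quoted description leaves open how exactly the sliding window turns word vectors into a document vector. The sketch below shows one plausible reading, offered purely as an assumption: average the word vectors inside each window, then average the window results. The two-dimensional `word_vectors` are invented for the example (a real system would learn them from text), and `document_vector` is a hypothetical helper, not the authors' method.

```python
# Hypothetical 2-dimensional word vectors; real ones would be learned from a corpus.
word_vectors = {
    "best": [0.9, 0.1],
    "deal": [0.8, 0.2],
    "buy": [0.7, 0.0],
    "editor": [0.1, 0.9],
    "history": [0.0, 0.8],
}


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def document_vector(tokens, window=2, dim=2):
    """Average word vectors per sliding window, then average over all windows.

    Tokens without a known vector are skipped. The averaging scheme is an
    assumption made for illustration.
    """
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return [0.0] * dim
    if len(vectors) <= window:
        windows = [vectors]
    else:
        windows = [vectors[i:i + window] for i in range(len(vectors) - window + 1)]
    return average([average(w) for w in windows])


promo = document_vector(["buy", "best", "deal"])
plain = document_vector(["editor", "history"])
print([round(v, 3) for v in promo], [round(v, 3) for v in plain])
```

Unlike a bag-of-words count, vectors of this kind place related words close together, so two pages can resemble each other without sharing exact vocabulary. A classifier would then be trained on such document vectors.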

Other recent publications
A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.
 * "Taking Online Search Queries as an Indicator of the Public Agenda: The Role of Public Uncertainty" From the abstract: "... the influence of media coverage on Wikipedia searches concerning two issues is compared: one issue with uncertainty (the Enterohaemorrhagic Escherichia coli [EHEC] epidemic), and one without uncertainty (unemployment). Analyses show much stronger correlations in the case of EHEC, which suggests that online search behavior may especially be used as an indicator of the public agenda in case public uncertainty exists." (see also: earlier publication by the same authors)
 * "Public relations interactions with Wikipedia" From the abstract: "The purpose of this paper is to consider the relevance of the institutional analysis and development (IAD) framework (Ostrom, 1990) in understanding the incentives for public relations (PR) practitioners’ interactions with Wikipedia, and other common-pool media. [... It] applies the economics theory of commons governance to two case studies of PR interactions with Wikipedia. [...] The analysis concludes that commons governance theory identifies the downside risks of opportunistic behaviour by PR practitioners in their interactions with media commons such as Wikipedia. ... The economic value of information held by public relations professionals has been undermined by the collaborative nature of common pool media, which has consequences for the role of public relations."
 * "Digital Library Citation Parsing to Wikipedia Reference Analysis" From the abstract: "... we empirically verify the potential of automatic identification of incomplete references in Wikipedia with our machine learning based software. Second we propose a normalization method on the identified result to group identical references. We evaluate the reference parsing performance of our system on author and title fields then provide Wikipedia citation statistics on the identified and normalized result."
 * "Collaboration of Pre-Service Early Childhood Teachers in Dyads for Wikipedia Article Authoring" From the abstract: "A prerequisite for pre-service teachers, among others, is their preparation to apply collaborative methodologies [...] In this paper, we discuss an approach involving collaboration of pre-service early childhood teachers in dyads while working on Wikipedia article authoring. To the best of our knowledge, there is no other such approach combining all these characteristics. [...] From an approach such as ours, benefits can be derived for students themselves, society and Wikipedia."
 * "A New Epistemic Culture? Wikipedia as an Arena for the Production of Knowledge in Late Modernity"
 * "Wikipedia, democracy and local elections in São Paulo: a study of the developing of articles edited during the election campaign in 2012" From the English abstract: "This article has as main objective to understand how during the campaign to elect the mayor of São Paulo city in 2012 users have accessed articles related to the three leading candidates – Russomanno, Haddad and Serra – and have edited them. We consider the period from 2008 to 2012 as a reference to explore variables such as the page view indices and the number of editions provided by frequent users. It was concluded that the intensification of the political battle during the election period affects the number of access to such articles, their editions, and the nature of the interaction between registered editors."
 * "Hot news detection using Wikipedia" From the blog post: "I show how we can use the number of Wikipedia article page views to determine if a news story is hot. This approach is fully data-driven and does not need any human supervision."