Wikipedia:Wikipedia Signpost/2011-07-25/Recent research

''This is the third overview of recent published research on Wikipedia and other Wikimedia projects (previous issues: June 6, April 11), intended to become a monthly feature published jointly with the Wikimedia Foundation Research Committee. In addition to a focus on covering research by academics outside Wikimedia, this issue includes contributions funded by the Wikimedia Foundation. If you want your research to be featured in this monthly newsletter, you can tell us about your work by submitting it to the Wikimedia Research Index.''

Edit wars and conflict metrics
A study covered in the previous edition of the research newsletter was extended and published by the authors on arXiv. The authors report a new method for classifying how disputed a Wikipedia article is, in order to detect controversies and edit wars. At its core, the method looks at pairs of editors who have reverted each other and uses their respective edit counts to define an overall metric of conflict. Although this formula is not immediately intuitive, the authors describe it with the help of special diagrams called "revert maps", which depict such pairs of editors in the Cartesian plane. The authors use this classifier to select two samples of pages, on disputed and non-disputed topics respectively, and analyze the time series of revisions to these pages. While they find that both time series are characterized by bursts of user activity, they claim there is a qualitative difference between the two, although their analysis appears to lack any form of statistical hypothesis testing. They apply a priority-based model of editor activity that has already been proposed to explain human activity on the web, and find two distinctive patterns of activity that can help classify "good" guys vs. "bad" guys.
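
As a rough illustration of the pair-based idea described above, the following Python sketch scores an article from a list of reverts. The weighting used here (the smaller of the two editors' edit counts, summed over all mutually reverting pairs) is an assumption made purely for illustration and is not taken from the paper; the function name and input format are likewise hypothetical.

```python
from collections import defaultdict


def conflict_score(reverts, edit_counts):
    """Toy conflict score built from mutually reverting editor pairs.

    reverts:     iterable of (reverter, reverted) user-name tuples
    edit_counts: dict mapping user name -> number of edits to the article

    Assumption (not from the paper): each mutual pair contributes the
    smaller of the two editors' edit counts, so that a clash between two
    established editors weighs more than one involving a drive-by account.
    """
    # Record who reverted whom (as sets, so repeated reverts count once).
    reverted_by = defaultdict(set)
    for reverter, reverted in reverts:
        if reverter != reverted:  # ignore self-reverts
            reverted_by[reverter].add(reverted)

    # A pair is "mutual" when each editor has reverted the other.
    mutual_pairs = {
        frozenset((a, b))
        for a, targets in reverted_by.items()
        for b in targets
        if a in reverted_by.get(b, ())
    }

    # Sum the per-pair weights into a single article-level score.
    return sum(min(edit_counts.get(u, 0) for u in pair) for pair in mutual_pairs)
```

For example, `conflict_score([("A", "B"), ("B", "A"), ("A", "C")], {"A": 10, "B": 4, "C": 7})` returns 4: only A and B have reverted each other, and B, the less active of the two, has 4 edits.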

The anatomy of a Wikipedia talk page
Several pieces over the past month have focused on the structure and nature of social interaction on Wikipedia's discussion pages, from both quantitative and qualitative perspectives.

A study conducted by a team of researchers based in Milan and Barcelona, presented last week at ICWSM '11, looks into the properties of the social interaction of participants in discussions on talk pages. The paper highlights a number of methodological issues in studying social network properties in Wikipedia. Social ties in Wikipedia are implicit, insofar as there is no representation of an explicit link between two Wikipedia users; an implicit social network can only be inferred from conversations between users. However, inferring such networks in Wikipedia is challenging for two main reasons: the lack of structure of talk pages (which makes conversations hard to parse), and the dispersal of discussion threads, both within a page and over multiple pages (for example, an article talk page plus a variable number of personal user talk pages). Despite these difficulties, the study analyzes the properties of two types of social networks centered on article discussions (those on article talk pages and those that focus on an article but take place via replies on user talk pages) and a user-centric social network (that is, the network defined by direct messages left by users on their talk pages). The three networks show interesting dissimilarities in terms of the in- and out-degree of their nodes and in the proportion of overlap between their edges, suggesting that user- and article-centered communications are supported by substantially different networks.

The paper moves on to examine the degree assortativity of these networks, that is, the tendency of users to create links with other users having a similar number of links. A striking difference emerges between conversations on Slashdot, which are characterized by strong assortativity, and discussion networks in Wikipedia, which display a systematic disassortativity, an indication of the specificity of social interactions in Wikipedia compared with other social media. As the authors summarize, "Wikipedians who reply to many other users in article talk pages tend to interact mostly with users having few connections, i.e. newbies and inexperienced users, while the Wikipedians who receive replies from many users tend to interact preferentially with each other."

The study then considers the depth and popularity of article-centered discussions, and identifies metrics of the contentiousness of these discussions based on their depth and the number of mutual replies among users participating in the same thread. The research characterizes the size, frequency and structure of discussions across different article categories and finds that although “Geography” and “History” together account for almost half of all discussions in the English Wikipedia, they tend to host shallow threads, whereas “Philosophy”, “Law”, “Language” and “Belief” are characterized by the deepest discussions and involve the largest number of participants. Two of the authors gave a presentation at last month's Hypertext 2011 conference in Eindhoven: "Co-authorship 2.0: patterns of collaboration in Wikipedia".

A group of researchers based at the University of Washington released an annotated corpus of discussions from Wikipedia talk pages encoding two types of social acts: alignment moves and authority claims. In the authors' own words, "an authority claim is a statement made by a discussion participant aimed at bolstering their credibility in the discussion. An alignment move is a statement by a participant which explicitly positions them as agreeing or disagreeing with another participant or participants regarding a particular topic". Studying discussions through the lens of authority and alignment can help to shed light on consensus-building strategies used by participants in Wikipedia discussions. The authors contend that the dataset offers qualitative materials that can be built upon to produce computational models of online debates.

The data spans 365 discussions that occurred on 47 talk pages between 2002 and 2008, involving a total of 1,509 editors. After introducing the corpus, the study compares editor activity metrics with the propensity to adopt one of the above social strategies. The authors introduce an editor’s v-index (or veteran index), defined as the greatest v such that the editor has made at least v edits within the past v months, and report that this indicator of editor activity positively correlates with the proportion of authority claims made in a discussion. Making an authority claim makes a user "significantly more likely to be the target of an alignment move within the subsequent 10 turns compared to turns that did not contain any claims".

Researchers from the National University of Ireland, Galway presented work in progress from a project aimed at understanding Wikipedia coordination spaces and costs. In a paper presented earlier this year at SAC '11, the authors discuss the results of a small series of semi-structured user interviews with Wikipedia administrators and editors. The results point at a number of drawbacks in the design of Wikipedia talk pages, suggesting that editors find it hard to keep up to date with temporally sparse discussions that are often scattered across multiple pages. The interviews suggest that talk pages often become the target of support requests by new editors that go unnoticed. The lack of connection between discussions and the article itself (for example, the lack of links between threads and specific sections or topics of the article) also emerges as one of the main weaknesses of Wikipedia talk pages. In the remainder of the paper the authors introduce a lightweight solution to allow the effective categorization of comments posted on article talk pages by semantically enriching them with RDF markup. This markup can then be exposed to end users with the aid of a JavaScript bookmarklet, manipulated and exported via SPARQL, and potentially used to generate granular notifications. In a poster presented last month at WebSci '11, the same team of researchers gives an overview of work in progress on AfD discussions and illustrates with a diagram the complexity of deletion discussions and procedures in the English Wikipedia.
 * Wikipedia discussions shallow in geography and history, but deep in philosophy, law, language and beliefs.
 * Building consensus in talk pages: authority and alignment.
 * Shortcomings in the design of Wikipedia talk pages.
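
The v-index defined in the University of Washington study above (the greatest v such that an editor has made at least v edits within the past v months) can be computed directly from a list of edit timestamps. The Python sketch below is a minimal illustration, not the authors' code: the function name is hypothetical, and a month is assumed to be 30 days because the definition quoted here does not say how months are measured.

```python
from datetime import datetime, timedelta

DAYS_PER_MONTH = 30  # assumption: the definition does not specify month length


def v_index(edit_times, now):
    """Greatest v such that at least v edits fall within the past v months.

    edit_times: iterable of datetime objects, one per edit
    now:        reference datetime from which "the past v months" is measured
    """
    # Age of each edit in whole days; edits dated after `now` are ignored.
    ages_in_days = [(now - t).days for t in edit_times if t <= now]

    best = 0
    # v can never exceed the total number of edits.
    for v in range(1, len(ages_in_days) + 1):
        window = v * DAYS_PER_MONTH
        edits_in_window = sum(1 for age in ages_in_days if age <= window)
        if edits_in_window >= v:
            best = v
    return best
```

Because an editor can fail the test for a small v yet pass it for a larger one (for instance, five edits all made four months ago), the loop checks every candidate rather than stopping at the first failure.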

Wikipedians as "Janitors of Knowledge"
In a paper titled "Janitors of Knowledge: Constructing Knowledge in the Everyday Life of Wikipedia Editors", researcher Olof Sundin of Lund University applies concepts from science and technology studies to an online ethnographic study of the Swedish Wikipedia community, focusing in particular on the role of references.

He conducted interviews with eleven active users of the Swedish Wikipedia (out of 20 contacted via e-mail) who had given "informed consent according to the recommendations of the Swedish Research Council". Their activity, as well as discussion on the village pump and on the talk pages of some articles, was observed from August 2009 to February 2010. (For privacy reasons, the paper does not link diffs of the users' comments.) The informants were between 20 and 50 years old, with diverse jobs and outside interests. Among other observations, the paper states that "For most of the informants the watch-list ... is the starting point for their [everyday] activities", and that "Wikipedia is also a place for identity construction, .... For Wikipedia editors, to edit is not just something you do, it is also a part of who you are". The title refers to the finding that "Cleaning work [e.g. reverting vandals] seems to be the central activity for almost all of the participants" of the interviews. The informants state that citing references has become more important on Wikipedia in recent years, as also evidenced by the introduction (in November 2009) of a requirement to cite at least one reference in the criteria for inclusion of new articles in a "New Written Articles of the Week" page (similar to the English Wikipedia's Did You Know). One section is devoted to Wikipedia's "hierarchy of references" (by reliability), mentioning the Swedish Wikipedia's equivalent of WP:RS.

As a theoretical framework, Sundin uses an actor-network theory interpretation of Wikipedia, which he explains as follows: "Within such a perspective, the editors, form and functions, core policies, guidelines of Wikipedia, its millions of articles and discussions, references, and users around the world can all be seen as actors, as they make each other do something; they construct, uphold and transform Wikipedia as we know it. An actor, for instance a functional feature in Wikipedia called the watch-list, that makes it easier for the editors to scan new contributions, or a policy document, makes other actors act in a particular way. ... Some actors have a more central role than others and some of these, if we draw on Callon (1986), are so central that they can be called obligatory passage points. An obligatory passage point can be thought of as a threshold that other actors need to pass or adjust to." Sundin identifies the Verifiability policy as such an obligatory passage point in Wikipedia's network of actors.

Use of Wikipedia among law students: a survey
An article in The Law Teacher titled "Embracing Wikipedia as a research tool for law: to Wikipedia or not to Wikipedia?" describes an anonymous survey among 101 Australian students (30 senior secondary high school students enrolled in legal studies, and 71 law degree students in their first and second year at the University of Southern Queensland) about their use and perceptions of Wikipedia. Their results indicate "that the majority (78%) of all students surveyed are currently using Wikipedia for some form of legal (30%) or other research (37%) or as a source of general information (11%)." One of the 101 students admitted to having vandalized Wikipedia articles, while two said they had corrected errors in Wikipedia. The use of Wikipedia for legal research among the first-year university students was much lower than among the high-school students, which the authors conjecture is "a result of legal research skills training and warnings against its use, and perhaps even a result of cultural adaptation. Seventy-eight percent of the first year law students surveyed acknowledged that Wikipedia can be unreliable and/or inaccurate." However, Wikipedia usage for legal or other research increased again among the second-year university students, which the authors surmise could have to do with the students becoming "a little more streetwise within the university context and [finding] the convenience of Wikipedia appealing."

Apart from the poll results, the paper contains a small literature survey about "Wikipedia as a teaching and learning resource", observing that "the use of wikis in legal education is in its infancy. Several of the case studies in the literature reported positive outcomes," and qualitative results from an "informal preliminary investigation into academic perceptions of Wikipedia as a research source in law" ("All the academics consulted considered Wikipedia an unreliable source for legal information ... Some acknowledged a role for Wikipedia as a source for legal or incidental background information" with qualifications about accuracy and reliability). Still, "the authors argue that using Wikipedia as a tertiary source for assimilating broad overview information, for both legal and incidental research, to define and identify keywords for further research, and as a link to other resources, is acceptable when the issues surrounding the discerning use of any secondary source, peer reviewed or not, are fully understood", and that "Academics can and should contribute to Wikipedia either directly, through the contribution of research, or indirectly, through the mentoring of student contributions which can be incorporated into course content and assessments." Among other conclusions, the authors suggest "encouraging universities to develop policies consistent with academic contribution to Wikipedia".

Miscellaneous

 * Turning back Wikipedia's clock: In a paper titled "Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History", three researchers from the Darmstadt University of Technology present software that allows easy access to the state of Wikipedia at a particular point in time, both for single article revisions and for whole history dumps up to that moment. As motivation, they note that large-scale access to single revisions via Wikipedia's own API is inefficient since data needs to be transmitted over the Internet, whereas the downloadable XML dumps provided by the Wikimedia Foundation are in a format that doesn't allow easy access to single revisions. They emphasize the importance of these dumps for Natural Language Processing analyses of Wikipedia, and note that the reproducibility of such research is jeopardized by the fact that "older snapshots [are] becoming unavailable as there is no official backup server." The authors' solution is realized as an extension of the existing Java Wikipedia Library (JWPL). To store the dump in a format that allows fast access to revisions but still saves space, they developed their own diff algorithm based on a longest common substring search.
 * On his personal blog, Paolo Massa (User:Phauly) gave a "Report of the ACM Hypertext 2011 conference" from the perspective of a Wikipedia researcher.
 * The inaugural issue of Critical Studies in Peer Production, a new open access academic journal, published an article by Mathieu O'Neil titled: "The sociology of critique in Wikipedia". All contents from this journal are CC-BY licensed.
 * A blog posting titled "Who writes Wikipedia? An information-theoretic analysis of anonymity and vandalism in user-generated content" referenced a widely cited study by Aaron Swartz (see also this week's "In the news"), who in 2006 found that anonymous users contributed much more of Wikipedia's content than the core of registered users. Instead of examining the text that survived to the current version, the blogger looked at reverted/unreverted edits as a crude measure of quality, and instead of counting edits, measured "the information-theoretic gain in each revision", as measured by LZMA compression. For performance reasons, the analysis was restricted to pages starting with the letter "M". Among various other findings, the post states that "Registered users dipped to contributing as much vandalism as content in 2007, and have taken an upswing to over three times as much good content. Anonymous users dipped to contributing as much vandalism as content in 2005, and through 2010 are contributing roughly twice as much vandalism as content".
 * Using expertise credits from Citizendium to recommend Wikipedia articles: An article in this month's issue of the "Journal of Information Processing", titled "Classification of Recommender Expertise in the Wikipedia Recommender System", reports improvements in the existing "Wikipedia Recommender System", a "collaborative filtering system with trust metrics, i.e., it provides a rating of articles which emphasizes feedback from recommenders that the user has agreed with in the past", by considering the recommenders' areas of expertise. To determine these areas, the paper uses the self-reported expertise areas that Citizendium contributors have to state when signing up for one of that online encyclopedia's topic work groups.
 * A team of Brazilian researchers from Universidade Federal de Minas Gerais presented at JCDL '11 a tool called GreenWiki, designed to "help improve users awareness about the quality of a Wikipedia article as well as their assessment of it".
 * An article in last month's issue of Information Research examined "The search queries that took Australian Internet users to Wikipedia". The results "suggest that Wikipedia is used more for lighter topics than for those of a more academic or serious nature. Significant differences among the various lifestyle segments were observed in the use of Wikipedia for queries on popular culture, cultural practice and science".
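
To give a feel for how a diff built on a longest common substring search (as mentioned in the Revision Toolkit item above) can save space, here is a minimal Python sketch. It is not the authors' Java implementation: the function names and the copy/insert operation format are hypothetical, and the standard library's `difflib.SequenceMatcher.find_longest_match` is used as a stand-in for the substring search.

```python
from difflib import SequenceMatcher


def lcs_diff(old, new):
    """Encode `new` as copy/insert operations against `old`.

    Repeatedly finds the longest substring shared by the two texts,
    emits it as a ("copy", start, length) reference into `old`, and
    recurses on the text to the left and right of the match. Text with
    no counterpart in `old` is stored literally as ("insert", text).
    """
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    ops = []

    def encode(old_lo, old_hi, new_lo, new_hi):
        if new_lo >= new_hi:
            return  # nothing left to encode in this slice of `new`
        m = matcher.find_longest_match(old_lo, old_hi, new_lo, new_hi)
        if m.size == 0:
            ops.append(("insert", new[new_lo:new_hi]))
            return
        encode(old_lo, m.a, new_lo, m.b)                    # before the match
        ops.append(("copy", m.a, m.size))                   # the shared part
        encode(m.a + m.size, old_hi, m.b + m.size, new_hi)  # after the match

    encode(0, len(old), 0, len(new))
    return ops


def apply_diff(old, ops):
    """Rebuild the newer revision from the older one plus the operations."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            _, start, size = op
            parts.append(old[start:start + size])
        else:
            parts.append(op[1])
    return "".join(parts)
```

Storing only the operations for each revision, rather than its full text, is what saves space; `apply_diff` shows that the newer revision can still be recovered. A practical version would probably ignore very short matches and avoid deep recursion on long articles.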

Wikipedia research at OKCon 2011
On June 30 – July 1, the Open Knowledge Foundation held their annual meeting, the Open Knowledge conference (OKCon), this time in Berlin. On the first day, a workshop on Wikipedia & Research took place, organized by Mayo Fuster Morell (member of the Research Committee of the Wikimedia Foundation), who agreed to report back for the Signpost.

The turnout alone sent a message: the room was packed with around 50 people, some of them sitting on the floor. In a tweet, Philipp Schmidt from P2P University commented: "wikipedia research community growing and diversifying. I remember meetings with 5 people, now the room is packed. Great!". The attendance is a sign of strong interest in promoting research around Wikipedia. The good response can be read in two ways: the questions addressed are considered important in themselves, and the timing was right.

Since 2005, there has been an increasing interest within the scientific community in researching Wikipedia. In 2011, ten years after Wikipedia started, research on Wikipedia keeps growing, with a body of research and a community of researchers in place. According to a recent review, there are currently a total of 2,100 peer-reviewed articles and 38 doctoral theses related to Wikipedia. The willingness to collaborate, to make use of synergies between research initiatives of various kinds, and to continue innovating (in what is already becoming one of the leading nodes of methodological innovation) has also increased and continues to mature. It seems that in 2011 and the coming years we will see not only continued quantitative growth, but also a qualitative jump towards a more organized and challenging stage of research initiatives from and around Wikipedia. This can be expected to translate into important changes at the research level, and the initiative of research being promoted by Wikipedia (not only about Wikipedia) is likely to be well received.

During the workshop, Mathias Schindler (from Wikimedia Deutschland) presented the RENDER project, a research project on knowledge diversity and the first instance of a Wikimedia chapter engaging in a large research project with other research partners at the European level.

Mayo Fuster Morell presented how research on Wikipedia had evolved over the years. Early empirical research was dominated by quantitative analyses of large data sets from the English version of Wikipedia; the focus then expanded to other language versions and to a larger variety of issues, such as socio-political questions, and qualitative methods were adopted as well. She also presented the Research Committee (a committee created by the WMF staff, consisting of Wikimedia volunteers, researchers, and Wikimedia Foundation staff, with the mandate to help organize policies, practices and priorities around Wikimedia-related research).

Daniel Mietchen (likewise a member of the Research Committee of the WMF) presented the draft for an open access and open data policy of the WMF as a requirement for research projects receiving significant WMF support.

Benjamin Mako Hill (Wikimedia Foundation Advisory Board member and intellectual property researcher at MIT, among others) was also present, but decided to forgo his planned presentation to allow time for debate. During the discussion, open data was the main topic of interest from the floor. Interest was also expressed in the question of data repositories.

The schedule was tight, and the session ended well before the discussions could reach a conclusion. It is clear that the discussion needs to continue, and that more occasions are needed to meet and work together on Wikipedia research and on promoting another way of doing research.

Wikimedia Summer of Research: Three topics covered so far
The "Wikimedia Summer of Research" (WSoR, see previous coverage) is a three-month program (ending in September), sponsored by the Wikimedia Foundation, which has brought together a team of eight academics working in the Foundation's Community Department. The goal is to study the dynamics of the editing community, starting with English and focusing particularly on which factors can measurably be said to affect the decline in new editors. The following is a short look at three of the many areas studied so far. Other research can be found on Meta and on Commons.

How new English Wikipedians ask for help
The early weeks of research by Jonathan Morgan, R. Stuart Geiger, and Shawn Walker focused on how new editors find and interact with help spaces, both within and outside the Help namespace. A combination of qualitative and quantitative methods has been used to address this issue, but the primary data was gathered through qualitative coding of randomized samples of new editors.

The study's two charts were derived from coding the activities of 445 new Wikipedians sampled across 2009–11.

Who edits trending articles on the English Wikipedia
One question directed at the summer research team was whether trending articles – such as those about breaking current events or featured in "In the news" on the Main Page – attract a significant number of new editors compared with articles not affected by current events. A related question was whether new editors who registered because of interest in unfolding-event articles are more or less likely to become repeat editors of the encyclopedia.

Using a random sample of 20% of the thousands of articles that were trending (in terms of traffic stats) in January 2011, this study by Yusuke Matsubara showed that, perhaps surprisingly, the number of newly registered editors who participate in unfolding-event articles is proportionally quite low. However, participation by anonymous editors was more significant, regardless of semi-protection. This suggests that there may be an opportunity to invite good-faith anonymous contributors on trending articles to participate further by registering accounts.

The workload of new-page patrollers and vandal fighters
One of the theories that has been proposed about the decline in participation by new editors is that newbie biting has increased over the years because more of the burden of policing vandalism, spam, etc. has been shouldered by fewer and fewer active new-page patrollers and vandal fighters, which contributes to burnout. To test this theory, summer researcher Aaron Halfaker looked at the overall workload of new-page patrollers and vandal fighters since 2007. He found that, like many things in Wikipedia, the workload follows a power law, with the top contributors doing most of the work. However, contrary to the hypothesis, the number of patrolling actions per editor (by both month and year) has been decreasing steadily.
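
The two quantities discussed above, patrolling actions per editor and the concentration of work among top contributors, could be tabulated from an action log along the following lines. This is a hypothetical Python sketch rather than the analysis actually run: the input format (one `(month, editor)` pair per patrolling action) and the function name are assumptions, and the share of the single busiest editor is used only as a crude proxy for concentration.

```python
from collections import Counter, defaultdict


def workload_by_month(actions):
    """Summarize patrolling workload per month.

    actions: iterable of (month, editor) pairs, one per patrolling action
             (assumed input format, e.g. ("2007-01", "ExampleUser")).

    Returns {month: (mean actions per active editor,
                     share of actions done by the busiest editor)}.
    """
    # Count actions per editor within each month.
    per_month = defaultdict(Counter)
    for month, editor in actions:
        per_month[month][editor] += 1

    summary = {}
    for month, counts in sorted(per_month.items()):
        total = sum(counts.values())
        mean_per_editor = total / len(counts)      # workload per active editor
        top_share = max(counts.values()) / total   # crude concentration measure
        summary[month] = (mean_per_editor, top_share)
    return summary
```

A falling first value over successive months would correspond to the reported decrease in actions per editor, while a consistently high second value would reflect the concentration of work among a few contributors.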