Wikipedia:Wikipedia Signpost/2012-05-28/Recent research

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, edited jointly with the Wikimedia Research Committee and republished as the Wikimedia Research Newsletter.

Discourse on Wikipedia sometimes irrational and manipulative, but still emancipating, democratic and productive
An article in sociology journal The Information Society looks at interactions between Wikipedia editors and the project's governance, visible in the articles on stem cells and transhumanism, and in the analysis of Wikipedia's discussion of userboxes, all through the prism of Jürgen Habermas' universal pragmatics and Mikhail Bakhtin's dialogism theories.

The authors focus on the qualitative analysis of language used by editors, to argue that Wikipedia has elements of a democracy, and is an example of a Web 2.0–empowering discourse tool. They stress that some forms of discourse found online (including on Wikipedia) may be highly irrational, something that some previous arguments that Web 2.0 is a democratic space have often ignored, but they argue that this is in fact not as much of a hindrance as previously expected. Cimini and Burr remark that discourse can develop between Wikipedians of widely differing points of view, and that some editors will engage in "repeated, strategic, and often highly manipulative attempts" to assert personal authority. Such discussions may be very lively, involving "personal, emotional, or humour-based arguments", yet the authors argue that such comments may not be a hindrance; instead, "on many occasions, there is thus a clearer exposition of views that is achieved, in spite of, or perhaps because of, these personal [and] sometimes vulgar methods of argumentation."

In the end, the authors are positive about the success of Wikipedia's deliberation in reaching consensus, although they say that it can be "fleeting and transitory" on occasion. Unfortunately, the paper does not touch on Wikipedia policies such as Civility and No personal attacks, which would certainly have added to their analysis.

Although the paper states that the research was approved by a university research ethics committee, it critically discusses the postings of specifically named editors ("[Editor A's] claim to authority and ad hominem attacks were met with derision by [Editor B]"; names replaced by the Signpost), which may raise eyebrows. Not all editors are fully anonymous, which raises the question of whether the researchers did enough to protect the identity and reputation of the editors they cite. At the very least, why weren't the editors' usernames changed in the quotes? Their direct identification adds nothing to the article, and may expose the users to attack. (Similar questions have been discussed in the past by members of the Wikimedia Foundation Research Committee.)

Different language Wikipedias: automatic detection of inconsistencies
In a paper presented at the 4th International Conference on Intercultural Collaboration (ICIC), Kulkarni et al. offer a simple approach to support the work of Wikipedia editors who maintain articles concerning the same topic in multiple language versions. The long-term goal is to implement a bot that supports these specialized users by highlighting missing attributes and content inconsistencies. The analysis focused on a pairwise comparison of infoboxes in different languages. First, the attribute-value pairs were extracted from the infoboxes and translated into English via Google Translate. Matching attribute names were identified through direct text comparison, supplemented by a set of synonyms obtained from WordNet (this step was included to handle mismatches caused by translation errors and variations). In a second step (the matching of attribute values), the authors again used direct text comparison, and checked whether the values could be identified as homophones, to exclude mismatches caused by spelling mistakes in the text.

The data set used to evaluate these analyses and the whole pipeline included articles from the English, German, Chinese and Hindi Wikipedias concerning two restricted domains: Indian cities and US-based companies. The evaluation revealed "a significant increase in recall after the concepts of homophones and synonyms were applied in addition to the direct text comparison." But the overall result was very weak, mainly due to translation errors. The authors noticed syntactic and semantic differences between the infoboxes, such as paraphrasing or different fact representations. "Also, abbreviations, unit conversion and geographic location matching [was not handled by their system]." The researchers plan to improve the system by addressing all of these issues in turn.
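
The two matching steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the synonym groups are a hypothetical stand-in for a WordNet lookup, `phonetic_key` is a crude sound-alike heuristic chosen for illustration rather than the paper's homophone test, and the infobox values are made-up examples.

```python
import re

# Hypothetical synonym groups standing in for a WordNet lookup.
SYNONYMS = [
    {"population", "inhabitants", "residents"},
    {"founded", "established"},
]


def normalize(text):
    """Lowercase and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def names_match(a, b):
    """Direct text comparison first, then a synonym-group lookup."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return True
    return any(a in group and b in group for group in SYNONYMS)


def phonetic_key(text):
    """Crude sound-alike key (an assumption, not the paper's method):
    keep the first letter, then skip vowels, h, w, y and any letter
    equal to the last letter kept."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    if not letters:
        return ""
    key = letters[0]
    for ch in letters[1:]:
        if ch in "aeiouhwy" or ch == key[-1]:
            continue
        key += ch
    return key


def values_match(a, b):
    """Direct comparison, then the sound-alike key for purely textual values."""
    a_norm, b_norm = normalize(a), normalize(b)
    if a_norm == b_norm:
        return True
    # Never treat differing numbers as spelling variants of each other.
    if any(ch.isdigit() for ch in a_norm + b_norm):
        return False
    key = phonetic_key(a_norm)
    return key != "" and key == phonetic_key(b_norm)


def compare_infoboxes(box_a, box_b):
    """Return (missing, inconsistent) for two dicts of attribute -> value."""
    missing, inconsistent = [], []
    for name_a, value_a in box_a.items():
        partner = next((n for n in box_b if names_match(name_a, n)), None)
        if partner is None:
            missing.append(name_a)
        elif not values_match(value_a, box_b[partner]):
            inconsistent.append((name_a, value_a, box_b[partner]))
    return missing, inconsistent


# Made-up example: an English infobox and a machine-translated counterpart.
english = {"Population": "8443675", "State": "Karnataka",
           "Founded": "1537", "Mayor": "Example Person"}
translated = {"Inhabitants": "8425970", "State": "Karnatak",
              "Established": "1537"}
missing, inconsistent = compare_infoboxes(english, translated)
```

In this toy run, the misspelled state name is accepted through the sound-alike key, the differing population figures are flagged as inconsistent, and the mayor is reported as missing from the second infobox. A real system would also need the unit conversion and abbreviation handling that the paper lists as open issues.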

Finding deeper meanings from the words used in Wikipedia articles
An undergraduate computer science honors thesis at Trinity University (Texas) constructs a semantic graph from 451 articles linked to from the World War II article. Ryan Tanner's goal is to produce a visualization "which allows one to quickly find and examine connections between the people, places and things described in Wikipedia". Originally the goal was to visualize the whole of Wikipedia; however, due to problems with the dump, only 250,000 articles out of about 1.5 million were imported. An even smaller subset was ultimately usable, since the Stanford NLP library crashed on many of the remaining articles because of markup issues that would have required manual cleanup. To ensure a dense graph, tests were focused on the network of the World War II article. Some brief examples of the resulting graph are given in Chapter 10, which notes false positives as one problem requiring further investigation. The author makes suggestions for future research, such as using the Simple English Wikipedia or more complex relations. The process was as follows:
 * 1) Import SQL dump from the Wikimedia Foundation into a local database
 * 2) Strip wiki markup from the articles using Bliki
 * 3) Parse articles with the Stanford NLP, using dependency grammars to extract facts and simplify sentences
 * 4) Parse the output from the Stanford library using Scala
 * 5) Read a Stanford XML file into a collection of models.
 * 6) Produce abstractions for named entities and locations.
 * 7) Input models into the algorithm developed for this thesis (see Chapter 7)
 * 8) Store results in a database.
 * 9) Traverse the resulting graph and produce user-presentable output.
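
The final steps of the list above (storing extracted facts as a graph and traversing it to find connections) can be sketched as follows. This is only an illustration in Python, not the thesis's Scala code or its Chapter 7 algorithm; the function names and the three example facts are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(facts):
    """Build an undirected adjacency list from (subject, relation, object) facts."""
    graph = defaultdict(list)
    for subject, relation, obj in facts:
        graph[subject].append((relation, obj))
        # Store the reverse edge too, so traversal can go either way.
        # The original direction of the relation is not preserved.
        graph[obj].append((relation, subject))
    return graph


def find_connection(graph, start, goal):
    """Breadth-first search; returns a list of (entity, relation, entity)
    hops from start to goal, or None if they are not connected."""
    if start == goal:
        return []
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for relation, neighbour in graph.get(node, []):
            if neighbour in previous:
                continue
            previous[neighbour] = (node, relation)
            if neighbour == goal:
                # Walk back to the start to reconstruct the path.
                path, current = [], goal
                while previous[current] is not None:
                    parent, rel = previous[current]
                    path.append((parent, rel, current))
                    current = parent
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical facts of the kind a dependency parser might extract.
facts = [
    ("Winston Churchill", "led", "United Kingdom"),
    ("United Kingdom", "fought in", "Battle of Britain"),
    ("Luftwaffe", "fought in", "Battle of Britain"),
]
graph = build_graph(facts)
path = find_connection(graph, "Winston Churchill", "Luftwaffe")
```

Breadth-first search returns a shortest chain of facts between two entities, which is one simple way to produce "user-presentable output"; the false positives mentioned in the thesis would show up here as spurious edges in `facts`.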

How leaders emerge in the Wikipedia community
A paper titled "Leading the Collective: Social Capital and the Development of Leaders in Core-Periphery Organizations" looks at how leaders emerge in Wikipedia and similar crowd-based organizations. While often seen as egalitarian and with little hierarchy, such projects always have a group of leaders who have emerged from the community (the "crowd") and are involved in planning, mediation, and policy development. The authors treat Wikipedia and similar organizations as instances of the core–periphery network model developed by Steve Borgatti: a system with a deeply interconnected center and a poorly connected periphery. In Wikipedia, the leaders ("core") comprise the most active contributors, and the authors assume they produce the most social capital. Using social network analysis, the paper looks at the interpersonal ties between the editors, focusing on the ties between leaders and periphery. The hypothesis is that specific types of ties will have a greater influence on advancement to leadership.

The authors collected data from RfA pages, and the ties were measured through user-talk-page interactions. Leaders were defined as admins, and periphery editors as non-administrators; this operationalization may raise some doubts about its validity, since some very active and prominent members of the community are not admins, something the authors do not address. The authors find that the most important ties are the early ones to the periphery, and later, ties to the leaders. Overall, strong ties are not as important as weak ties, although Simmelian ties (between pairs of leader groups) are among the most important.

Collier and Kraut conclude that leaders in projects such as Wikipedia do not suddenly appear; instead, they evolve over time through their immersion in the project's social network. Early in their experience, those leaders gain a deeper understanding of the community, developing a network of contacts through their weak ties to the periphery; later, their most important ties are to the leaders, particularly in the form of strong connections to a leader group.
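
To make the tie measures more concrete, the following Python sketch counts one editor's weak and strong ties to the core (admins) and the periphery (non-admins) from user-talk-page messages. It is a rough illustration only: the function name, the message format, and the threshold of three messages for a "strong" tie are assumptions, and the paper's actual measures (including Simmelian ties) are not reproduced here.

```python
from collections import Counter


def summarize_ties(messages, editor, admins, strong_threshold=3):
    """Count an editor's ties by group (core/periphery) and strength.

    messages: iterable of (sender, recipient) user-talk-page posts.
    admins:   set of user names treated as the core.
    A tie counts as strong once the two users have exchanged at least
    strong_threshold messages (an arbitrary, illustrative cut-off).
    """
    contacts = Counter()
    for sender, recipient in messages:
        if sender == editor and recipient != editor:
            contacts[recipient] += 1
        elif recipient == editor and sender != editor:
            contacts[sender] += 1

    summary = Counter()
    for other, count in contacts.items():
        group = "core" if other in admins else "periphery"
        strength = "strong" if count >= strong_threshold else "weak"
        summary[(group, strength)] += 1
    return summary


# Made-up example data.
messages = [("Ann", "Bob"), ("Bob", "Ann"), ("Ann", "Bob"),
            ("Ann", "Cy"), ("Dee", "Ann")]
summary = summarize_ties(messages, "Ann", admins={"Bob", "Dee"})
```

Running the same count separately on an editor's early and later messages would give the kind of before-and-after comparison the findings describe.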

Identifying software needs from Wikipedia translation discussions
A paper presented at an international conference on intercultural collaboration aims "to identify the type of community interaction needed for successfully creating or amending an article via Wikipedia translation activities", and proposes new software tools to facilitate these interactions. To this end, the researchers from Kyoto University analyzed 1694 talk-page comments from three Wikipedias, belonging to articles in categories marking (partial or complete) translations (e.g. fr:Catégorie:Projet:Traduction/Articles_liés): 228 articles from the Finnish, 93 from the French, and 94 from the Japanese Wikipedia. They attempted to categorize (code) each comment according to which "activity" it referred to (either editing the article or translating it), which "context" it was referring to (using the categories "content", "layout", "sources", "naming", "significance" and "wording"), and which action was intended (requesting or providing help, requesting an edit, announcing an edit that the user had made, criticizing the article without a direct request for action, coordinating actions between users, or referring to an established Wikipedia policy).

Regarding comments focused on the activity of editing, the "results were consistent with previous research, with a high frequency of discussion contributions about content and layout". The authors found that "the Japanese Wikipedia was the only one with more discussion contributions about layout than content when the discussion was about editing activities (40.18%)" and speculate that this is because "in the older, or larger, Wikipedias, practices and policies are likely to be better established than in the younger, or smaller, Wikipedias leading to a lower frequency of discussions about layout." (However, they later point out that the Finnish Wikipedia, rather than the Japanese, is the smallest and youngest among the three examined ones, noting that it shows a much higher frequency of discussion about policy—15.0%, versus 6.0% on the French and 3.3% on the Japanese Wikipedia.) In this class of comments, "discussions about citing sources were relatively common in the Finnish and French Wikipedias (18.8% and 12.4%, respectively). In the Japanese Wikipedia, sources were less common with 7.1% of all discussion contributions regarding editing activities."

Most discussions about translation activities were about naming, that is, "resolving the proper form for the title of the article, section or sub-section, names or proper nouns, and transliteration in the corresponding article", contradicting the researchers' initial hypothesis that such discussion would "have a high frequency of contributions regarding translation of specific words and expressions" (their "naming" category "does not include phrasing or resolving proper translation of individual words or expressions"). As one reason, they identify "the diversity in naming practices of events between different language sources, such as mass media. Especially in the Finnish Wikipedia, discussion about sources was common (16.15%). These two topics are loosely related, as direct translations of the names of well-known events are often not acceptable in the target language Wikipedia."

Having identified naming issues and the search for suitable sources in the target language as "key problems" emerging in the translation discussions, the authors conclude that "the current approaches for supporting Wikipedia translation are not necessarily solving the main problems in Wikipedia translation" and proceed to suggest two "directions for designing supporting tools for Wikipedia translation, especially through open source development of MediaWiki extensions":
 * "Support for consistent translation of names and proper nouns", e.g. by making a "user editable multilingual dictionary resource" directly accessible in the design, and enabling editors to "coordinate through discussion pages directly related to a specific dictionary or dictionary entry in order to resolve inconsistencies in a centralized repository"
 * "Support for citing sources in translated articles", by offering an automatic search for sources that have themselves been translated into the target language and/or the development of a supporting "crowdsourcing translation tool for open content sources not available in the target language using machine translation"

The paper makes references to previous work on Wikipedia translation (including the authors' own), but does not mention the EU-supported CoSyne project, which aims to integrate tools with MediaWiki that "automate the dynamic multilingual synchronization process of Wikis" and would seem to have a lot of overlap with the kind of tools discussed in the paper.

New algorithm provides better revert detection
A paper by three researchers affiliated with the EU-supported RENDER project (to be presented at next month's "Hypertext 2012" conference) promises "accurate revert detection in Wikipedia". The article starts by describing the detection of reverts as "a foundational step for many (more elaborated) research ideas, [whose] purposeful handling leads to a superior understanding of wiki-like systems of collaboration in general", giving an overview of such research. (Revert detection has also been used in tools for the editing community, such as this one that identifies articles on the German Wikipedia that are currently controversial.)

Surveying the "state-of-the-art in revert detection", the authors criticize the prevalent "identity revert detection method" (SIRD), which relies on finding identical revisions using MD5 hashes, arguing that it does not fully match the definition of a revert in the (English) Wikipedia's policies at Reverting: The SIRD method "does not require the reverting edit to actually undo the actions of an edit identified as reverted ... [Furthermore, it] is not possible to indicate if the reverting edit fully, partly or not at all undid the actions of the reverted edit ... It also does not require the intention of the reverting edit to revert any other edit." (Still, mainly due to requests by researchers, MD5 hashes have recently been integrated directly into the revision table stored by MediaWiki, necessitating considerable technical efforts when updating the existing databases for Wikimedia projects.)
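
The identity-based approach criticized above is easy to state in code. The following Python sketch illustrates the general idea (hash every revision, and treat a revision whose hash matches an earlier, non-adjacent one as reverting everything in between); the function name and the return format are assumptions made for this example, not taken from the paper.

```python
import hashlib


def identity_reverts(revisions):
    """Detect identity reverts in a chronological list of revision texts.

    Returns a list of (reverting_index, restored_index, reverted_indices).
    """
    last_seen = {}  # MD5 digest -> index of the latest revision with that text
    reverts = []
    for index, text in enumerate(revisions):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        earlier = last_seen.get(digest)
        # An identical immediate predecessor is a null edit, not a revert.
        if earlier is not None and earlier < index - 1:
            reverted = list(range(earlier + 1, index))
            reverts.append((index, earlier, reverted))
        last_seen[digest] = index
    return reverts


# Made-up page history: revision 3 restores the text of revision 1.
history = ["A", "A B", "A B vandalism", "A B", "A B C"]
found = identity_reverts(history)
```

Note that every revision between the two identical ones is marked as reverted, whether or not the reverting editor touched or intended to undo it, which is the weakness the authors point out.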

The paper then presents the authors' new method for revert detection, which still aims to detect full reverts and to avoid false positives, while coming closer to the Wikipedia community's definition. It is implemented as an algorithm based on splitting the revisions' wikitext into word tokens (and made available online as a Python script). Also, MD5 hashes are still used on a paragraph level to be able to detect unchanged paragraphs easily and speed up computation. The algorithm was then evaluated by a panel of Wikipedians recruited on the English Wikipedia in comparison with the existing SIRD method.
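
The two ingredients named above, paragraph-level MD5 hashes to skip unchanged text and word tokens to compare the rest, can be combined as in the following Python sketch. It is a deliberately simplified illustration and not the authors' published script: it compares bags of word tokens and ignores their positions, and all function names are hypothetical.

```python
import hashlib
import re
from collections import Counter


def paragraphs(text):
    """Split wikitext into paragraphs on blank lines."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def md5(paragraph):
    return hashlib.md5(paragraph.encode("utf-8")).hexdigest()


def changed_paragraphs(reference, candidate):
    """Paragraphs of candidate whose hash does not occur in reference."""
    known = {md5(p) for p in paragraphs(reference)}
    return [p for p in paragraphs(candidate) if md5(p) not in known]


def tokens(paragraph_list):
    """Bag of word and punctuation tokens for a list of paragraphs."""
    return Counter(t for p in paragraph_list
                   for t in re.findall(r"\w+|[^\w\s]", p))


def word_changes(old, new):
    """Return (added, removed) token counts, looking only at paragraphs
    that the hash comparison flagged as changed."""
    old_tokens = tokens(changed_paragraphs(new, old))
    new_tokens = tokens(changed_paragraphs(old, new))
    return new_tokens - old_tokens, old_tokens - new_tokens


def fully_undoes(before, after_edit, after_candidate):
    """True if the candidate edit removes exactly what the earlier edit
    added and restores exactly what it removed (position is ignored)."""
    added, removed = word_changes(before, after_edit)
    cand_added, cand_removed = word_changes(after_edit, after_candidate)
    return bool(added or removed) and added == cand_removed and removed == cand_added


# Made-up three-revision history.
v0 = "Cats are mammals.\n\nThey purr."
v1 = "Cats are stupid mammals.\n\nThey purr."
v2_full = "Cats are mammals.\n\nThey purr."
v2_other = "Cats are stupid animals.\n\nThey purr."
```

Here `fully_undoes(v0, v1, v2_full)` holds, while `fully_undoes(v0, v1, v2_other)` does not, because the second candidate changes a different word. Unlike the hash-only approach, this comparison asks whether the later edit actually undid the earlier edit's word changes.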

As summarized by the authors, this user study found the new method to be "more accurate in identifying full reverts as understood by Wikipedia editors. More importantly, our method detects significantly fewer false positives than the SIRD method [27% in the sample, which however was somewhat small]". As a drawback, the authors note "the increased computational cost. As [the new algorithm] is quadratic over the number of words in the DIFFs [the changed text between subsequent revisions], in its current implementation it might not be the tool of choice if larger amounts of articles are to be analyzed; especially in the case of complete history dumps of the large Wikipedias, e.g., English, German or Spanish."

Briefly

 * The history of art mapped using Wikipedia: A paper by four researchers from Vienna, to be presented at next month's WebSci 2012 conference, examines the wikilinks between 18,002 Wikipedia articles about artists (or more precisely "art-historical actors", derived from the English Wikipedia via DBpedia), from present times back to ancient Greece. A first result appears to confirm the assumption that artists are more likely to be influenced by or related to their contemporaries: "the number of short links covering 0–37.5 years clearly outnumbers the sum of all the other .... This can be interpreted as such that contemporaries are much more likely to be interlinked than persons who are generations apart". They present a visualization of the link graph colored by nationality of the person, which "reveals interesting patterns of cultural interaction within the network, as they are perceived by the English speaking Wikipedia community: The left side ... is dominated by Italians (green). This cluster spans Renaissance and Baroque times, fading out by the end of the 17th century. A small cluster on the lower left represents German Renaissance around Albrecht Duerer (black) ... The rightmost part represents Post-Modernist Americans, with a nationality-independent cluster of Architects beneath."
 * The use of references in Wikipedia coverage of current events: On the blog of Ushahidi, Wikimedia researcher Heather Ford described preliminary "key findings" from an ongoing project examining the use of sources in Wikipedians' work on current events such as the 2011 Egyptian Revolution: "1. The source of the page can play a significant role 2. Primary sources are gradually replaced by secondary sources," 3. The cite is not always the same as the source ("the citation that editors use to back up a particular phrase are not always the same as the source from which they receive their information"), 4. The blurring of boundaries along traditional “reliable sources” lines. Her "design recommendations include the design of source management systems around the kind of collaboration that is already working on Wikipedia: where editors collaborate around specific news stories, checking to see whether the source actually reflects the information in the article, whether the source is accurately contextualized, whether other media verify the facts in the article and whether there is any accompanying multimedia."
 * Distribution of article title lengths: A statistical analysis of the length of the more than 40 million article titles on all Wikipedias (including redirects) found that 90% are shorter than 32 characters and 98% are shorter than 53 characters. The blog post by Denny Vrandečić, head of the Wikidata development team (who generated those stats to inform some design decisions for this project), provides charts of the length distribution for each language, exhibiting some interesting differences and similarities (e.g. the distributions for the English, German, French, Polish and Russian Wikipedias, as well as the overall one, peak around 13 characters).
 * To understand a Wikipedia article, which others does one need to read first?: A paper titled "Crowdsourced Comprehension: Predicting Prerequisite Structure in Wikipedia" starts from the assumption that "the primary reason that technical documents are difficult to understand is lack of modularity: unlike a self-contained document written for a general reader, technical documents require certain background knowledge to comprehend – while that background knowledge may also be available in other on-line documents, determining the proper sequence of documents that a particular reader should study is difficult". Trying to develop a method to solve this problem in the example of five Wikipedia articles (global warming, meiosis, Newton's laws of motion, parallel postulate and public-key cryptography), the researchers analyzed the structure of wikilinks, whether pages had been edited by the same users, and the page text itself, and had Mechanical Turk workers decide in advance for many pairs of linked articles (within a subject domain) whether one was a prerequisite to understand the other. They conclude that "while it is not immediately obvious that this task is feasible, our experiments suggest that relatively reliable features to predict prerequisite structure exist, and can be successfully combined using standard machine learning methods".
 * High-conflict areas may deter uninvolved users: A student thesis from Macalester College, titled "Characterizing Conflict in Wikipedia", examines editing disputes between Wikipedians that concern several articles, pointing out that much of the previous research has only looked at such conflicts one article at a time. The analysis involved clustering 1.4 million articles. Among the conclusions is that "The vast majority of conflicts are very small, but there are still thousands of conflicts involving at least one hundred users. Conflicts between small numbers of users, or with small numbers of reverts, tend to span only one article, whereas larger conflicts tend to span more than one article." Also, within a conflict cluster, "contributions from users uninvolved in conflicts are even lower than those involved in conflicts. This indicates that users may be deterred from contributing to areas with high concentrations of conflict."
 * The vandalism revert and other temporal motifs, and their change from 2001 to 2011: A paper presented at ICWSM '12 looks, like several other recent papers, at the bipartite graph of editors and the articles they have edited, but enriches it "with temporal information of both who edited the article [discerning bots, IP editors, and admins], and how the article was changed [. This] enables discovering meaningful editing behavior in the form of network motifs. These temporal motifs are repeated subgraphs of the editing graph which correspond to significant patterns of collaborative interactions." (The concept of network motifs is popular in bioinformatics, where it is applied to gene regulatory networks. See also the review of an earlier paper applying a simpler kind of motif to analyze the editors-articles graph: "Collaboration pattern analysis: Editor experience more important than 'many eyes'".) Motifs involving just a single author were the most frequent. As an example of the patterns that become visible by including temporal information, among the multi-author motifs those involving a revert "occur much faster, with 6,558 of all 13,961 such motifs having a median time under 5 minutes ... The strong correspondence between reverting an edit and combating vandalism suggests that such short durations are due to active participation by Wikipedia community members, such as the Counter Vandalism Unit, which actively monitors recent revisions for potential vandalism". The authors then look at how the frequency of their motifs has changed over the history of Wikipedia from 2001 to 2011, and find that "the trends suggest that the early growth was fueled by content addition from single authors or collaborating between two authors (B) and contributions from administrators. These early behaviors have given way to increases in behaviors associated with editing (A) and maintaining quality or vandalism detection (D)."
 * The Wikipedia research behind Google's new Knowledge Graph?: On May 16, Google introduced its Knowledge Graph, a semantic network drawing information from many different sources including Wikipedia, which Google uses to enhance its search engine results with semantic information—often appearing to include excerpts from the infobox and lead section of a particular Wikipedia article on the top right corner of the results page. Two days later, Google Research announced a paper by two Google employees titled "A Cross-Lingual Dictionary for English Wikipedia Concepts" describing the construction of "a resource for automatically associating strings of text [such as search terms] with English Wikipedia concepts", considering "each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL". The resulting dataset is available for download and described as having been "designed for recall [rather than precision]. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links".
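
The aggregation of individual links into string-concept pairs mentioned in the last item can be sketched in a few lines of Python. This is only an illustration of the general idea, not the dataset's actual construction: the function names are hypothetical, the lowercasing of strings and the relative-frequency score are assumptions, and the example links are made up.

```python
from collections import Counter, defaultdict


def build_dictionary(links):
    """Aggregate (anchor text, article identifier) links into
    {string: Counter({article: number of links})}."""
    table = defaultdict(Counter)
    for anchor, article in links:
        # Lowercasing is a simplifying assumption made for this sketch.
        table[anchor.strip().lower()][article] += 1
    return table


def rank_concepts(table, string):
    """Return (article, share of links) pairs for a string, most common first."""
    counts = table.get(string.strip().lower())
    if not counts:
        return []
    total = sum(counts.values())
    return [(article, n / total) for article, n in counts.most_common()]


# Made-up links: anchor text and the article it points to.
links = [
    ("Jaguar", "Jaguar"),
    ("jaguar", "Jaguar"),
    ("Jaguar", "Jaguar_Cars"),
    ("Big Apple", "New_York_City"),
]
table = build_dictionary(links)
ranked = rank_concepts(table, "JAGUAR")
distinct_pairs = sum(len(counts) for counts in table.values())
```

In this toy example four individual links collapse into three distinct string-concept pairs. Keeping every pair, however rare, favors recall over precision, in line with the dataset's stated design goal.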