Wikipedia:Wikipedia Signpost/2013-06-26/Recent research

"The most controversial topics in Wikipedia: a multilingual and geographical analysis"
A comparative study by T. Yasseri, A. Spoerri, M. Graham and J. Kertész on controversial topics in different language versions of Wikipedia has recently been posted on the Social Science Research Network (SSRN) online scholarly archive. The paper will appear as a chapter of the upcoming book "Global Wikipedia: International and cross-cultural issues in online collaboration", edited by P. Fichman and N. Hara and to be published by Scarecrow Press in 2014. It looks at the 100 most controversial topics in 10 language versions of Wikipedia (results including 3 additional languages are reported in the blog of one of the authors), and tries to make sense of the similarities and differences in these lists. Several visualization methods are proposed, based on a Flash-based tool developed by the authors, called CrystalView. Controversiality is measured using a scalar metric which takes into account the total volume of pairwise mutual reverts among all contributors to a page. This metric was proposed by Sumi et al. (2011), in a paper reviewed two years ago in this newsletter ("Edit wars and conflict metrics"). Topics related to politics, geographical locations, and religion are reported to be the most controversial across the board. Each language also seems to feature specific, local controversies, which the authors track down further by grouping together languages with similar spheres of influence. Furthermore, the presence of latitude/longitude information (geocoordinates) in several of the Wikipedia articles in the sample let the authors plot the most controversial topics on a world map, showing how each language features both local and global issues among its most heated topics of debate.

In summary, the study shows how valuable information about cross-cultural differences can be extracted from traces of Internet activity. One obvious question is how the demographics of Wikipedia editors affect the representativeness of the results. The authors seem to be aware of this issue, which is likely to become increasingly important as the field of cultural studies looks more and more at data generated by peer production communities.

The research has been widely covered in the media, e.g., by Huffington Post, Live Science, Wired.com, and Zeit Online.



Sockpuppet evidence from automated writing style analysis
"A Case Study of Sockpuppet Detection in Wikipedia", presented at a "Workshop on Language in Social Media" this month, describes an automated method to analyze the writing style of users for the purpose of detecting or confirming sockpuppets. The abuse of multiple accounts (also known as "multi-aliasing" or sybil attacks in other contexts) is described as "a prevalent problem in Wikipedia, there were close to 2,700 unique suspected cases reported in 2012."

The authors' approach is based on existing authorship attribution research (cf. stylometry, writeprint). In a very brief overview of such research, the authors note that data from real-life cases is usually hard to come by, so that most papers test attribution methods on text that was collected for different purposes and comes from authors who were not deliberately trying to evade detection. On Wikipedia, by contrast, "there is a real need to identify if the comments submitted by what appear to be different users belong to a sockpuppeteer".

Using the open-source machine learning tool Weka, the authors developed an algorithm that analyzes users' talk page comments using "239 features that capture stylistic, grammatical, and formatting preferences of the authors" - e.g. sentence lengths, or the frequency of happy emoticons (i.e. ":)" and ":-)"). Apart from features whose use is established in the literature, they add some of their own, e.g. counting errors in the usage of "a" and "an".
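To make the kind of features mentioned above more concrete, here is a minimal Python sketch that extracts three of them from a single talk page comment: sentence lengths, happy emoticons, and a naive count of "a"/"an" errors. It is not the authors' feature extractor (which the review says comprises 239 features and feeds into Weka). The function name `style_features`, the sentence-splitting heuristic, and the letter-based a/an rule are all assumptions made for illustration.

```python
import re


def style_features(comment: str) -> dict:
    """Extract a few illustrative stylometric features from one comment.

    Hypothetical helper: the heuristics below are deliberately crude and
    are NOT taken from the paper under review.
    """
    text = comment.strip().lower()

    # Crude sentence split: break after ".", "!" or "?" followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    lengths = [len(s.split()) for s in sentences]

    # Happy emoticons as named in the review: ":)" and ":-)".
    happy = len(re.findall(r":-?\)", text))

    # Naive a/an check based on the first letter of the following word.
    # This misjudges cases such as "an hour" or "a university".
    a_before_vowel = len(re.findall(r"\ba\s+[aeiou]", text))
    an_before_consonant = len(re.findall(r"\ban\s+[b-df-hj-np-tv-z]", text))

    return {
        "sentence_count": len(sentences),
        "mean_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "happy_emoticons": happy,
        "a_an_errors": a_before_vowel + an_before_consonant,
    }


features = style_features("I fixed a error. It was an small typo :) Thanks :-)")
print(features)
```

In a real system, one such feature vector per comment would be passed to a classifier trained on the suspected sockpuppeteer's comments.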

The paper examines 77 real-life sockpuppet cases from the English Wikipedia - 41 where the suspected use of sockpuppets was confirmed by "the administrator’s verdict" (presumably most of them based on Checkuser evidence), and 36 where it was rejected. For each case, the algorithm was first trained on talk page comments by the suspected sockpuppeteer (main account), and then tested on comments by the suspected sockpuppet (alternate account). On average, fewer than 100 talk page messages per case were used to train or test the algorithm.

The system achieved an accuracy of 68.83% in the tested cases (for comparison, simply always confirming the suspected sockpuppet abuse would have achieved 53.24% accuracy on the same test cases). After adding features based on the user's edit frequency by time of day and day of the week, it achieved 84.04% confidence when tested on a smaller subset of the cases.
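The baseline figure can be checked from the case counts given earlier (41 confirmed and 36 rejected cases). The short Python sketch below does that arithmetic. The number of correctly classified cases it derives from the 68.83% figure is an inference made here, not a number stated in this review.

```python
def always_confirm_accuracy(confirmed: int, rejected: int) -> float:
    """Accuracy of a trivial classifier that labels every case a sockpuppet."""
    return confirmed / (confirmed + rejected)


CONFIRMED_CASES = 41  # suspicion confirmed by the administrator's verdict
REJECTED_CASES = 36   # suspicion rejected
TOTAL_CASES = CONFIRMED_CASES + REJECTED_CASES

baseline = always_confirm_accuracy(CONFIRMED_CASES, REJECTED_CASES)

# Inferred, not reported: how many correct decisions 68.83% would correspond to.
implied_correct = round(0.6883 * TOTAL_CASES)

print(f"total cases: {TOTAL_CASES}")
print(f"always-confirm baseline: {baseline:.4f}")
print(f"implied correct decisions at 68.83%: {implied_correct} of {TOTAL_CASES}")
```

Under these counts, the reported accuracy is about 15 percentage points above the trivial baseline.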

The authors remark in the introduction that "relying on IP addresses is not robust, as simple counter measures can fool the check users". In this reviewer's opinion, this probably underestimates the effort needed (for example, DSL or cable users simply resetting their modem to obtain a different dynamic IP most likely will not "fool" Checkusers). Still, a later part of the paper treats rejections of sockpuppet cases as definite proof that the accounts were not sockpuppets. The authors are thus possibly ignoring cases where a sockpuppeteer managed to avoid generating Checkuser evidence - in other words, some of the results counted as false positives in this methodology might actually have been correct.

Looking forward, the authors write: "We are aiming to test our system on all the cases filed in the history of the English Wikipedia. Later on, it would be ideal to have a system like this running in the background and pro-actively scanning all active editors in Wikipedia, instead of running in a user triggered mode." If all the resulting similarity scores were made public, it is doubtful that this would remain uncontroversial - many editors (especially on the German Wikipedia) are uncomfortable with the publication of aggregated analysis data about their editing behavior, even if it is based purely on information that is already public; compare the current RfC on Meta about X!'s Edit Counter.

The authors state that "to the best of our knowledge, we are the first to tackle [the problem of real sockpuppet cases in Wikipedia]" with this kind of stylometric analysis. This may only be accurate in an academic context. For example, in a high-profile sockpuppet investigation on the English Wikipedia in 2008, User:Alanyst applied the tf-idf similarity measure to the aggregated edit summaries of all users who had made between 500 and 3500 edits in 2007. (This measure compares the relative word frequencies in two texts.) The analysis confirmed the sockpuppet suspicion against two accounts A and B: Account B came out closest to A, and account A 188th closest to B (among the 11,377 tested accounts). For an overview of this and other methods developed by Wikipedians to evaluate the validity of sockpuppet suspicions, see the slides of this reviewer's talks at and the  Chaos Congress 2009.
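The idea of ranking accounts by tf-idf similarity of their aggregated edit summaries can be sketched in a few lines of Python. This is a generic illustration, not a reconstruction of Alanyst's analysis: the whitespace tokenization, the particular tf-idf weighting, the use of cosine similarity to compare vectors, and the toy accounts and edit summaries are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs: dict) -> dict:
    """Map each account name to a sparse tf-idf vector of its edit summaries."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many accounts' summaries does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_closest(target: str, vectors: dict) -> list:
    """Other accounts sorted from most to least similar to `target`."""
    others = [name for name in vectors if name != target]
    return sorted(others, key=lambda n: cosine(vectors[target], vectors[n]),
                  reverse=True)


# Toy data: aggregated edit summaries of four hypothetical accounts.
summaries = {
    "A": "rv vandalism fix typo rv vandalism",
    "B": "rv vandalism typo fix",
    "C": "add reference expand section add image",
    "D": "expand section copyedit",
}
vectors = tfidf_vectors(summaries)
print(rank_closest("A", vectors))
print(rank_closest("B", vectors))
```

Note that such rankings need not be symmetric: as in the case described above, one account can be the closest match to another without the reverse being true, because each ranking is relative to a different account's set of neighbors.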

Adjusting automatic quality flaw predictors by topic areas
Building on their earlier work on the feasibility of automatically assigning maintenance templates to articles (review: "Predicting quality flaws in Wikipedia articles"), three German researchers investigate how an article's topic might inform the detection of text that needs to be tagged for quality problems. In this paper, they focus on maintenance tags for neutrality or style, cataloguing them into "94 template clusters representing 60 style flaws and 34 neutrality flaws". As an example of a maintenance tag that is restricted to certain topic areas, they cite "the template in-universe... which should only be applied to articles about fiction." Differing standards between different WikiProjects are named as another possible reason for "topic bias" in maintenance tags. To make their classification algorithm more aware of articles' topics when assigning maintenance templates, the researchers modify their previous approach by populating their "positive" and "negative" training sets with revision pairs from the same articles: the revision where a (human) Wikipedian first inserted a maintenance tag, and the later revision of the same article where the tag was removed (assuming that the corresponding flaw had indeed been eliminated by that time). To evaluate the success of this approach, the authors introduce the notion of a "category frequency vector" assigned to a set of articles (counting, for each category on Wikipedia, how many articles from this category are contained in the set). The cosine of the vectors of two article sets measures how similar their topics are. They find that "topics of articles in the positive training sets are highly similar to the topics of the corresponding reliable negative articles while they show little similarity to the articles in the random set. This implies that the systematic bias introduced by the topical restriction has largely been eradicated by our approach."
Sadly, this evaluation method does not seem to have yielded direct information about which quality flaws are prevalent in which topic areas. Apart from their own software, the researchers used the WikiHadoop software to analyze the entire revision history of the English Wikipedia, and the machine learning tool Weka to classify article text.
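The "category frequency vector" comparison described above can be illustrated with a small Python sketch. It follows only the definition given in this review (per-category article counts, compared by cosine). The function names, the article titles, and the category assignments are hypothetical toy data, not taken from the paper.

```python
import math
from collections import Counter


def category_frequency_vector(article_set, categories_of) -> Counter:
    """Count, for each category, how many articles of the set belong to it."""
    vector = Counter()
    for article in article_set:
        # set() ensures an article counts at most once per category.
        vector.update(set(categories_of.get(article, ())))
    return vector


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two category frequency vectors."""
    dot = sum(count * v[category] for category, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical category assignments for five articles.
categories_of = {
    "Article 1": ["Fiction", "Novels"],
    "Article 2": ["Fiction", "Films"],
    "Article 3": ["Fiction", "Novels"],
    "Article 4": ["Chemistry"],
    "Article 5": ["Chemistry", "Physics"],
}

positive = ["Article 1", "Article 2"]           # tagged revisions
reliable_negative = ["Article 2", "Article 3"]  # topically matched, tag removed
random_set = ["Article 4", "Article 5"]         # unrelated articles

pos_vec = category_frequency_vector(positive, categories_of)
neg_vec = category_frequency_vector(reliable_negative, categories_of)
rnd_vec = category_frequency_vector(random_set, categories_of)

print(cosine_similarity(pos_vec, neg_vec))
print(cosine_similarity(pos_vec, rnd_vec))
```

In this toy setup the topically matched sets yield a cosine near 1 and the unrelated set a cosine of 0, mirroring the pattern the authors report for their training sets.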

Briefly

 * High quality articles show benefit of WikiProjects, but low quality articles may be monitored too tightly: A bachelor's thesis in economics, submitted last year to the University of Victoria, compares design principles of common-pool resource (CPR) institutions - developed among others by the late Nobel prize winner Elinor Ostrom - to Wikipedia, with particular attention to WikiProjects. Based on a sample of 120 articles and their talk pages, the paper investigates the statistical relationship between several variables, such as quality ratings (by Wikipedians), average pageviews, number of watchers, the number of edits and reverts by both registered and anonymous users, and the number of WikiProjects that regard the article as within their remit (per the templates on the talk page). As the main result, the author reports "that Wikiprojects play a significant role in high quality articles. This may be due to Wikiprojects ability to facilitate the creation of stylistic guidelines. By doing so, they clarify what constitutes desirable editing activity thereby decreasing the costs of reaching consensus and excluding members that do not conform to their guidelines." He also indicates that WikiProjects "play an important role in impeding reductions in quality caused by anonymous contributors. However, the allocation of too much effort to monitoring and exclusion activity in lower quality pages may decrease the potential benefits provided by anonymous users editing activity. The results show a negative impact of registered user deletions on an articles quality which suggests this may be the case."
 * Wikipedia as "social machine": A paper presented at last month's WWW conference examines the development of Wikipedia as a "social machine", interpreting the relation between entities such as Nupedia, the Wikimedia Foundation, Jimmy Wales, Wikipedia article editors or developers in terms of actor–network theory. This serves as an example to illustrate a new model proposed by the authors, called "HTP" (for Heterogeneous networks, Translation, Phases).
 * Arabic corpus from Wikisource: Three researchers from Constantine, Algeria and Valencia, Spain describe a method for "Building Arabic Corpora from Wikisource" for the purpose of plagiarism detection. After discarding religious texts, poetry, legal texts and dictionaries, they arrive at a corpus of 1008 documents by 114 authors. They remark that "Wikisource is the only resource that clearly provides Arabic content (with the [criterion that documents are book-like, having only one defined author]) without copyright."
 * Automatic recommendation of news sources: A paper titled "Multi-step Classification Approaches to Cumulative Citation Recommendation" promises a method to recommend suitable news sources to Wikipedia editors.
 * A semi-automated categorization system for Wikipedia articles: A technical report from the School of Engineering and Applied Sciences at the University of Pennsylvania details a semi-automated categorization system that would let Wikipedia editors choose missing category links for articles, based on the current categorization of neighboring articles. The authors discuss the system implementation and an evaluation based on a sample of 100 articles. They conclude that the algorithm can work well only for articles that have already been categorized in at least 4 different classes.
 * Semantic similarity from the Wiki game: A poster presented at the 2013 edition of the World Wide Web conference (WWW'13) describes how relatedness can be computed from data of the Wiki game, a race game in which a reader, starting from a given entry, must reach a target entry in the fewest possible clicks (original rules here, xkcd comic).