Wikipedia:Wikipedia Signpost/2015-01-28/Recent research

Bot detects theatre play scripts on the web and writes Wikipedia articles about them
A paper presented at the International Conference on Pattern Recognition last year (earlier poster) presents an automated method to improve Wikipedia's coverage of theatre plays ("only about 10% of the plays in our dataset have corresponding Wikipedia pages"). The method searches the web for playscripts and related documents, and extracts key information from them: the play's main characters, relevant sentences from online synopses of the play, and mentions in Google Books and the Google News archive (the latter to help ensure that the play satisfies Wikipedia's notability criteria). It then compiles this information into an automatically generated Wikipedia article. Two of the 15 articles submitted as a result of this method were accepted by Wikipedia editors. The first, Chitra by Rabindranath Tagore, underwent significant changes by other editors after the initial bot-created submission ("the final page reflects some of the improvements we can incorporate in our bot"). The second, Fourteen by Alice Gerstenberg, "was moved into Wikipedia mainspace with minimal changes. All the references, quotes and paragraphs were retained".

"Renaissance Editors" create better Wikipedia content
A study of the German Wikipedia examines the diversity of editor contributions across its 8 "main categories" and shows a relationship between that diversity and article quality. The authors first define an editor's "interest profile" – the proportion of bytes contributed to each category. They then propose an entropy measure that rewards an interest profile for being spread across more categories – a polymath style.
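Such an entropy-based diversity score can be sketched roughly as follows (a minimal illustration only – the category names and byte counts are invented, and the paper's exact normalisation may differ):

```python
import math

def interest_profile(bytes_by_category):
    """Turn per-category byte counts into proportions (an editor's interest profile)."""
    total = sum(bytes_by_category.values())
    return {cat: b / total for cat, b in bytes_by_category.items()}

def diversity(profile):
    """Shannon entropy of an interest profile, in bits.

    Higher values mean contributions are spread more evenly across
    more categories – the "polymath" style rewarded by the measure.
    """
    return -sum(p * math.log2(p) for p in profile.values() if p > 0)

# A specialist concentrates bytes in one category; a polymath spreads them out.
specialist = interest_profile({"Science": 9000, "History": 500, "Arts": 500})
polymath = interest_profile({"Science": 3000, "History": 3500, "Arts": 3500})
```

An editor contributing equally to all 8 main categories would reach the maximum score of 3 bits, while a single-category specialist scores 0.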

The authors show a correlation between the average diversity of an article's contributors and its quality, where quality is determined by whether the article is a "Good Article", a "Featured Article", or neither. Total productivity, measured in bytes contributed, is also linked to diversity, though the link falls marginally short of statistical significance. Finally, a logistic regression shows that diversity, more than productivity, significantly determines article quality.

Despite several simplifications (e.g. a single language, naive article quality ratings, overly broad categories), the researchers' methods are well-defined, clear, and convincing within a limited scope, and put a finger on the notion that our most lauded editors tend to range all over Wikipedia.

Briefly

 * In-depth examination of the history of three featured articles on the Swedish Wikipedia, and their main editors: This paper looks at collaboration on the Swedish Wikipedia via a qualitative analysis of three Featured Articles. Information is pulled into the articles from a variety of sources, including other language Wikipedias, and curated by editors. The qualitative study found that the articles' growth followed a similar trajectory and that both content- and process-oriented editors contributed, in what the author calls a process of 'intercreation'.
 * "Contropedia" tool identifies controversial issues within articles: This paper discusses the formation of a new method for identifying and examining controversial issues within Wikipedia articles. The paper outlines the development of an algorithm used to identify the most contested topics via an analysis of the edits surrounding wikilinks. The resulting Contropedia tool (already presented at WikiSym 2014 ) provides an excellent visual presentation of hot button issues in a given article. The authors note that the tool has the potential to be of use to researchers interested in studying the evolution of controversial issues over time in an article, as well as affording Wikipedians insight into potential sites of controversy.
 * "Building Multilingual Language Resources in Web Localisation: A Crowdsourcing Approach": This volume on natural-language processing was semi-recently published, several chapters of which are about wikis, confirming their value for NLP research. Some results are still of some use.
 * "Micro-crowdsourcing" and shared translation memory are proposed as solutions to localising the web: in fact, both were already implemented by the Translate extension; respectively on translatewiki.net in 2009 and on Wikimedia wikis in 2012, years ahead of the researchers' idea.
 * A Basque Wikipedia machine translation experiment that we had not previously covered is reported (see Wikimania 2010). Translating 100 articles was enough to improve a machine translation system by 10%, an encouraging result for Wikimedia's Content translation project.
 * A survey paper lists some tools for use on wiki talk pages, including an active, freely licensed spellchecker. The rest was either (en|simple).wiki-specific or superseded by official client lists and Wikimedia Labs at the time of publication.
 * Italian linguists developed a CC-BY-SA dictionary based on Tullio De Mauro's, which they describe as "very close to Wiktionary" but with two differences in their platform: "senses (and their relationships) are first-class citizens [...] a rich interactive and WYSIWYG Web interface that is tailored to linguistic content."

Other recent publications
A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.
 * "The dynamic nature of conflict in Wikipedia" From the abstract: "With a small number of simple ingredients, our model mimics several interesting features of real human behaviour, namely in the context of edit wars. We show that the level of conflict is determined by a tolerance parameter, which measures the editors' capability to accept different opinions and to change their own opinion."
 * "Comprehensive Wikipedia Monitoring for Global and Realtime Natural Disaster Detection" (slides)
 * "Digital doorway: Gaining library users through Wikipedia" (about Template:Library resources box)
 * "Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History" From the abstract: "Hedera exploits Map-Reduce paradigm to achieve rapid extraction, it is able to handle one entire Wikipedia articles’ revision history within a day in a medium-scale cluster, and supports flexible data structures for various kinds of semantic web study."
 * "Learning to Identify Historical Figures for Timeline Creation from Wikipedia Articles"
 * "WiiCluster: A Platform for Wikipedia Infobox Generation"
 * "Proceed With Extreme Caution: Citation to Wikipedia in Light of Contributor Demographics and Content Policies"
 * "Wikipedia: helping to promote the art and science of civil engineering"