Wikipedia:Wikipedia Signpost/2012-06-25/Recent research

Dynamics of edit wars
"Dynamics of Conflicts in Wikipedia" analyzes patterns of edit warring over Wikipedia articles and develops an interesting "measure of controversiality" – something that might be of interest to editors at large if it were a more widely popularized and dynamically updated statistic. The authors conclude that edit warriors are usually willing to reach consensus, and that the rare cases of never-ending warring are those that continually attract new editors who have not yet joined the consensus.

The authors' decision to exclude from the study articles with under 100 edits, on the grounds that they are "evidently conflict-free", is questionable: articles with fewer than 100 edits have seen clear, if short-lived, edit warring. A recent example is Concerns and controversies related to UEFA Euro 2012. It is also unfortunate that "memory effects" – a term mentioned only in the abstract and lead, and which the authors suggest is significant in understanding the conflict dynamics – is never explained in the paper. The term "memory", by itself, appears four times in the body, but is not operationalized anywhere.

A press release accompanied the paper, entitled "Wikipedia 'edit wars' show dynamics of conflict emergence and resolution". An MSNBC tech news headline misleadingly, but sensationally, summarized it as "Wikipedia is editorial warzone, says study".

Who deletes Wikipedia?
In a recent blog post by Wibidata, an analytics startup based in San Francisco, the authors set out to shed light on the often-quoted claim that most of Wikipedia was written by a small number of editors, noting other editorial patterns along the way. Using the entire revision history of the English Wikipedia (they wanted to show that their platform can scale), the authors looked at the distribution of edits across editor cohorts, grouped by total number of edits. They found that, from a pure count perspective, the most active 1% of editors had contributed over 50% of the total edits. (see original plot here)

In response to the suggestion that the strongly skewed distribution of edits might just be due to a core set of editors who primarily make minor formatting modifications, they looked at the net number of characters contributed by each editor. Grouping editors by total number of edits as before, they found an even more strongly skewed distribution, with the top 1% contributing well over 100% of the total number of characters on Wikipedia (i.e. an amount of text larger than the current Wikipedia) and the bottom 95% of editors deleting more on average than they contributed (original plot). Next, the authors separated logged-in users from non-logged-in "users" (identified only by IP addresses) and recomputed the distribution of net character contributions. Within each edit-count cohort, logged-in users tended to contribute significantly more than their anonymous counterparts, and non-logged-in users tended to delete significantly more (original plot).
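The cohort computation described above can be sketched in a few lines, assuming a simplified revision stream of (editor, old length, new length) tuples – hypothetical illustration data; the actual post computed this over the full revision history on Wibidata's platform:

```python
# Sketch of the cohort analysis: count edits and net characters per
# editor, then compare the most active fraction against the total.
# The revision tuples below are hypothetical illustration data.
from collections import defaultdict

def cohort_net_chars(revisions, top_fraction=0.01):
    """Return (net characters by the top cohort, total net characters)."""
    edits = defaultdict(int)
    net = defaultdict(int)
    for editor, old_len, new_len in revisions:
        edits[editor] += 1
        net[editor] += new_len - old_len  # negative when text is deleted
    # Rank editors by edit count and take the most active fraction.
    ranked = sorted(edits, key=edits.get, reverse=True)
    top = ranked[:max(1, int(len(ranked) * top_fraction))]
    return sum(net[e] for e in top), sum(net.values())

revisions = [
    ("core", 0, 500), ("core", 500, 1200), ("core", 1200, 1900),
    ("drive_by", 1900, 1850),  # a small net deletion
]
top_net, total_net = cohort_net_chars(revisions)
```

In this toy example the most active editor's net contribution (1900 characters) exceeds the total surviving text (1850 characters), mirroring the post's finding that the top cohort's contribution can exceed 100% of the current Wikipedia.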

In summary, low-activity and new editors, along with anonymous users, tend to delete more than they contribute; this reinforces the notion that Wikipedia is largely the product of a small number of core editors.

Evaluating and predicting interlingual links in Wikipedia
Published in the proceedings of *SEM, a computational semantics conference, researchers from the University of North Texas and Ohio University looked into the nature of interlingual links on Wikipedia, both reviewing the quality of existing links and exploring possibilities for automatic link discovery. The researchers took the directed graph of interlingual links on Wikipedia and used the lens of set-theoretic operations both to structure an evaluation of existing links and to build a system for automatic link creation. For example, they suggest that the properties of symmetry and transitivity should hold for the relation of interlingual linking: if there is an interlingual link from language A to B, there should also be a link from B to A; and if there is a link from language A to B and from language B to C, then there should be a link from language A to C. (This assumption is routinely made by the many existing interwiki bots.) They further refine the notion of transitivity by grouping article pairs by the number of transitive 'hops' required to connect a candidate article pair.
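The symmetry and transitivity checks can be illustrated as a small graph-closure step, assuming interlingual links are stored as a set of (source, target) article pairs – a hypothetical representation; the paper works on the full link graph:

```python
# Sketch: derive implied interlingual links by symmetry and one-hop
# transitivity, then report those not yet present in the link set.
def implied_links(links):
    implied = set()
    for a, b in links:
        implied.add((b, a))  # symmetry: A -> B implies B -> A
    for a, b in links:
        for b2, c in links:
            if b == b2 and a != c:
                implied.add((a, c))  # transitivity: A -> B -> C implies A -> C
    return implied - links

links = {("en:Moon", "de:Mond"), ("de:Mond", "fr:Lune")}
missing = implied_links(links)
# missing holds the two reverse links plus the one-hop link en:Moon -> fr:Lune
```

An interwiki bot applying these assumptions would propose creating exactly the links in `missing`; the paper's contribution is to measure how often such implied links actually are faithful translations.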

Their methodology revolves around the creation of a sizeable annotated gold data set. Using these labels, they first evaluated the quality of existing links, finding that between one third and one half of them fail their criteria for legitimate translations. They then evaluated the quality of various implied links. For example, reverse links, where they do not already exist, satisfy their criteria for faithful translation only 68% of the time.

The gold data set was used to train a boosted decision-tree classifier for selecting good candidate pairs of articles. They used various network topology features to encode the information in interlingual links for a given topic and found that they can significantly beat the baseline, which uses only the presence of direct links (73.97% compared with 69.35% accuracy).

"Wikipedia Academy" preview
Various conference papers and posters from the upcoming "Wikipedia Academy" (hosted by the German Wikimedia chapter from June 29 to July 1 in Berlin) are already available online. A brief overview of those which are presenting new research about Wikipedia:


 * Inline tags more effective than tag boxes: "On the Evolution of Quality Flaws and the Effectiveness of Cleanup Tags in the English Wikipedia" shows "that inline tags are more effective than tag boxes" in tagging article flaws so that they get remedied. The researchers also "reveal five cleanup tags that have not been used at all, and 15 cleanup tags that have been used less than once per year", recommending their deletion, and "ten cleanup tags that have been used, but the tagged flaws have never been fixed." Similar to a paper reviewed in the April issue of this report ("One in four of articles tagged as flawed, most often for verifiability issues"), they find that "the majority (71.62%) of the tagged articles have been tagged with a flaw that belongs to the flaw type Verifiability".
 * A paper titled "The Power of Wikipedia: Legitimacy and Territorial Control" is "based on the experience of the projects WikiAfrica (2006-2012) and Share Your Knowledge (2011-2012)", and looks at various aspects of Wikipedia, Wikimedia chapters and the Foundation through the lens of "anthropological, african and post-colonial studies."
 * "Individual and Cultural Memories on Wikipedia and Wikia, Comparative Analysis" looks at the coverage of the late British DJ John Peel on Wikipedia and Wikia, respectively, as well as the Wikipedia article about the 1980s.
 * An "Extended Abstract", "Latent Barriers in Wiki-based Collaborative Writing", compares the collaborative process of "25 special-purpose wikis" (most of them hosted by Wikia) with that of the German Wikipedia. One observation of the work in progress is a "strong divide between extracts of Wikipedia (even if being reduced to single articles and their one-link neighborhoods) on the one hand and special purpose wikis on the other."
 * Two Brazilian authors will examine "the climate change controversy through 15 articles of Portuguese Wikipedia". The paper contains various quantitative results about the edit history of these articles, some of them unsurprising ("A very strong positive correlation (0.994) was found between the number of edits and the number of editors of an article"). Using the framework of actor–network theory, the authors conclude that "the collaborative encyclopedia is enrolled as an ally for the mainstream science and becomes one of its spokespersons."
 * Historical infobox data: An article by four authors from Google Switzerland and the Spanish National University of Distance Education (UNED) observes that "much research has been devoted to automatically building lexical resources, taxonomies, parallel corpora and structured knowledge from [Wikipedia]", often using the structured data present in infoboxes (which they say are present in "roughly half" of English Wikipedia articles). However, this research has so far used only snapshots representing the state of articles at a particular point in time, whereas their project set out to extract "a wealth of historical information about the last decade ... encoded in its revision history." The resulting 5.5GB dataset, called "Wikipedia Historical Attributes Data (WHAD)", will be made freely available for download.
 * Better authorship detection, and measuring inequality: Two researchers from the University of Karlsruhe will present an algorithm to detect which user wrote which part of a Wikipedia article. Similar to a new revert-detection algorithm presented in a recent paper co-authored by one of the present authors (see last month's issue: "New algorithm provides better revert detection"), one crucial part of the algorithm is to split the article's wikitext into paragraphs, analyzing them separately under the assumption "that most edits (if they are not vandalistic) change only a very minor part of an article’s content". Another part is calculating the cosine similarity of sentences that are not exactly identical. In the authors' own test, the new algorithm performed significantly better than the widely used WikiTrust/WikiPraise tool. Having determined the list of authors for an article revision and the size of each author's contribution, they then define a Gini coefficient "as an inequality measure of authorship" (roughly, an article written by a single author will have coefficient 1, while one with equal contributions by a multitude of editors will have coefficient 0). They implement a tool called "WIKIGINI" to plot this coefficient over an article's history, and show a few examples to demonstrate that it "may help to spot crucial events in the past evolution of an article". The paper starts out from the assumption "that the concentration of words to just a few authors can be an indicator for a lack of quality and/or neutrality in an article", but it does not (yet) contain a systematic attempt to correlate the Gini coefficient with existing measures of article quality.
 * Troll research compared: A paper by a German Wikipedian titled "Here be Trolls: Motives, mechanisms and mythology of othering in the German Wikipedia community" examines four academic texts about online trolls (only one of them in the context of Wikipedia), which "were compared regarding their scope, their theoretical approach, their methods and their findings concerning trolls and trolling."
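The inequality measure used in the authorship paper can be sketched with the standard Gini formula over per-author contribution sizes (the attribution algorithm itself is not reproduced here). Note that the textbook formula below reaches (n − 1)/n – approaching 1 – when one author among n wrote everything, matching the paper's convention in the limit:

```python
# Sketch: Gini coefficient over per-author contribution sizes (e.g. in
# characters). 0 means perfectly equal contributions; values near 1
# mean a single author dominates.
def gini(contributions):
    xs = sorted(contributions)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

equal = gini([250, 250, 250, 250])  # -> 0.0
dominated = gini([0, 0, 0, 1000])   # -> 0.75, i.e. (4 - 1) / 4
```

Plotting this value for each revision of an article, as the authors' WIKIGINI tool does, shows concentration of authorship rising or falling over the article's history.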


Posters:
 * "Self-organization and emergence in peer production: editing 'Biographies of living persons' in Portuguese Wikipedia"
 * "Biographical articles on Serbian Wikipedia and application of the extraction information on them"
 * "Wikipedia article namespace – user interface now and a rhizomatic alternative"
 * "Extensive Survey to Readers and Writers of Catalan Wikipedia: Use, Promotion, Perception and Motivation"

Researcher Felipe Ortega blogged about a new parser for Wikipedia dumps, to be integrated into "WikiDAT (Wikipedia Data Analysis Toolkit) ... a new integrated framework to facilitate the analysis of Wikipedia data using Python, MySQL and R. Following the pragmatic paradigm 'avoid reinventing the wheel', WikiDAT integrates some of the most efficient approaches for Wikipedia data analysis found in libre software code up to now", which will be featured in a workshop at the conference.

Special issue of "Digithum" on Wikipedia research
The open-access journal "Digithum" (subtitled "The Humanities in the Digital Era") has published a special issue containing five papers about Wikipedia from various disciplines, with a multilingual emphasis (including research about non-English Wikipedias, and Catalan and Spanish versions of the papers alongside the English versions):


 * Are articles about companies too negative?: A paper titled "Wikipedia’s Role in Reputation Management: An Analysis of the Best and Worst Companies in the United States" looked at the English Wikipedia articles about the ten companies with the best and the ten with the worst reputations according to the "Harris Reputation Quotient", a 2010 online survey about "perceptions for 60 of the most visible companies in America". Those 20 articles were coded, sentence by sentence, as positive, negative or neutral, and according to other "reputation attributes". Among the findings was that "the companies with the worst reputations had more negative content; they had, in fact, almost double the amount of negative content, although only slightly less positive content. Both types of companies had more negative than positive content. This indicates that even if a company is considered to have a good reputation, it is still very vulnerable to having its dirty laundry aired on Wikipedia." Another observation was that "emotional appeal is an attribute where both types of companies lacked content. It was rare for companies to have content about trust or feeling good, which only existed for the best companies" (an interesting question may be whether this is related to Wikipedia guidelines such as WP:PEACOCK). The paper appears at a time when many PR industry professionals in the US and UK argue that Wikipedia should allow them more control over the articles about their clients, and ends by highlighting the "importance of public relations professionals monitoring and requesting updates to Wikipedia articles about their companies". This conclusion resembles that of another recent study by one of the authors (DiStaso), which likewise concerned company articles and drew a somewhat controversial conclusion about their accuracy (see the April issue of this research report: Wikipedia in the eyes of PR professionals).
 * WordNets from Wikipedia: The second paper describes "the state of the art in the use of Wikipedia for natural language processing tasks", including the researchers' own application of Wikipedia to build WordNet databases in Catalan and Spanish.
 * The Wikimedia movement as "wikimediasphere": The article "Panorama of the wikimediasphere" gives an overview of the Wikimedia movement, proposing the term "wikimediasphere" to describe it, and explaining "the role of the communities of editors of each project and their autonomy with respect to each other and to the Wikimedia Foundation", which is seen as "the principal supplier of the technological infrastructure and also the principal instrument for obtaining economic and organisational resources". Its vision statement is presented as a summary of the aim that is "the ideological glue that binds all the players involved". The section about "the social and institutional dimension" of the sphere briefly covers the Foundation's governance and funding models, Wikimedia chapters and other recognized supporting organizations, and the various wikis and other online platforms that structure "the organisational activity": The Foundation wiki, Meta-wiki, Strategy wiki, Outreach wiki, the Wikimedia blog and the blogs of community members aggregated on Planet Wikimedia, mailing lists etc. Authored by a Wikimedian who is a member of both the Spanish chapter and the Catalan "Friends of Wikipedia" association, the paper is remarkably well-informed and up-to-date, e.g. incorporating the Board resolution on "Recognized Models of Affiliations" from the beginning of April, and various other recent events such as the English Wikipedia's SOPA/PIPA blackout. The abstract uses the term "WikiProjects" in a different sense from that common among English-speaking Wikimedians, possibly a translation error.
 * Truth and NPOV: The fourth article, by Nathaniel Tkacz (one of the organizers of the "Critical Point of View"/CPOV initiative that organized three conferences about Wikipedia in 2010, see Signpost interview), sets out to "show that Wikipedia has in fact two distinct relations to truth: one which is well known and forms the basis of existing popular and scholarly commentaries, and another which refers to equally well-known aspects of Wikipedia, but has not been understood in terms of truth. I demonstrate Wikipedia's dual relation to truth through a close analysis of the Neutral Point of View core content policy (and one of the project's 'Five Pillars')."
 * Wiki Loves Monuments: A paper titled "Wiki Loves Monuments 2011: the experience in Spain and reflections regarding the diffusion of cultural heritage", written by five Spanish Wikimedians, gives a concise overview of the photo contest as it played out in Spain last year.

Briefly

 * Who was notable in London in the 1960s?: A Master's thesis in Computer Science describes "A tool for extracting and indexing spatio-temporal information from biographical articles in Wikipedia". The tool, named "Kivrin" after a time-travelling character from a science fiction novel, is available online, and grew out of an earlier, simpler one that searches for articles about plants and animals living at a particular geographical place ("Flora & Fauna Finder"). The author remarks that "the data is skewed, like Wikipedia itself, towards the U.S. and Western Europe and relatively recent history". A search for the 1960s in London brings up several Beatles-related biographies near the top. While the tool does seem to cover languages other than English (e.g. text from the Hungarian entry on Gottlob Frege appears in the search results for Jena, the German town), searches for Hungarian or other non-English place names (e.g. Moszkva and Москва, the Hungarian and Russian names of Moscow) yielded no results. Disambiguation is attempted by way of geocodes but is far from robust – the search results for Halle, Saxony-Anhalt actually contain multiple entries referring to Halle, North Rhine-Westphalia.
 * How did people in Europe feel in the 1940s?: As described in a post on the New York Times' "Bits" blog, Kalev Leetaru from the University of Illinois conducted a sentiment analysis of statements on Wikipedia connected to a particular space and time, and made the result into a video: "The Sentiment of the World Throughout History Through Wikipedia".
 * One third of the average Wikipedia consists of interwiki links: According to an analysis by Denny Vrandečić, head of the Wikidata development team, "on average, 33% of a Wikipedia is language links. In total, there are 240 million of them, 5GB" (making up 5.3% of the overall text across all languages). The ratio tends to be higher on smaller Wikipedias.
 * "Central" users produce higher quality: A preprint by two Dublin-based researchers attempts "Assessing the Quality of Wikipedia Pages Using Edit Longevity and Contributor Centrality". The former rests on the assumption that contributions which survive many subsequent edits tend to be of higher quality, and "measures the quality of an article by aggregating the edit longevity of all its author contributions". The second approach considers either the coauthorship network (the bipartite graph of users and the articles they have edited, used in many recent papers to grasp Wikipedia's collaboration processes) or the user talk page (UTP) network, where two Wikipedians are connected if one has edited the other's talk page. It is assumed that a user's "centrality" in one of these networks is a measure of "contributor authoritativeness". These quality measures are then evaluated on 9290 history-related Wikipedia articles against the manual quality ratings from WikiProject History. "The results suggest that it is useful to take into account the contributor authoritativeness (i.e., the centrality metrics of the contributors in the Wikipedia networks) when assessing the information quality of Wikipedia content. The implication is that articles with significant contributions from authoritative contributors are likely to be of high quality, and that high-quality articles generally involve more communication and interaction between contributors."
 * Familiarity breeds trust: A bachelor's thesis at the University of Twente had 40 college students assess the trustworthiness of articles from the English Wikipedia after searching for a piece of information located either at the top or near the bottom of the article. The hypothesis that the longer search in the latter case might affect the trustworthiness rating was rejected by the results, but it was found (consistent with other research) that "Trust was higher in articles with a familiar topic, rather than with unfamiliar topics".
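The edit-longevity idea from the Dublin preprint can be sketched as follows, assuming each contribution is summarized by its size and a boolean record of whether it survived each subsequent revision – a simplification of the preprint's actual longevity formula:

```python
# Sketch: size-weighted aggregation of edit longevity, where longevity
# is the fraction of subsequent revisions that preserved an edit.
def edit_longevity(survivals):
    if not survivals:
        return 0.0
    return sum(survivals) / len(survivals)

def article_longevity_score(edits):
    """edits: list of (size_in_chars, survival_booleans) pairs."""
    total = sum(size for size, _ in edits)
    if total == 0:
        return 0.0
    return sum(size * edit_longevity(s) for size, s in edits) / total

# A durable 100-character edit and a promptly reverted 300-character one:
score = article_longevity_score([(100, [True, True]), (300, [False, False])])
```

Under this sketch, an article dominated by quickly reverted contributions scores low, consistent with the preprint's premise that long-surviving contributions signal quality.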