Wikipedia:Wikipedia Signpost/2011-09-26/Recent research

What the most active female editors contribute
A paper addressing gender imbalance in Wikipedia ("Gender differences in Wikipedia editing") by Judd Antin and collaborators won the "Best Short Paper" award at WikiSym. This follows the awarding of "best full paper" to another study on the gender gap already covered in previous editions of the research newsletter. The study by Antin and collaborators sampled 256,190 users who created a new account on the English Wikipedia between September 2010 and February 2011 and qualitatively coded their contribution by category of wiki work. The results suggest that, whereas in the lower three quartiles by activity level men and women make roughly the same contributions in each category of wiki work, in the top quartile editors behave in a significantly different way. The researchers found that among the top 25% of Wikipedians by activity level:
 * only 27% of all revisions are made by women;
 * women tend to make larger revisions than men;
 * top female editors make significantly larger revisions than men in at least two categories: "adding new content" and "rephrasing existing text"

Effects of reverts on wiki work
Another WikiSym 2011 paper by GroupLens researchers, including Summer of Research fellow Aaron Halfaker ("Don’t bite the newbies: how reverts affect the quantity and quality of Wikipedia work"), reports on the effects of reverts on the quality and quantity of Wikipedia editors' work, with a specific focus on newbies. The study uses a number of key metrics to assess the quality of editor contributions (reverts per revision, and Persistent Word Revisions or PWR, which measures how long the words added by an editor, other than stop words, survive across subsequent revisions) and changes in editor activity (a controlled activity delta that calculates an editor's variation of activity across weeks with respect to the week preceding the revert, normalized by the editor's daily rate of activity). The results point to the important role of reverts as a learning and quality improvement process, but also to their negative effects on new contributors. These results are consistent with the findings by Summer of Research fellows on the effects of community interactions with new Wikipedians. Below are highlights from this study:
 * Compared with their activity prior to a revert, reverted editors decrease their activity by 0.1 standard deviations in the first week after the revert, whereas a control group of non-reverted editors increases its activity by about 0.3 standard deviations.
 * It matters who performs a revert: editors reverted by a registered editor do not recover to the average level of activity for at least one month, whereas editors reverted by anonymous users recover much faster.
 * Reverts affect the quality of one's work: reverted editors are less likely to be reverted in the future (particularly in the week after the revert), whereas the probability of being reverted in the control group keeps growing every week. Reverted editors are also less likely to make important changes to an article after being reverted, compared with the control group. However, the productivity of reverted editors in the following weeks increases more rapidly than that of non-reverted editors.
 * Reverts affect newbies more negatively. Experienced editors are less affected by reverts on their average activity while newbies are significantly less likely to continue editing after a revert than experienced editors.
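
As a rough illustration of the PWR idea described above, the following sketch counts, for one edit, how many later revisions each newly added non-stop word survives. It is a simplification under stated assumptions, not the paper's implementation: revisions are treated as bags of tokens rather than diffed text, and the function name, the window parameter and the stop-word list are hypothetical.

```python
from collections import Counter

# Illustrative stop-word list only; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def persistent_word_revisions(revisions, index, window=None):
    """Sum, over the words added in revisions[index], of the number of
    later revisions in which each added word is still present.

    revisions: list of token lists, oldest first (assumed input format).
    window: optionally limit how many later revisions are inspected.
    """
    before = Counter(revisions[index - 1]) if index > 0 else Counter()
    after = Counter(revisions[index])
    # Multiset difference: the words this edit introduced.
    added = after - before
    added = Counter(
        {w: n for w, n in added.items() if w.lower() not in STOP_WORDS}
    )

    later = revisions[index + 1:]
    if window is not None:
        later = later[:window]

    score = 0
    for revision in later:
        present = Counter(revision)
        # A word that disappears is dropped and never counted again,
        # even if someone re-adds it later.
        added = Counter(
            {w: min(n, present[w]) for w, n in added.items() if present[w] > 0}
        )
        score += sum(added.values())
    return score
```

For example, if an edit adds the words "small" and "mammals", and "small" is removed one revision later while "mammals" lasts for two revisions, the edit scores 1 + 2 = 3.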

Further Wikipedia coverage at WikiSym 2011: Social dynamics and global reach

 * The two papers on gender gap mentioned above will be presented in a session titled Understanding Wikipedia, along with other original works some of which were already reviewed in the research newsletter, such as a study by researchers from the University of Pennsylvania examining revision deletion in the English Wikipedia (see also the summary posted on AcaWiki).
 * A second research session will be devoted to Wikipedia as a global phenomenon. It will feature two papers focusing on Wikipedia's coverage of rapidly changing events such as the 2011 Egyptian revolution and the 2011 Tōhoku earthquake and tsunami. The analysis of Wikipedia coverage of the Egyptian revolution, by a team of Italian researchers based in Trento (from the same lab that released the WikiTrip visualization, previously covered in The Signpost), is available as a preprint.

Link spam research with controversial genesis but useful results
The "Wiki tools and interfaces" session at WikiSym will see the presentation of a paper titled "Autonomous link spam detection in purely collaborative environments". According to the five authors from the University of Pennsylvania, link spam is currently "an annoying, but non-pervasive issue", but could become a grave threat to Wikipedia if new spam techniques that were explored by some of them in another paper (see below) become more widespread.

Using the STiki software by one of the authors, which is already widely used as an anti-vandalism tool on the English Wikipedia, the researchers collected mainspace edits adding external links and extracted a corpus of 5,962 link additions classified as either ham or spam, using criteria such as whether the edit had been rolled back (to determine spam), or whether it had been added by a user with rollback rights (to determine ham). From this, the researchers derived numerous features that indicate link spamming behavior, in three areas: On-wiki evidence (including very simple metrics such as the URL's length – spam links tend to be shorter – or that older and more popular articles are more likely to be targeted), properties of the landing page that the link points to (these were found to be less useful), and classification from third-party sites, including Alexa and Google Safe Browsing. The backlinks data provided by Alexa proved to be most useful for the classifier that the authors went on to construct, and tested in a live implementation in the STiki tool. They conclude that "it is clear this work will benefit the Wikipedia community".

In another paper, presented earlier this month at CEAS '11, five authors from the same university, including two of the same researchers, examine the possibility of "Link spamming Wikipedia for profit". They picture spam detection on Wikipedia as a pipelined process, with the MediaWiki spam blacklist as the first stage (currently containing around 17,000 regular expressions), recent changes patrollers as the next (often aided by software tools, and often reacting within seconds after an edit), watchlisters as the third (within minutes to days), and finally review by normal readers as the last stage. Based on a spam/ham corpus constructed as in the other paper, this paper contains some further analysis of the characteristics of link spam destinations and spamming accounts, and of the exposure spammed links receive before they are removed (determined by both the link's lifespan and the popularity of the spammed page). The most sensitive part of the paper then leverages these results to "describe a novel and efficient spam model we estimate can significantly outperform status quo techniques", e.g. by rapidly adding links to exploit the time lag of Wikipedia's spam removal process, or targeting popular pages. In a nod to WP:BEANS, the researchers admit that "there is the possibility that we have introduced previously unknown vectors", but the "Ethical Considerations" section emphasizes that:
 * "It is in no way this research’s intention to facilitate damage to Wikipedia or any wiki host. The vulnerabilities discussed in this section have been disclosed to Wikipedia’s parent organization, the Wikimedia Foundation (WMF). Further, the WMF was notified regarding the publication schedule of this document and offered technical assistance."

The authors also point to the implementation of the spam mitigation tool described in the WikiSym article.

However, the paper fails to mention that last year, one of its authors conducted actual, extensive tests of spamming techniques on the English Wikipedia that are very similar to those outlined in the paper. The spam attacks gained the attention of several IT security news websites, and even involved setting up a fake webshop to measure how many Wikipedia readers would have carried out an actual purchase of the penis enlargement pills advertised in the links. The case led to the researcher's temporary ban as a Wikipedia user, later lifted by the arbitration committee, and informed the research guidelines drafted later that year by the Wikimedia Foundation's Research Committee. See Signpost coverage: "Large scale vandalism revealed to be 'study' by university researcher" (includes a background interview with the researcher).

How social ties influence admin votes
A paper by three researchers from the University of the Philippines Diliman, presented at the International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2011) two months ago, examined statistical relations between the voting behavior in requests for adminship (RfAs) and the on-wiki social contacts of participants. The paper includes a brief review of existing literature (in particular two papers which already studied the relation with existing social networks). Drawing from a January 2008 dump of the English Wikipedia, they analyzed 2,587 elections conducted between 2004 and 2008 (48% of them successful, with 7,231 users voting or running in at least one RfA, and 80% of the final non-neutral votes being supportive), and "1,097,223 instances of communication between 265,155 distinct pairs of users" who had run or voted in an RfA; from these user talk page messages, an undirected social graph was generated. Their results concern three areas:
 * "Factors that motivate participation": As a first result, the researchers found that the number of a user's contacts who already voted in an RfA, and (more strongly) whether the user had been in contact with the candidate, "contribute positively to the probability of a user’s participation in an election. This may be due to the fact that voters are inclined to support candidates with whom they are acquainted with."
 * As "Factors that influence voting" (i.e. the support/oppose decision) the authors considered the numbers of "support" and "oppose" votes that a user's contacts have already cast when the user votes, and whether the user had been in contact with the candidate before. All yielded regression coefficients with the expected sign (acquaintance with the candidate weighing positively), and the authors conclude that "we can already explain voting behavior by just examining the immediate neighborhood of a voter", but add that "it is interesting to note that the presence of contacts who have voted negatively weighs more heavily compared with those who voted positively."
 * Finally, the paper examined "Influential voters in the social network", by calculating various well-known social network metrics for both the "support" and "oppose" camps in each election ("degree, closeness centrality, betweenness centrality, authority, hub, PageRank, clustering coefficient, and eigenvector centrality", averaged over all voters in each camp, and combined into a weighted difference). Closeness, PageRank, and eigenvector centrality were found to have the largest regression coefficients in predicting the outcome of an RfA, suggesting to the authors "that decisions of influential nodes can affect the outcome of the RfA process. Although it was not studied in this paper, a possible explanation for this result is that influential users may sway other users to vote the same way and this aggregate behavior may have an impact on the result of the election".
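
To make the setup above more concrete, here is a minimal sketch of building an undirected contact graph from talk page messages and comparing the average centrality of the "support" and "oppose" camps in one election. It rests on several assumptions: the message format (sender, recipient pairs) and all function names are hypothetical, only degree and closeness are shown, and the camps are compared with a plain unweighted difference of means, since the exact weighting used in the paper is not described here.

```python
from collections import defaultdict, deque


def build_contact_graph(messages):
    """Build an undirected graph from (sender, recipient) message pairs."""
    graph = defaultdict(set)
    for sender, recipient in messages:
        if sender != recipient:
            graph[sender].add(recipient)
            graph[recipient].add(sender)
    return graph


def degree(graph, node):
    """Number of distinct contacts of a user."""
    return len(graph.get(node, ()))


def closeness(graph, node):
    """Closeness centrality within the node's connected component:
    (reachable nodes) / (sum of shortest-path distances to them)."""
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search over the undirected graph
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0


def camp_difference(graph, supporters, opposers, metric):
    """Mean metric of the support camp minus mean metric of the oppose camp."""
    def mean(users):
        return sum(metric(graph, u) for u in users) / len(users) if users else 0.0
    return mean(supporters) - mean(opposers)
```

A positive value from camp_difference would mean that, for the chosen metric, the supporters are on average more central in the contact graph than the opposers; values like this, one per metric and election, are the kind of predictor a regression on RfA outcomes could use.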

Wikipedians' weekends in international comparison
A paper titled "Temporal characterization of the requests to Wikipedia" examined how search requests, read accesses and edits on Wikipedia change over time, and how they relate to those across all Wikimedia sites (based on squid logs for the whole year of 2009, provided by the Wikimedia Foundation). Among the findings are differences between language versions of Wikipedia, such as that "the number of edits tends to raise in weekends" for the French, Japanese, Dutch and Polish Wikipedias, but not for other languages. Another paper, titled "Circadian patterns of Wikipedia editorial activity: A demographic analysis", similarly analyzed "34 Wikipedias in different languages [trying] to characterize and find the universalities and differences in temporal activity patterns of editors", with the underlying data provided by the German Wikimedia chapter from the toolserver. They found that "in contrast to diurnal [daily] pattern, which is universal to a great extent, weekly activity patterns of WPs show remarkable differences. We could, however, identify two main categories, namely 'weekends' and 'working days' active WPs."

In brief

 * Gender bias in Wikipedia and Britannica: An article by Joseph Reagle and Lauren Rhue titled "Gender bias in Wikipedia and Britannica" examines gender bias in biographical coverage, comparing the English Wikipedia and the Encyclopedia Britannica. The study suggests that "Wikipedia provides better coverage and longer articles, and that it typically has more articles on women than Britannica in absolute terms, but we also find that Wikipedia articles on women are more likely to be missing than are articles on men relative to Britannica". See the accompanying blog post with the full datasets used in this study.
 * Wikipedia as a potlatch: Spanish researcher Felipe Ortega compares Wikipedia to the potlatch, a traditional gift-giving ceremony whose participants gain status based on the generosity of their gifting, in this blog post summarizing his new book with Joaquín Rodríguez (“El Potlatch Digital: Wikipedia y el Triunfo del Procomún y el Conocimiento Compartido” ["The Digital Potlatch: Wikipedia and the Triumph of Commons and Shared Knowledge"], published in Spanish by Ediciones Cátedra). Drawing from new qualitative research (interviews with editors of the Spanish Wikipedia) as well as existing quantitative research, the book concludes that recognizing the gifts Wikipedians make, through meritocracy and explicit acknowledgement, helps motivate participation.
 * How medical students edit Wikipedia: A paper published last month by the Kansas Journal of Medicine asked "Are students able and willing to edit Wikipedia to learn components of evidence-based practice?" In 2007 and 2008, two groups of senior medical students at the University of Texas Health Science Center at San Antonio participated in an exercise where they were asked "to place succinct summaries of [medical] studies in Wikipedia" (after a four-hour introductory course on wikis). In a survey, 91% of them said that the project should be offered again in the next year, and 71% planned to edit Wikipedia again. (The authors caution that this group was self-selected.) The articles were examined two months after their edits, and 46% of the students had their contribution improved in some way, while "the pages edited by 62% of students had additional edits in response to incidental vandalism to the pages, but in no instance was the vandalism done to an edit by a student".
 * Ethnography of wikiculture set free: Joseph Reagle's 2010 book on the cultural dynamics of Wikipedia, Good Faith Collaboration, is now freely available to read online, having been released under an accommodating Creative Commons licence (CC BY-NC-SA 3.0).
 * Provenance for Wikipedia articles: A (closed access) doctoral dissertation defended at the University of Arizona presents a "domain ontology of provenance for Wikipedia based on the W7 model", building on the notion of the five Ws. The author applies this ontology to extract provenance for Wikipedia articles and to assess their quality, thereby identifying "several collaboration patterns that are preferable or detrimental for data quality".
 * Geographies of the World's Knowledge: As already mentioned in last week's Signpost, the floatingsheep collective, in collaboration with the Oxford Internet Institute, released a report titled "Geographies of the World's Knowledge" visualizing the temporal and geographical distribution of Wikipedia articles. Drawing from roughly 1.5 million articles in a 2010 database download, the report revealed among other findings that more articles had been written about Antarctica (7,800) than any South American or African nation, that the country with the most internet users (China) accounted for barely 1% of articles, that Wikipedia's biographical articles overwhelmingly geolocate to Western Europe and, from the 18th century on, North America, and that vastly more biographies per year were written for the 20th and particularly the 21st century compared to preceding time periods. The report is released under a Creative Commons BY-NC-ND license.
 * Wikipedia found to have grown until 2007: A paper by a sociology researcher from the University of York, titled "Measuring the Development of Wikipedia", explores the development of the number of edits and the number of participants on the English Wikipedia from 2002 to 2007 (curiously asserting that "there is only 6 years data"). As a first result, the research "reveals that the number of edits and the total number of participants both increased in Wikipedia from 2002 to 2007". The paper's most tangible contribution appears to consist of histograms plotting the number of users with a particular edit count in each of the years 2002 to 2007, which the author finds "are similar with the Pareto distribution in the shape, [and therefore] we assume that the participation situation in Wikipedia is one type of the Pareto Distribution". A large part of the four-page paper (available for $26) is devoted to general explanations of this distribution. It also mentions the need to use a statistical method such as maximum likelihood estimation to confirm the visual impression that the histograms follow the Pareto distribution, but it remains unclear if the author actually carried this out. Also, despite emphasizing several times the importance of determining the changes in the k parameter over the years (a measure of the inequality associated with the postulated Pareto distribution), and calling it "vital to model the participation situation in Wikipedia", the actual values are never given. The abstract promises "an equation to predict future development trend of Wikipedia", but it remains unclear to this reviewer which equation this refers to.
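
For readers wondering what the maximum likelihood step mentioned in the last item could look like, the sketch below estimates the Pareto shape parameter k from a list of per-user edit counts. It assumes a continuous Pareto (type I) distribution with a known minimum value x_min, for which the standard estimator is k = n / Σ ln(x_i / x_min); edit counts are discrete, so this is only an approximation, and the function name and input format are hypothetical rather than taken from the paper.

```python
import math


def pareto_shape_mle(edit_counts, x_min=1.0):
    """Maximum likelihood estimate of the Pareto (type I) shape parameter k.

    edit_counts: per-user edit counts for one year (assumed input format).
    x_min: assumed known lower bound; observations below it are ignored.
    """
    if x_min <= 0:
        raise ValueError("x_min must be positive")
    tail = [x for x in edit_counts if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        # Happens when there is no data above x_min (or all values equal it).
        raise ValueError("not enough spread in the data to estimate k")
    return len(tail) / log_sum
```

Running such an estimate separately on each year's edit counts would give the year-by-year k values whose absence is noted above; a smaller k corresponds to a heavier tail, i.e. a more unequal distribution of edits among users.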