Wikipedia:Wikipedia Signpost/2017-07-15/Recent research

Chilling effects: The impact of surveillance awareness on Wikipedia pageviews
A paper in the Berkeley Technology Law Journal finds that the traffic to privacy-sensitive articles on the English Wikipedia dropped significantly around June 2013, when the existence of the US government's PRISM online surveillance program was first revealed based on documents leaked by Edward Snowden. As stated by the author, Jon Penney, the study "is among the first to evidence—using either Wikipedia data or web traffic data more generally—how government surveillance and similar actions may impact online activities, including access to information and knowledge online." It received wide media attention upon its release, as already reported last year in the Signpost.

The paper is part of a growing body of literature that studies the effect of external events on Wikipedia pageviews (for another example, see our previous issue: "How does unemployment affect reading and editing Wikipedia? The impact of the Great Recession"). The 66-page paper stands out for its methodological diligence, devoting much space to explaining and justifying its data selection and statistical approach, and to checking the robustness of the results. The framework was adapted from an earlier MIT study that had similarly examined the effect of the Snowden revelations on Google search traffic for sensitive terms, finding a statistically significant reduction of 5%. The author emphasizes the higher quality of the Wikipedia data: "unlike Google Trends, the Wikimedia Foundation provides a wealth of data on key elements of its site, including article traffic data, which can provide a more accurate picture as to any impact or chilling effects identified."

To generate a list of Wikipedia articles that could be considered privacy-sensitive in the context of US government surveillance, the author used a (publicly available) set of terms that the Department of Homeland Security (DHS) specifies as related to terrorism. The corresponding Wikipedia articles (48 altogether) include dirty bomb, suicide attack, nuclear enrichment (a redirect) and eco-terrorism. To verify the assumption that Internet users indeed consider these topics privacy-sensitive, the author surveyed 415 Mechanical Turk users, asking them to rate each topic, for example on whether they would be likely to delete their browser history after accessing information about it.

To examine the impact on traffic, the paper uses the time series of monthly pageviews for the 48 articles (81 million views altogether, from January 2012 to August 2014), divided into the periods before and after the June 2013 "exogenous shock". As a first finding, the author notes that average monthly views are lower in the "after" period, but points out that such comparisons (which, for example, form part of the difference in differences approach in the paper on unemployment mentioned above) are too simplistic to show an actual effect, because the difference could merely reflect an overall declining traffic trend. (Although not stated directly in the paper, such a decline did indeed occur: the study is based only on desktop pageviews, which have been gradually replaced by mobile views in recent years. The Wikimedia Foundation makes combined mobile/desktop pageview datasets available going back to 2015.)
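The caveat about simple before/after averages can be illustrated with a small sketch. The numbers below are synthetic and are not taken from the paper; the break index is an assumption that counts January 2012 as month 0, which places June 2013 at month 17. The series declines steadily and contains no jump at all, yet the "after" average still comes out clearly lower.

```python
from statistics import mean

BREAK = 17  # assumption: index of June 2013 when January 2012 is month 0

# Synthetic monthly views (not the paper's data): a steady decline of
# 10 views per month over 32 months, with no discontinuity at the break.
views = [1000 - 10 * t for t in range(32)]

before = mean(views[:BREAK])  # average of months 0..16
after = mean(views[BREAK:])   # average of months 17..31

# Relative difference between the two averages (about 17% here), even
# though nothing happened at the break other than the ongoing trend.
drop = (before - after) / before
print(f"before={before}, after={after}, drop={drop:.1%}")
```

A comparison of averages alone therefore cannot separate a sudden change from a trend that was already under way.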

The author then turns to a more sophisticated statistical method known as interrupted time series analysis (ITS). It involves a "segmented regression analysis": linear trend lines are calculated separately for the timespans before and after June 2013, providing information both on the slope (rate of growth or decrease) within each period and on the size of the gap (if any) between the two segments at the point of interruption. This method indicates "an immediate drop-off of over 30% of overall views" following the June 2013 revelations. To further exclude the possibility that the results for these terrorism-related articles "may simply reflect overall Wikipedia article view traffic trends", an analogous ITS analysis is conducted for the pageviews to all Wikipedia articles. The author points out the importance of the results for the Wikimedia Foundation's current lawsuit that challenges the constitutionality of the NSA surveillance of Internet traffic.
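The segmented regression idea described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the paper's actual model or code: the helper names `fit_line` and `segmented_fit`, the break index and the series itself are all assumptions made for the example, and the paper's analysis includes robustness checks that are not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def segmented_fit(views, break_index):
    """Fit separate trend lines before and after the interruption.

    The level change is the gap between the two fitted lines, both
    evaluated at the first month after the interruption.
    """
    months = list(range(len(views)))
    pre_b, pre_m = fit_line(months[:break_index], views[:break_index])
    post_b, post_m = fit_line(months[break_index:], views[break_index:])
    expected = pre_b + pre_m * break_index    # pre-trend extrapolated
    observed = post_b + post_m * break_index  # post-trend at the same month
    return {
        "pre_slope": pre_m,
        "post_slope": post_m,
        "level_change": observed - expected,
        "relative_change": (observed - expected) / expected,
    }


# Synthetic monthly views (NOT the paper's data): growth of 10 views per
# month, then a 30% drop at the break followed by a mild decline.
BREAK = 17  # assumption: index of June 2013 when January 2012 is month 0
views = [1000 + 10 * t for t in range(BREAK)]
views += [0.7 * (1000 + 10 * BREAK) - 5 * (t - BREAK) for t in range(BREAK, 32)]

result = segmented_fit(views, BREAK)
print(result)
```

Unlike a comparison of averages, this approach reports the change in level and the change in slope separately, so a pre-existing trend is not mistaken for a sudden drop.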

See also our review of a recent qualitative study that examined the privacy concerns of editors: "Privacy, anonymity, and perceived risk in open collaboration: a study of Tor users and Wikipedians"

Conferences and events
See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines.

Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.
 * "Using Wikipedia page views to explore the cultural importance of global reptiles" From the abstract: "We analysed all page views of reptile species viewed during 2014 in all of Wikipedia's language editions. We compared species' page view numbers across languages and in relationship to their spatial distribution, phylogeny, threat status and various other biological attributes. We found that the three species with most page views are shared across major language editions, beyond these, page view ranks of species tend to be specific to particular language editions. Interest within a language is mostly focused on reptiles found in the regions where the language is spoken. Overall, interest is greater for reptiles that are venomous, endangered, widely distributed, larger and that have been described earlier." (See also university news release and Wiki Edu blog post)
 * "Gender Gap in Wikipedia Editing: A Cross Language Comparison" From the abstract and conclusions section: "This study is guided by two research questions: RQ1: What is the percentage of users who set their gender in different language editions of Wikipedia? RQ2: Among those who express gender, what percentages comprise female and male contributors? [... We] compared gender across 289 language editions of Wikipedia. [...] We conclude that the differences in the amount [sic] of users expressing their gender can be explained by the differences in the interfaces, both the visibility of gender and the incentive to express it, especially during the process of the new user-profile creation [... The] gender gap is not just present in the English Wikipedia but it is diffused across all language editions of Wikipedia. However, there are notable differences: in some Wikipedias (Slovenian, Estonian, Lithuanian) the percentage of women is close to 40 percent, in others (Bengali, Hindi) it is around 4 percent, while on the English Wikipedia, the chosen baseline given its international nature reaches 17 percent. Notably, languages whose editions of Wikipedia have larger shares of women tend to be spoken in countries with a larger participation of women in science." (See also these general notes on the data source underlying the paper)
 * "Research on Wikipedia Vandalism: a brief literature review" From the abstract: "This paper performs a literature review on the subject, with the goal of identifying the main research topics and approaches, methods and techniques used. Results showed that the authorship of three-quarters of papers are from Computer Science researchers. Main topic is the detection of vandalism, although there is a increasing interest about content quality. The most commonly used technique is machine learning, based on feature analysis. It draws attention to the lack of research on information behavior of vandals."
 * "Wikipedia and participatory culture: Why fans edit" From the abstract: "Building on previous research, I argue that fans want to take part in the production of the media that they enjoy, that Wikipedia allows editors to create their own paratext (i.e., the Wikipedia article) in relation to a main text (e.g., a movie, a television show, a book series), and that this paratext may be heavily used by the general public. Such usage is a form of implicit approval that affirms the editors' knowledge and encourages them to make more edits. Thus, Wikipedia validates the fan editor's work in a way that other outlets for participatory culture (e.g., fan fiction, fan art, songwriting) cannot."
 * "WikInfoboxer: A Tool to Create Wikipedia Infoboxes Using DBpedia" From the abstract: "... we present WikInfoboxer, a tool to help Wikipedia editors to create rich and accurate infoboxes. WikInfoboxer computes attributes that might be interesting for an article and suggests possible values for them after analyzing similar articles from DBpedia. To make the process easier for editors, WikInfoboxer presents this information in a friendly user interface." (See also a related Wikimedia grant application)
 * "Answering End-User Questions, Queries and Searches on Wikipedia and its History" From the abstract: "...we describe and compare two user-friendly systems that seek to make the universal knowledge of Web KBs [knowledge bases] available to users who neither know SPARQL, nor the internals of the KBs. ... the SWiPE ["Search Wikipedia by example"] system provides a wysiwyg interface that lets users specify powerful queries on the Infoboxes of Wikipedia pages in a query-by-example fashion." (See also our earlier related coverage: "Searching by example", "Wikipedia Search Isn’t Necessarily Third BESt")
 * "Cultural Differences in the Understanding of History on Wikipedia" From the abstract: "This paper sheds light on cultural differences in the understanding of historical military events between Chinese, English, French, German, and Swedish Wikipedia language editions. [...] We identified the most important historical events, mined cross-cultural relations, investigated word usage in war-related pages and performed network, complexity, and sentiment analysis. [...] Our findings suggest that World War I and World War II are the most important historical events within English, French, and German cultures and English Wikipedia contains more violence and war-related content, with a higher level of complexity than other language editions."
 * "Predicting Importance of Historical Persons Using Wikipedia" From the abstract: "Based on the two well-known lists of the most important people in the last millennium, we look closely into factors that determine significance of historical persons. We predict person's importance using six classifiers equipped with features derived from link structure, visit logs and article content."
 * "Semantic Stability in Wikipedia" From the abstract: "In this paper we assess the semantic stability of Wikipedia by investigating the dynamics of Wikipedia articles’ revisions over time. In a semantically stable system, articles are infrequently edited, whereas in unstable systems, article content changes more frequently. In other words, in a stable system, the Wikipedia community has reached consensus on the majority of articles. [...] Our experimental results reveal that [...] there are differences on the velocity of the semantic stability process between small and large Wikipedia editions. Small editions exhibit faster and higher semantic stability than large ones. In particular, in large Wikipedia editions, a higher number of successive revisions is needed in order to reach a certain semantic stability level, whereas, in small Wikipedia editions, the number of needed successive revisions is much lower for the same level of semantic stability."
 * "The Citizen IS the Journalist: Automatically Extracting News from the Swarm" From the abstract: "... we describe SwarmPulse, a system that extracts news by combing through Wikipedia and Twitter to extract newsworthy items. We measured the accuracy of SwarmPulse comparing it against the Reuters and CNN RSS feeds and the Google News feed. We found precision of 83 % and recall of 15 % against these sources."
 * "DePP: A System for Detecting Pages to Protect in Wikipedia" From the abstract: "In this paper we consider for the first time the problem of deciding whether a page should be protected or not in a collaborative environment such as Wikipedia. We formulate the problem as a binary classification task and propose a novel set of features to decide which pages to protect based on (i) users page revision behavior and (ii) page categories. We tested our system, called DePP, on a new dataset we built consisting of 13.6K pages (half protected and half unprotected) and 1.9M edits. Experimental results show that DePP reaches 93.24% classification accuracy and significantly improves over baselines."
 * "Bring on Board New Enthusiasts! A Case Study of Impact of Wikipedia Art + Feminism Edit-A-Thon Events on Newcomers" From the abstract: "...our results shows that overall face-to-face edit-a-thons are very successful in attracting and recruiting a large number of newcomers who are more engaged than a random group of newcomers on Wikipedia; however, still a very small percentage of them stay engaged with Wikipedia after the event."