Wikipedia:Wikipedia Signpost/2014-05-28/Recent research

"Wikipedia in the eyes of its beholders: A systematic review of scholarly research on Wikipedia readers and readership"
This paper is another major literature review of the field of Wikipedia studies, brought forward by the authors whose prior work on this topic, titled "The People’s Encyclopedia Under the Gaze of the Sages", was reviewed in this research report in 2012 ("A systematic review of the Wikipedia literature").

This time the authors focus on a fragment of the larger body of works about Wikipedia, analyzing 99 works published up to June 2011 on the theme of "Wikipedia readership" – in other words, on the question "What do we know about people who read Wikipedia?" The overview focuses less on demographic analysis (since little research has been done in that area), and more on perceptions of Wikipedia by surveyed groups of readers. Their findings include, among other things, a conclusion that "Studies have found that articles generally related to entertainment and sexuality top the list, covering over 40% of visits", and that, among more serious topics, Wikipedia is a common source for health and legal information. They also find that "a very large number of academic in fact have quite positive, if nuanced, perceptions of Wikipedia’s value." They further observe that the most commonly studied group has been students, who offer a convenience sample. The authors finish by identifying a number of contradictory findings and topics in need of further research, and conclude that existing studies have likely overestimated the extent to which Wikipedia's readers are cautious about the site's credibility. Finally, the authors offer valuable thoughts in the "implications for the Wikipedia community" section, such as suggesting "incorporating one or more of the algorithms for computational estimation of the reliability of Wikipedia articles that have been developed to help address credibility concerns", similar to the WikiTrust tool.

The authors also published a similar literature review paper summarizing research about the content of Wikipedia, which we hope to cover in the next issue of this research report.

Chinese-language time-zones favor Asian pop and IT topics on Wikipedia
A paper presented at the WWW 2014 Companion Conference analyzes the readership patterns of the English and Chinese Wikipedias, with a focus on which types of articles are most popular in the English- or Chinese-language time zones. The authors used all Wikipedia pages which existed under the same name in both languages in the period from 1 June 2012 to 14 October 2012 for their study, coding them through the OpenCalais semantic analysis service with an estimated 2.6% error rate.

The authors find that readers of the English and Chinese Wikipedias from time zones of high Chinese activity browse different categories of pages. Chinese readers more often visit English Wikipedia pages about Asian culture (in particular, Japanese and Korean pop culture), as well as pages about mobile communications and networking technologies. The authors also find that pages in English are almost ten times as popular as those in Chinese (though their results do not identify users by nationality directly, relying instead on time-zone analysis).

In this reviewer's opinion, the study suffers from major methodological problems that are serious enough to cast all the findings in doubt. Apparently because the authors were unaware of Interlanguage links and considered only articles which have the same name (URL) in both the English and Chinese Wikipedias, they find that only 7603 pages were eligible to be analyzed (as they had both an English and a Chinese version). However, the Chinese Wikipedia had approximately half a million articles in the studied period, and while many don't have English equivalents yet, to expect that less than 2% did seems rather dubious. Similarly, our own WikiProject China estimates that English Wikipedia has almost 50,000 China-related articles. Given that WikiProject assessments often underestimate the number of relevant topics, and usually don't cover many core topics, this suggests that the study missed the vast majority of articles that exist in both languages. It is further unclear how English- and Chinese-language time zones were operationalized. The authors do not reveal how, if at all, they controlled for the fact that readers of English Wikipedia can also come from countries where English is not a native language, and that there are hundreds of millions of people outside China who live in the five time zones that span China, which overlap with India, half of Russia, Korea and major parts of Southeast Asia. As such, the findings of that study can be more broadly interpreted as "readership patterns of English and Chinese Wikipedia in Asia and the world, regarding a small subset of pages that exist on both English and Chinese Wikipedia."
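
The gap between same-title matching and interlanguage-link matching can be illustrated with a minimal sketch. The dictionary below is hypothetical toy data (not taken from the paper), and the function names are invented for illustration; only the figures 7603 and roughly 500,000 come from the review above.

```python
# Hypothetical toy data: English title -> Chinese title, as an
# interlanguage link would record the correspondence.
interlanguage_links = {
    "China": "中国",
    "Beijing": "北京市",
    "Unicode": "Unicode",
    "HTML": "HTML",
}


def same_title_pairs(links):
    """Return only the pairs whose titles are identical in both languages."""
    return {en: zh for en, zh in links.items() if en == zh}


def coverage(matched, total):
    """Fraction of a wiki's articles covered by the matched set."""
    return matched / total


# Same-title matching finds 2 of the 4 linked pairs in the toy data.
matched = same_title_pairs(interlanguage_links)
print(sorted(matched))

# Figures from the review: 7603 eligible pages versus roughly
# half a million Chinese Wikipedia articles.
print(f"{coverage(7603, 500_000):.1%}")
```

In the toy data, titles written in Chinese characters never match their English counterparts, which is the kind of loss the reviewer suspects behind the coverage figure of about 1.5%.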

"Bipartite editing prediction in Wikipedia"

 * Reviewed by Maximilianklein (talk)

Bipartite Editing Prediction in Wikipedia is a paper wherein the authors aim to solve what they call the "link prediction problem". Essentially they aim to answer "which editors will edit which articles in the future." They claim the social utility of this is to suggest to users articles that they might edit. So in some ways this is a similar function to SuggestBot, but using different techniques.

Their approach here is to use bipartite network modelling. A bipartite network is a network with two node types, here editors and articles. Bipartite network modelling is becoming increasingly trendy, as in Jesus (2009) and Klein (2014).

Explaining their method, the researchers outline their two approaches: "supervised learning" and "community awareness". In the supervised learning approach the machine learning features used are Association Rule, K-nearest neighbor, and graph partitions. All these features, they state, can be inferred directly from the bipartite network. In the community awareness approach, the Stanford Network Analysis Project tool is used to cut the network into co-editor sets, and the authors then go on to inspect what they call indirect features: sum of neighbors, Jaccard coefficient, preferential attachment, and Adamic–Adar score.
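
The indirect features named above have standard definitions in the link prediction literature, and a minimal sketch may help readers unfamiliar with them. The review does not give the paper's exact formulas, so the version below is an assumption: it scores a pair of editors by the articles they have both edited. The toy data and the helper names `editors_of` and `pair_features` are hypothetical.

```python
import math

# Hypothetical toy data: editor -> set of articles that editor has edited.
edits = {
    "alice": {"Pittsburgh", "Steel", "Bridge"},
    "bob": {"Pittsburgh", "Steel", "Jazz"},
    "carol": {"Jazz"},
}


def editors_of(edits):
    """Invert the mapping: article -> set of editors who edited it."""
    inverse = {}
    for editor, articles in edits.items():
        for article in articles:
            inverse.setdefault(article, set()).add(editor)
    return inverse


def pair_features(edits, u, v):
    """Neighborhood features for two distinct editors u and v."""
    nu, nv = edits[u], edits[v]
    common = nu & nv
    union = nu | nv
    article_editors = editors_of(edits)
    return {
        # Total number of articles touched by either editor, counted per editor.
        "sum_of_neighbors": len(nu) + len(nv),
        # Shared articles relative to all articles either editor touched.
        "jaccard": len(common) / len(union) if union else 0.0,
        # Product of the two editors' article counts.
        "preferential_attachment": len(nu) * len(nv),
        # Shared articles weighted down when many editors work on them.
        # Each shared article has at least two editors (u and v), so log > 0.
        "adamic_adar": sum(
            1 / math.log(len(article_editors[z])) for z in common
        ),
    }


features = pair_features(edits, "alice", "bob")
print(features)
```

Scores like these would typically be fed to a classifier as input features; whether the paper does exactly this is not stated in the review.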

The authors proceed to give a table of their results, and highlight their highest precision and recall statistics, which are moderate and contained in the interval [.6, .8]. Thereafter a short, non-interpretive, one-paragraph discussion concludes the paper, saying that these results might be useful. Unfortunately they are not of much use: while the authors declare their sample size of 460,000 editor–article pairs from a category in a Wikipedia dump, they don't specify which category, or even which Wikipedia they are working on.

This machine learning paper lacks sufficient context or interpretation to be immediately valuable, despite the fact that the authors may be able to predict with close to 80% F-measure which article you might edit next. The paper is therefore a good example of the extent to which Wikipedia is used for research without even a feigned attempt to make the research useful to the Wikipedia community, or even to frame it in that way.
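
For readers unfamiliar with the metric, the F-measure combines precision and recall into one number. The sketch below uses the standard F-beta formula; the input values are hypothetical endpoints of the [.6, .8] interval mentioned above, not figures taken from the paper's table.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta score; beta=1 gives the harmonic mean (F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical inputs at the ends of the reported interval.
print(round(f_measure(0.6, 0.8), 3))
print(round(f_measure(0.8, 0.8), 3))
```

Because F1 is a harmonic mean, it always lies between the precision and the recall, so an F-measure near 80% requires both to be near the top of that interval.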

Briefly

 * "Increasing the discoverability of digital collections using Wikipedia: the Pitt experience": In this paper, a librarian at the University of Pittsburgh discusses how two undergraduate interns added over 100 links to library collections to Wikipedia articles, which led to increased use of the library's digitized collections. An experienced Wikipedian, Sage Ross, provided help with this project. The two undergraduates expanded or created approximately 100 articles, mainly related to the History of Pittsburgh (such as Pittsburgh Courier or Pittsburgh Playhouse), using resources hosted by the university's libraries as sources or external links. The paper also provides a valuable overview of similar initiatives in the past (some of which have also been covered in this research report, see e.g.: "Using Wikipedia to drive traffic to library collections"). The majority of reviewed examples suggest that linking library resources from Wikipedia pages increases their visibility, and this study reached the same conclusion with regard to its own project, which led both to the improvement of Wikipedia content and to more traffic to the digital resources hosted by the library. This reviewer applauds this project as a model one, though it would benefit from a list of all articles edited by the students (which were not tagged on their talk pages with any expected template, such as educational assignment).
 * Korean survey on "Key Factors for Success" of Wikipedia and Q&A site: This paper compares aspects of Wikipedia and the "Naver Knowledge" service of South Korea's Naver (see Knowledge Search), which is similar to Google Questions and Answers. This is a topic of some interest, as South Korea is praised for being one of the most Internet-integrated societies in the world, while at the same time the Korean Wikipedia, currently ranked 23rd largest, is less developed than those of a number of smaller countries less commonly seen as Internet powers (consider List of Wikipedias by size). The researchers surveyed 132 Korean Internet users of those services, though they do not make it clear if all members of the sample were in fact registered contributors to both services, instead describing them as "relative active users of the CI [collective intelligence] system". Unfortunately, parts of the paper, including the survey questions, appear to have been translated using machine translation, and are thus difficult to interpret correctly. Overall, the authors find that there were no significant differences with regard to the respondents' views of the Naver Knowledge and Wikipedia services. One of the statistically significant results suggests that Korean contributors to collective intelligence services find the Naver Knowledge service easier to use than Wikipedia, though the differences do not appear to be major (73.5% and 60.9% of Korean contributors found Naver Knowledge and Wikipedia easy to work with, respectively). One of the conclusions of the paper is the importance of making user interfaces as easy as possible, and making it easier for users to add and edit audiovisual content (though the authors seem unaware of the Visual Editor and do not discuss it).
 * "Citation filtered": This glossy and infographic-laden report dissects the 963 Persian Wikipedia articles that are blocked in Iran. The technique used was to programmatically iterate over Wikipedia to see which articles could not be loaded. The report categorizes the articles into 10 topics and uses them to explore the Iranian government's sensitivities. From the Annenberg School of Communication, University of Pennsylvania blog. (Maximilianklein (talk))
 * "Georeferencing Wikipedia documents using data from social media sources": This paper describes several methods to automatically assign geocoordinates to articles on the English Wikipedia: by matching the article text to hashtags of georeferenced tweets, to tags of georeferenced photos on Flickr, and to the text of other Wikipedia articles that are already georeferenced. The authors report that "using a language model trained using 376K Wikipedia documents, we obtain a median error of 4.17 km, while a model trained using 32M Flickr photos yields a median error of 2.5 km. When combining both models, the median error is further reduced to 2.16 km. Repeating the same experiment with 16M tweets as the only training data results in a median error of 35.81 km". As one possible application, the authors suggest automatic correction of coordinates for Wikipedia articles where their method predicts a differing location with high confidence. Among their test dataset of 21,839 articles with a geocoordinate located in the United Kingdom, the authors found three such errors, one of which was still uncorrected at the time of their preprint publication (an educational institution in Brussels which had been placed in Cornwall due to a sign error in the longitude coordinate). Another interesting byproduct is a visual comparison (figure 5) of the density of geolocated entries from Wikipedia, Twitter and Flickr in Africa (per the datasets used).
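
The "median error" figures quoted in the georeferencing item above are distances between predicted and true coordinates. The review does not say which distance formula the authors used, so the sketch below assumes the common haversine great-circle distance. The Brussels coordinates are approximate city-centre values chosen for illustration, not the location of the institution in question, and the function names are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(predicted, actual):
    """Median distance between paired predicted and actual coordinates."""
    return median(haversine_km(*p, *a) for p, a in zip(predicted, actual))


# Approximate Brussels coordinates (illustrative), and the same point
# with the sign of the longitude flipped.
brussels = (50.85, 4.35)
flipped = (brussels[0], -brussels[1])
sign_error_km = haversine_km(*brussels, *flipped)
print(round(sign_error_km))
```

A longitude sign flip at this latitude moves the point several hundred kilometres west, into south-west England, which is far larger than the median errors reported and therefore easy for such a method to flag.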

Other recent publications
A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.
 * "Snuggle: Designing for efficient socialization and ideological critique"
 * "Preferences in Wikipedia abstracts: Empirical findings and implications for automatic entity summarization"
 * "Cluster approach to the efficient use of multimedia resources in information warfare in wikimedia" (from the abstract: "A new approach to uploading files in Wikimedia is proposed with the aim to enhance the impact of multimedia resources used for information warfare in Wikimedia.")
 * "From open-source software to Wikipedia: ‘Backgrounding’ trust by collective monitoring and reputation tracking" (from the abstract: "It is shown that communities of open-source software—continue to—rely mainly on hierarchy (reserving write-access for higher echelons), which substitutes (the need for) trust. Encyclopedic communities, though, largely avoid this solution. In the particular case of Wikipedia, which is confronted with persistent vandalism, another arrangement has been pioneered instead. Trust (i.e. full write-access) is ‘backgrounded’ by means of a permanent mobilization of Wikipedians to monitor incoming edits. ... Finally it is argued that the Wikipedian monitoring of new edits, especially by its heavy reliance on computational tools, raises a number of moral questions that need to be answered urgently.")