Wikipedia:Wikipedia Signpost/2013-04-29/Recent research

Too good to be true? Detecting COI, Attacks and Neutrality using Sentiment Analysis
Finn Årup Nielsen, Michael Etter and Lars Kai Hansen presented a technical report on an online service which they created to conduct real-time monitoring of Wikipedia articles about companies. It performs sentiment analysis of edits, filtered by company and editor. Sentiment analysis is a new applied linguistics technology which is being used in a number of tasks ranging from author profiling to detecting fake reviews on online retail sites. The visualization provided by this tool makes it easy to spot deviations from linguistic neutrality. However, as the authors point out, this analysis only gives a robust picture when used statistically and is more prone to mistakes when operating within a limited scope.

The service monitors recent changes using an IRC stream and detects company-related articles from a small hand-built list. It then retrieves the current version using the MediaWiki API and performs sentiment analysis using the AFINN sentiment-annotated word list. The project was developed by integrating a number of open source components such as NLTK and CouchDB. Unfortunately, the source code has not been made available and the service can only run queries on the shortlisted companies, which will limit the impact of this report on future Wikipedia research. However, it seems to have potential as a tool for detecting COI edits that tip neutrality by adding excess praise, as well as attacks that tip the content in the other direction. We hope the researchers will open-source this tool, as they did with their prior work on the AFINN dataset, or at least provide some UI to query articles not included in the original research.
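To make the word-list approach concrete, here is a minimal sketch of lexicon-based scoring in Python. It is not the authors' code, which is unpublished. The tiny `TOY_LEXICON` is an illustrative stand-in rather than an excerpt from AFINN, and the function names and the idea of comparing two revisions are assumptions made for this example.

```python
import re

# Illustrative stand-in for a sentiment-annotated word list. AFINN assigns
# integer valences between -5 and +5; these entries and values are made up.
TOY_LEXICON = {
    "excellent": 3,
    "innovative": 2,
    "poor": -2,
    "scandal": -3,
    "fraud": -4,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Return the mean valence per word; words not in the lexicon count as 0."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(lexicon.get(word, 0) for word in words) / len(words)


def edit_sentiment_delta(old_text, new_text, lexicon=TOY_LEXICON):
    """Hypothetical helper: change in mean valence between two revisions."""
    return sentiment_score(new_text, lexicon) - sentiment_score(old_text, lexicon)
```

A positive delta would suggest that an edit added praise, a negative one that it added criticism. As the report's authors caution, such scores are only meaningful in aggregate, not for a single edit.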

"A Comparative Study of Academic impact and Wikipedia Ranking"
A paper with this title investigates the relation between the scientific reputation of scientific items (authors, papers, and keywords) and the impact of the same items on Wikipedia articles. The sample of scientific items consists of the entries in the ACM digital library, including more than 100 k papers, 150 k authors and 35 k keywords. However, only a tiny subset of these could be found in English Wikipedia pages (the authors considered all Wikipedia pages in the English edition which contain at least two mentions of any of the scientific items in the sample). The academic reputation is calculated based on three criteria: frequency of appearance, number of citations each item receives from the others, and PageRank calculated on the citation network. The Wikipedia ranking is based on three popularity measures of all the pages that have mentioned the item: number of mentions, sum over PageRank of all the mentioning pages, and sum over in-degrees of all the mentioning pages in Wikipedia's hyperlink network.

These 3 × 3 choices give 9 combinations of academic ranking and Wikipedia ranking for each of the 3 types of scientific entities (authors, papers, keywords). All 27 resulting pairs are shown to be correlated according to Spearman's rank correlation, indicating that in general Wikipedia mentions are non-randomly driven by scientific reputation. However, for most of the combinations the correlation is less significant. Surprisingly, the most relevant Wikipedia ranking criterion turns out to be the plain total number of mentions, rather than the more sophisticated ones, i.e., the PageRank and in-degree measures.
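For readers unfamiliar with Spearman's rank correlation, the following pure-Python sketch shows how it is computed: both lists are converted to ranks (ties share their average rank) and the Pearson correlation of the ranks is taken. The function names are chosen for this example, and any numbers passed in are up to the reader; nothing here reproduces the paper's data.

```python
def average_ranks(values):
    """Return 1-based ranks for values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(citation_counts, mention_counts)` on two aligned lists (hypothetical variable names) yields a value between -1 and 1, where values near 1 mean that items ranked highly by one measure also tend to rank highly by the other.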

In a separate part, the authors define two sets of scientific items: those which are mentioned in Wikipedia, and those which are not mentioned at all (the latter is larger in size by a factor of 2 for keywords, 100 for authors, and 300 for papers). They show that for all 3 types, the set of items which are mentioned in Wikipedia has a better academic rank on average.

1970s UNESCO debate applied to Wikipedia's systemic bias in the case of Cambodia
An article in the Journal of the American Society for Information Science and Technology rated the quality of Wikipedia articles on the history of Cambodia (defined as those linked in the corresponding navbox), using four measures: 1) the article's ratio of citations to words, 2) the number of editors who have commented on its talk page, 3) the quality of the cited sources, rated in five categories ("traditional reference" like print encyclopedias, "news reports" including both newspapers and news websites such as CNN, "academic periodicals", "books", and "miscellany" like reports by governments or NGOs, or personal websites) and 4) "the number of unique authors cited", assuming that articles which are based on a larger variety of perspectives are of higher quality. The findings are summarized as follows: "The early history of Cambodia is represented by an extremely weak article, but there is an improvement in the articles dealing with the early kingdoms of Cambodia. The improvement ends abruptly with articles on the 'dark age' of Cambodia, the French Protectorate, the Japanese occupation, and early postindependence periods being of a much lower quality. Afterward, the quality picks up again with especially good articles on the American intervention in Cambodia, the Cambodian-Vietnamese War, and the People's Republic of Kampuchea. However, the quality does not last; as we near contemporary times, the articles take another turn for the worse." From this, the author concludes that "the Wikipedia community is unconsciously mimicking the general historiography of the country", in particular a glorification of Angkor and other early kingdoms at the cost of later periods, and observes a "continuing dominance of the traditional historiographical narrative of Cambodian history in Wikipedia."

The subsequent section of the paper tries to put these results into the context of the historical debates in the late 1970s and early 1980s about the New World Information and Communication Order (NWICO), a suggested remedy for problems with the under-representation of the developing world in the media, put forth by a UNESCO commission in the MacBride report (1980): "Wikipedia provides access—it is free to use by anyone with an Internet connection, and print versions can also be distributed. But the whole thrust of the NWICO argument is that content matters and those who create content matter perhaps even more, with the commission stressing that countries needed to 'achieve self-reliance in communication capacities and policies' ... Contrary to popular belief, in the new 'information age' content is, once again, the preserve of the few, not the many, and a geographically concentrated few at that." The author's argument is somewhat weakened by asserting erroneously that "there exists no Cambodian-language Wikipedia", but generally aligns with other quantitative research that has found a geographic unevenness of coverage in Wikipedia. The author is an information studies professor at Singapore's Nanyang Technological University and previously published a related paper in the same journal examining the Wikipedia article History of the Philippines, reviewed in the August issue: "The limits of amateur NPOV history".

Reasons why wikilinks are added and removed
Julia Preusse, Jerome Kunegis, Matthias Thimm, Thomas Gottron and Steffen Staab investigate mechanisms of changes in a wiki that are of a structural nature, i.e., which are a direct result of the wiki's linking structure. They consider whether the addition and removal of internal links between pages can be predicted using just information about the network connecting these articles. The study's innovation lies in considering the removal of links, which account for a high proportion of removals and reverts. The authors performed an empirical study on Wikipedia, stating that traditional indicators of structural change used in the link analysis literature can be classified into four classes, which indicate growth, decay, stability and instability of links. These methods were then employed to identify the underlying reasons for individual additions and removals of knowledge links.

The network created by links between articles in Wikipedia is characterized by preferential attachment. Prior work on social networks has identified a phenomenon called "liability of newness", in which new connections are more likely to be broken than older ones. To provide a better predictive model of link evolution, the team considered five hypotheses:
 * 1) Preferential attachment: The number of adjacent nodes is a good indicator for link addition.
 * 2) Embedding: The embeddedness of a link is suitable to predict the appearance of links and the non-disappearance of existing links.
 * 3) Reciprocity: The presence of a link makes the addition of a link in the opposite direction more likely and the removal of a reciprocal link less likely.
 * 4) Liability of Newness: Old age of an edge or a node is a good indicator for link persistence.
 * 5) Instability: The less stable two nodes are, the less stable the link connecting them is, or would be if it does not yet exist.

To test these hypotheses, they created networks based on the history of the mainspace articles until 2011 of the top five Wikipedias after the English one. For example, in the French Wikipedia, 41.7 million links were added and 17.3 million removed during that time. The data was used to create a link creation predictor and a link removal predictor. These were then evaluated using the area under the receiver operating characteristic curve.
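The preferential attachment hypothesis and the evaluation by area under the ROC curve (AUC) can be illustrated with a small Python sketch. It uses the product of node degrees, a standard preferential attachment score in the link prediction literature; whether the paper uses exactly this formula is an assumption. For simplicity, links are treated as undirected when counting degrees, and the function names are chosen for this example.

```python
from collections import Counter


def node_degrees(edges):
    """Count neighbours per node, treating links as undirected for simplicity."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def preferential_attachment_score(degree, u, v):
    """Score a candidate link (u, v) by the product of the two node degrees."""
    return degree.get(u, 0) * degree.get(v, 0)


def auc(positive_scores, negative_scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example outscores a random negative one (ties count as half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

In such a setup, the positives would be candidate links that were actually added after a given snapshot of the network and the negatives would be candidates that were not. An AUC of 0.5 corresponds to random guessing, while values closer to 1 mean the indicator ranks future links above non-links.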

The results were that Preferential attachment and Embedding are good indicators of growth. Liability of Newness did not turn out to be a good indicator of link removal, but more of article instability. Reciprocity is also an indicator of growth, but is not as significant since most links in a wiki are not reciprocated.

Generation Z judges "Generation Z", questioning role of amphetamines
An article in the Journal of Information Science, titled "Understanding trust formation in digital information sources: The case of Wikipedia", explores the criteria used by students to evaluate the credibility of Wikipedia articles. It contains an overview of various earlier studies about credibility judgments of Wikipedia articles (some of them reviewed previously in this space, for example: "Quality of featured articles doesn't always impress readers").

The authors asked "20 second-year undergraduate students and 30 Master’s students" in information studies to first spend 20 minutes reading "a copy of a two-page Wikipedia article on Generation Z, a topic with which students were expected to have some familiarity", and to answer an open-ended question explaining how they would judge its trustworthiness. In a subsequent part, the respondents were asked to rank a list of factors for trustworthiness in the case of "either (a) the topic of an assignment, or (b) a minor medical condition from which they were suffering". One of the first findings was a "low pre-disposition to use [Wikipedia], possibly suggesting a propensity to distrust, grounded on debates and comments on the trustworthiness of Wikipedia" – possibly due to the fact that the example article contained an example of vandalism, a fact highlighted by several respondents (e.g. "started off as a valid entry ... due to citations strengthening this ... however came to the last paragraph and the whole document was marred by the insert of 'writing articles on Wikipedia while on amphetamines' [as purported hobby of Generation Z members]... just feels that you can't trust anything now").

Among the given trustworthiness factors, the following were ranked most highly: "authorship, currency, references, expert recommendation and triangulation/verification, with usefulness just below this threshold. In other words, participants valued having articles that were written by experts on the subject, that were up to date, and that they perceived to be useful (content factors). ... Interestingly these factors all seemed more or less equally important for both contexts, with the exception of references, which for predictable reasons were seen as having greater importance in the context of assignments."

Visualizing the "flow of ideas" on Wikiversity
In a conference paper titled "Analyzing the flow of ideas and profiles of contributors in an open learning community" (see also audience notes from the presentation), the authors construct a graph from the set of revisions of a set of Wikiversity pages, with two kinds of edges: 1) "Update edges", linking a page's revision to the directly subsequent revision. These are understood as representing "knowledge flow over the course of the collaborative process on a single wiki page". 2) "Hyperlink edges" between two revisions of different pages with a wikilink between them, but pointing in the opposite direction, because the idea is that they indicate knowledge flowing from the linked page to the linking page. By defining the source node of a hyperlink edge "as the latest revision of the hyperlinked page at the moment of creation of the target revision", both kinds of links point forward in time, resulting in a two-relational directed acyclic graph (DAG), which is "depicting the knowledge flow over time." The authors then filter out "redundant" hyperlink edges and attach authorship information to each node (page revision).
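The graph construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the revision record format (dicts with `page`, `rev_id`, `time` and `links` keys) and the function name are hypothetical, timestamps of revisions are assumed to be distinct, and the filtering of "redundant" hyperlink edges is left out because the summary does not define it.

```python
from bisect import bisect_left
from collections import defaultdict


def build_revision_graph(revisions):
    """Build update edges and hyperlink edges from a list of revision records.

    Each record is a dict with keys "page", "rev_id", "time" and "links"
    (the set of page titles linked from that revision); this format is an
    assumption of the sketch. Both edge lists hold (source, target) pairs
    of revision ids and always point forward in time.
    """
    by_page = defaultdict(list)
    for rev in sorted(revisions, key=lambda r: r["time"]):
        by_page[rev["page"]].append(rev)

    # Update edges: each revision points to the next revision of the same page.
    update_edges = []
    for revs in by_page.values():
        for earlier, later in zip(revs, revs[1:]):
            update_edges.append((earlier["rev_id"], later["rev_id"]))

    # Hyperlink edges: from the latest revision of the linked page that
    # existed strictly before the linking revision, to the linking revision.
    times = {page: [r["time"] for r in revs] for page, revs in by_page.items()}
    hyperlink_edges = []
    for rev in revisions:
        for linked in sorted(rev["links"]):
            if linked == rev["page"] or linked not in by_page:
                continue
            index = bisect_left(times[linked], rev["time"]) - 1
            if index >= 0:
                source = by_page[linked][index]
                hyperlink_edges.append((source["rev_id"], rev["rev_id"]))
    return update_edges, hyperlink_edges
```

Because every edge leads from an earlier revision to a strictly later one, the resulting graph cannot contain cycles, which is the property the paper relies on when it describes the structure as a DAG.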

The authors apply this procedure to a set of Wikiversity articles in the area of medicine, starting with v:Gynecological History Taking. The results are interpreted as follows: "the beginning, short after the category medicine was founded, the authors in this category built up the basic structure of the knowledge domain. The main relations and idea flows between the learning materials were established early in the development of the domain. After that the authors have been focusing on elaborating the articles without introducing new important hyperlinks. The overall picture of the learning process in this domain suggests a divergent evolution of ideas after an initial period of mutual fertilization between different topics. This conforms to the idea of groups of learners that followed different interests in the medicine domain with little inter-group collaboration on the creation of new shared learning resource." The method is subsequently applied to profile the activities of various users.

The authors have integrated these algorithms, including visualization tools, into a "network analytics workbench ... used in the ongoing EU project SISOB which aims to measure the influence of science on society based on the analysis of (social) networks of researchers and created artifacts."

In brief

 * Wikipedia Vs. Encyclopedia Britannica: A Longitudinal Analysis: The authors review how Wikipedia and Britannica coverage of topics related to several major corporations has changed over the past six years. They find, unsurprisingly, that Wikipedia coverage is usually much more detailed than that of Britannica; more interestingly, they note that one of the key differences is that Wikipedia focuses more on issues such as corporate social responsibilities and legal and ethical issues, whereas Britannica focuses more on traditional aspects such as financial results. They note that both encyclopedias, while striving towards some form of neutrality, contain non-neutral ("positively and negatively framed") content, although it is more common to find it in Wikipedia. They also note that this content seemed to peak around 2008–2010, and attribute it to the negative views of major corporations common among the general public around that time, whose view was more likely to be represented on Wikipedia than on Britannica, also correlating this with the economic recession. The authors note that increasingly, knowledge available to the general public comes from social media collaboration projects such as Wikipedia, and are doubtful whether more traditional models like that of Britannica have a future. See also earlier coverage of a related paper by the same authors: "Are articles about companies too negative?" and of another where one of them (controversially) argued against the "bright line" rule on conflict of interest editing: "Wikipedia in the eyes of PR professionals".
 * Wikipedia uses in learning design: A literature review: This paper presents a relatively useful literature review of publications about the "teaching with Wikipedia" approach. The authors analyzed several scholarly databases (not explaining, however, why the selected ones were chosen and others were not), finding 30 works on related themes, and selecting 24 of those. They provide a number of useful breakdowns (2/3 of the works deal with higher education, 1/3 with secondary, none with primary) and analyze expected learning outcomes (the most popular being learning research methodology), knowledge fields that the papers represented (mostly fields of social science), and an overview of student tasks. While containing few revelations, the paper is a solid example of a literature review of an emerging field, and contains a valuable observation that more research is needed on how Wikipedia is used by elementary school students.
 * Wikipedia assignment has positive impact on students' "research persistence": A paper by two Californian librarians, titled "From Audience to Authorship to Authority: Using Wikipedia to Strengthen Research and Critical Thinking Skills" and presented at the recent conference of the Association of College and Research Libraries (ACRL), describes the results of two case studies "in which Wikipedia was used as the platform for assignments" for students, one of which "had, overall, a positive impact on research persistence" of the students.
 * Co-authorship patterns around Pope Francis, and Boston bombing views: In his blog, Brian Keegan (known to readers of this research report for his previous research on Wikipedia's coverage of breaking news events) provides a refreshing preview of his upcoming research, with a visualization of co-authorship patterns around the Pope Francis article. The social network analysis starts from the edits of the 607 editors who worked on the new pope's article and then adds all other articles they have collaborated on since. The results, which look like abstract artwork, show a number of complex patterns in the data. However, we will have to wait until the publication of the paper for these to be explained. Keegan does provide a number of teasers, but you will have to visit his blog to read these. In another blog post, Keegan examines pageviews of various articles related to the Boston marathon bombings.
 * Mining content removed from articles on breaking news events: A short paper accepted to the 2013 WWW conference describes a new tool designed for mining the information removed from Wikipedia articles during breaking news events. The Wikipedia Event Reporter identifies "bursts" of editing activity in an article, then uses machine learning techniques to identify sentences from the revision history of the article that were added during these bursts but which are not contained within the current version, and finally displays this information to the user, all in real time! The designers of the tool state that the Event Reporter will be useful for "a journalist or a student studying about history [who wants] a comprehensive view of an event, and not only the socially accepted final interpretation". While Event Reporter looks to be both useful and intriguing, this reviewer challenges the assumptions behind the authors' intended scenario of use. On Wikipedia, information about breaking news events is often removed because it is factually incorrect, not for the "sake of brevity", out of considerations of political correctness, or other (possibly nefarious) social motives. The authors do not address the issue of determining factual accuracy in their paper; hopefully their intended audience (journalists!) will keep that issue in mind if they decide to re-publish the mined information. The reviewer would also like to have seen a performance evaluation of their Vector Machine Classifier, which relies on hand-labelled training data, included in the paper. Nonetheless, this seems to be a fascinating and very powerful piece of software. One cool future direction for the Event Reporter team might be to mine the content of the article talk page during and directly after these bursts as well, and employ the same classification technique to provide the end user with a better sense of why certain content was revised or removed.
 * Spam on the rise as reason for user blocks: User:Ironholds examined the English Wikipedia's block log from 2006 to 2012 for the stated blocking reasons, and found "spam" being used more and more frequently.
 * 10k birth places and 40k almae matres from Wikipedia biographies, human-vetted: Google has published "a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of 'place of birth', and over 40,000 examples of 'attended or graduated from an institution'. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems."
 * How Wikipedia's Google matrix differs for politicians and artists: Continuing the authors' research on the Google matrix of Wikipedia articles and links between them (earlier coverage: "'Wikipedia communities' as eigenvectors of its Google matrix"), an arXiv preprint studies the "Time evolution of Wikipedia network ranking", finding among other things that "PageRank selection is dominated by politicians while 2DRank, which combines PageRank and CheiRank, gives more accent on personalities of arts".
 * A Wikipedia search algorithm that emphasizes serendipity: A two-page paper to be presented at the upcoming WWW 2013 conference explores algorithms for "Searching for Interestingness in Wikipedia and Yahoo! Answers", or "Serendipitous search", defined as "when a user with no a priori or totally unrelated intentions interacts with a system and acquires useful information". The authors modify some standard information retrieval metrics by including sentiment analysis and a measure of a page's quality – in the case of Wikipedia, "the number of dispute messages inserted by editors to require revisions", which may be seen as questionable. The resulting two algorithms for ranking search results on both sites are tested for some popular search terms (drawn from the Google Zeitgeist lists), by asking test subjects to rank the results "for relevance, interestingness to the query, and interestingness regardless of the query". In the end, the authors suggest that they be combined into a hybrid system.
 * Usability study recommends 18-point font for Wikipedia: An "experiment with 28 participants with dyslexia [comparing] reading speed, comprehension, and subjective readability" found "that font size has a significant effect on the readability and the understandability of the text, while line spacing does not". On that basis, the four researchers from Barcelona "recommend using 18-point font size when designing web text for readers with dyslexia".
 * OpenSym, Wikisym, ClosedSym?: This year the WikiSym conference will be co-located with OpenSym. This marks a step forward from a conference focused mostly on CSCW application of wiki technology to a broader investigation of OpenCulture. Both conferences will also be co-located with Wikimania 2013 in Hong Kong. The WikiSym conference is funded in part by a grant from the WMF. However, as reported last month (Wikimedia funding for Wikisym '13 despite open access concerns), there has been debate about the tension between the WMF's requirements on supporting open access research and the fact that the conference papers will be published by the ACM – a closed access provider. In two subsequent blog posts, the organizers explain why they have not been able to find an open publisher with a reputation comparable to the ACM for the 2013 proceedings, but formulate requirements for a suitable publisher for next year.
 * Wikimedia France research award winner announced: The French Wikimedia chapter has announced the winner of its research award: "Can history be open source? Wikipedia and the Future of the Past", an influential 2006 essay by the historian Roy Rosenzweig, which received the most votes among the jury-selected five finalists. The prize money of € 2,500 will go to the Roy Rosenzweig Center for History and New Media, founded by the paper's late author.
 * Provenance graphs: A conference paper by two computer scientists from the University of Newcastle presents code to convert metadata from Wikipedia revision history and user contribution pages (e.g. the author of a particular revision, or articles edited by an editor) into provenance data in the W3C's PROV-DM data model. The graph of revisions and editors is visualized. Code and examples are provided on GitHub.