Wikipedia:Wikipedia Signpost/2012-01-30/Recent research

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, edited jointly with the Wikimedia Research Committee and republished as the Wikimedia Research Newsletter.

Admins influence the language of non-admins
An arXiv preprint titled "Echoes of power: Language effects and power differences in social interaction" looks at the language used by Wikipedia editors and at how conversational language can be used to understand power relationships. The research analyzes how much people adapt their language to that of others involved in a discussion (a process called language coordination). The findings indicate that the more a person coordinates, the more deferential that person is. The authors find that editors on Wikipedia tend to coordinate (language-wise) more with administrators than with non-administrators. Further, the study suggests that language coordination is related to one's chances of becoming an administrator: admin candidates who coordinate more have a higher chance of being elected than those who do not change their language. Once elected, administrators tend to coordinate less.
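As a rough illustration of how "language coordination" could be put into numbers, the sketch below compares how often a replier uses a class of words when the preceding speaker used it with how often the replier uses it overall. The marker set (articles), the function name and the toy exchanges are assumptions made for this example; they are not the preprint's actual method or data.

```python
def coordination(exchanges, markers):
    """Return P(reply uses marker | prompt uses marker) - P(reply uses marker).

    exchanges: list of (prompt_text, reply_text) pairs.
    markers: set of lowercase words treated as one marker class (assumed).
    """
    def uses(text):
        return any(word in markers for word in text.lower().split())

    reply_hits = [uses(reply) for _, reply in exchanges]
    triggered = [uses(reply) for prompt, reply in exchanges if uses(prompt)]
    if not exchanges or not triggered:
        raise ValueError("need at least one exchange whose prompt uses a marker")
    baseline = sum(reply_hits) / len(reply_hits)
    conditional = sum(triggered) / len(triggered)
    return conditional - baseline


# Toy data (invented): the replier echoes articles when the prompt uses them.
ARTICLES = {"a", "an", "the"}
toy = [
    ("please fix the citation", "done, the citation is fixed"),
    ("the lead needs work", "agreed, rewriting the lead"),
    ("thanks", "no problem"),
    ("looks good", "great"),
]
score = coordination(toy, ARTICLES)
```

A positive score means the replier uses the marker class more often right after the other speaker has used it; a score near zero means no visible adaptation.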

A blog post on the website of Technology Review summarized the results under the headline "Algorithm Measures Human Pecking Order" and highlighted the fact that one of the authors is Jon Kleinberg, known as the inventor of the HITS algorithm (also known as "hubs and authorities").

Can Wikipedia replace commercial biography databases?
An article by a librarian and professor at California State University, East Bay offers a comparison of "biographical content for literary authors writing in English" between Wikipedia, "the web" (i.e. top Google search results) and two commercial databases: the Biography Reference Bank (BRB, now part of EBSCO Industries) and Contemporary Authors Online (CAO). The comparison was motivated by the decision of the author's institution to cancel its subscription to CAO during a budget crisis in 2008–2009, a decision which, among other reasons, had been accompanied by "a comment that this information is 'on the web'".

The paper starts out with a literature review on the reliability of Wikipedia and then describes how the author compiled a list of 500 authors (mostly from the US and UK) by "examining curricula and textbooks from English literature courses across the USA" and soliciting additional suggestions from peers. These names were then searched on BRB, CAO (as part of the Literature Resource Center), Wikipedia and Google.

Regarding breadth of coverage, only six of the 500 names were "absent" on Wikipedia (meaning that they had "no entry of their own or reference in any other entry"), compared to 14 for CAO, and 50 for the Biography Reference Bank.

While the study does not seem to have attempted a systematic comparison of factual accuracy, it observes that Wikipedia "entries are less uniform than those in commercial databases. The biographical information ranges from extensive to perfunctory."

The author remarks favorably on Wikipedia's searchability:
 * "The databases and Wikipedia deal better than the Web with variant names, pseudonyms, and names that apply to multiple people. Cross-referencing is very good. [...] Wikipedia searching is very easy. There were even cases where it was easier to search Wikipedia than the databases. [...] Wikipedia also 'disambiguates' names and offers quick descriptions to enable the searcher to find the correct individual."

A large part of the comparison consists of examining each resource's production process. Wikipedians may find parallels to their policies on biographies of living people, self-published sources and notability in the description for the Biography Reference Bank:
 * "Current Biography [the main content source of BRB] articles rely on secondary sources, but Wilson [the then publisher] has occasionally spoken directly with subjects or their proxies. Upon publication, many articles have been sent to subjects for review before being updated for the print annual and the databases. If subjects raise objections, misinformation is corrected, but not matters of public record. Adjustments may be made for privacy, for example omitting the specific names of children."
 * "To be included in World Authors [another source of BRB], authors must have published more than one critically acclaimed book. [...]"
 * "For autobiographies, Wilson attempted to contact subjects in Junior Authors and World Authors for a statement, but not subjects in Current Biography. [... An example offered by a Wilson employee:] For some reason, Jennie Tourel, a Russian-American opera singer, often provided false information, but, according to the Wilson biography, “passports and other documents that surfaced soon after her death helped to correct some of these inaccuracies”."

In the conclusion, the author answers the initial question by recommending that her employer "re-subscribe to a commercial biographical database" if the budget permits it again, because "Commercial databases provide a foundation with authoritative core content authenticated prior to publication and integrated with the fabric of information in the library’s holdings. They are easy to search and reliable, although they cannot be as current as Wikipedia or the Web because of their authentication processes. Wikipedia become [sic] more impressive as searching proceeded. The focus may be on verifiability rather than authority and there may be challenges in securing contributors, but the current contributors provide citations and often include unique information." All in all, she seems to favor Wikipedia and the two databases over "the web" (Google results), which "may have plenty of dross and be less reliable, harder to search, and focused on commercialism, but there are gold nuggets." She worries: "What will happen if contributors to Wikipedia and the web have no authoritative databases to use as sources?"

Students predict connections between Wikipedians
Among the student projects in a class on "Computational Analysis of Social Processes" at Rensselaer Polytechnic Institute, three analyzed social networks of Wikipedia editors:


 * The write-up for a project titled "Interaction vs. Homophily in Wikipedia Administrator Selection" provides an analysis of factors related to one's participation (or lack thereof) in the Request for Adminship discussions. It confirms previous findings that many participants are drawn to the discussions by their personal contacts and experiences with others. The paper tries to analyze the impact of direct past interaction versus homophily (roughly defined as shared interests). The findings suggest that homophily plays a much smaller role compared to past interactions. Overall, it appears that administrators are often elected (or opposed) not by the community at large, but by a group of their closest peers. To quote from the conclusion of the paper: "This raises questions about the robustness of Wikipedia's administrator selection process which is then comprised of a very small interaction-selected group of editors."
 * Another project write-up titled "Link Prediction Analysis in the Wikipedia Collaboration Graph" tested various models to predict the strength of the connection between two Wikipedia editors in a "dynamic collaboration graph" that measures, at a given point in time, how often they recently edited the same page, with more recent edits weighted more heavily.
 * A third student paper titled "Link prediction on a Wikipedia dataset based on triadic closure" likewise tested various models on a similar graph consisting of Wikipedia users as vertices, regarding the closure of triangles (i.e. if user A is connected with B, and B with C, is A connected with C as well?). Among the conclusions is that such "triadic closure, while still occurring in Wikipedia, is happening at a slower pace now than before–likely due to the influx of less active editors".
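The triangle question in the last item can be made concrete with a small sketch. It counts, in an undirected graph, the share of two-step paths (A linked to B, and B linked to C) whose end points A and C are also linked. The toy edge list and the function name are hypothetical and are not taken from the student papers.

```python
from itertools import combinations


def closure_ratio(edges):
    """Fraction of connected triples (A-B-C paths) whose ends are also linked."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    triples = closed = 0
    for center, adjacent in neighbors.items():
        # Every pair of neighbors of `center` forms a two-step path through it.
        for a, c in combinations(sorted(adjacent), 2):
            triples += 1
            if c in neighbors[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Toy graph (invented): A, B and C form a triangle; D is linked only to C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
ratio = closure_ratio(toy_edges)
```

Computing this ratio on snapshots of a graph taken at different dates is one simple way to see whether closure is speeding up or slowing down over time.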

Language analysis finds Wikipedia's political bias moving from left to right
A study presented earlier this month at the annual meeting of the American Economic Association, which is to appear in The American Economic Review, sets out to test whether the English Wikipedia is truly neutral by measuring bias within a sample of 28,000 entries about US political topics, examined over a decade. Bias is identified by detecting the use of language specific to one side of the American political scene (Democrats or Republicans). To quote from the article: "In brief, we ask whether a given Wikipedia article uses phrases favored more by Republican members or by Democratic members of Congress" (in the text of the 2005 Congressional Record, using a method developed in an earlier paper by Gentzkow and Shapiro, who applied it to newspapers). The authors identified, as of January 2011, 70,668 articles related to US politics, about 40% of which had a statistically significant bias. They find that Wikipedia articles are often biased upon creation, and that this bias rarely changes. Early in Wikipedia's history, most had a pro-Democratic bias, and while "by the last date, Wikipedia's articles appear to be centered close to a middle point on average", this is the effect of a larger number of new pro-Republican articles rather than of existing articles having been rewritten neutrally.
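The general idea of scoring an article by the partisan phrases it contains can be sketched in a few lines. This is a deliberately simplified stand-in, not the procedure used in the study or in the Gentzkow and Shapiro paper: the phrases, their scores and the function name are placeholders invented for illustration.

```python
def slant_index(text, phrase_scores):
    """Average score of the scored phrases found in text, weighted by count.

    phrase_scores maps a phrase to a number: positive if one party (assumed)
    uses it more, negative for the other. Returns None if no phrase occurs.
    """
    lowered = text.lower()
    total = hits = 0
    for phrase, score in phrase_scores.items():
        count = lowered.count(phrase)
        total += score * count
        hits += count
    return total / hits if hits else None


# Placeholder phrases and scores, invented for illustration only.
SCORES = {"phrase one": 1.0, "phrase two": -1.0}
article = "Phrase one appears here. Phrase one again, then phrase two."
index = slant_index(article, SCORES)
```

An article that contains none of the scored phrases gets no index at all, which is why a measure of this kind can only classify part of a sample.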

While the authors made efforts to exclude articles not pertinent to US politics (requiring the terms "United States" or "America" to appear at least three times in the article text), the sample also includes the clearly international article Iraq War. And in what Wikipedians may call out as systemic bias, the authors never question their assumption that for an international encyclopedia, a lack of bias would be indicated by the replication of the spectrum of opinions present in the US Congress. As early as 2006, Jimmy Wales objected to such notions with respect to the community of contributors: "If averages mattered, and due to the nature of the wiki software (no voting) they almost certainly don't, I would say that the Wikipedia community is slightly more liberal than the U.S. population on average, because we are global and the international community of English speakers is slightly more liberal than the U.S. population. ... The idea that neutrality can only be achieved if we have some exact demographic matchup to [the] United States of America is preposterous." Nevertheless, even if one turns the study on its head and reads it as a statement on average American political opinion compared to the rest of the world as reflected in the English Wikipedia, its results remain remarkable.

Briefly

 * Calls for papers have appeared this month for:
   * WikiSym 2012, the eighth instance of this annual research conference on wikis and open collaboration
   * Wikimania 2012, the eighth annual global conference of Wikimedians
   * Wikipedia Academy: Research and Free Knowledge, a conference organized by Wikimedia Germany
   * "Academic research into Wikipedia: Beyond English Wikipedia and towards comparative perspectives", an upcoming issue of the e-journal Digithum
 * New effort at comprehensive wiki research literature database: Wikipedian emijrp has announced the launch of WikiPapers, a Semantic MediaWiki-based wiki dedicated to the "compilation of resources (conference papers, journal articles, theses, books, datasets and tools) focused on the research of wikis". The task of creating such a database has seen several efforts before and its difficulties were explored in a well-attended workshop at last year's WikiSym conference (see the October issue of this newsletter). Researcher Finn Årup Nielsen (who last year published an overview of such literature, mentioning well over 1000 publications) pointed out the possibility of exchanging content between the new wiki, the existing (likewise Semantic MediaWiki-based) Acawiki and his own Brede Wiki.
 * Review of Good Faith Collaboration: Sociological journal The Information Society reviewed Joseph Reagle's 2010 book Good Faith Collaboration: The Culture of Wikipedia (which was recently released online under a Creative Commons license), praising it as "an accurate account of this sociocultural and sociotechnological phenomenon that Wikipedia is". The reviewer calls Wikipedia a "virtual tool and reference jim-dandy [which] is another flashpoint in our path of social anxieties" and holds that "nobody can think of a true rival [of Wikipedia] in this knowledge contest. The sins are there: lightness, temporary reliability, questionable scholarly approaches, sometimes oversimplification, sometimes data excess; however, these are venial sins and easily absolved." Somewhat cryptically, he observes that "the European Union is trying to adapt this part of Western academia to the global university system (though inevitably Anglo-American inspired)". He commends the book as an "accessible analysis [which] makes it clear that Wikipedia is not wasted knowledge; it is human thirst for knowledge and we are simply gathering scattered pieces". Gentle criticism includes that "though [...] inaccuracies are stated, two other important worries — that it is not financially sustainable, and that Wikipedia has lost touch with its founding ideal—are not as openly dealt with".
 * Predicting categories from links: In a paper titled "Using Network Structure to Learn Category Classification in Wikipedia" (the write-up of a class project for an Autumn 2011 Stanford course titled "Social and Information Network Analysis"), three students describe the construction of a classifier algorithm that tries to predict from an article's ingoing and outgoing wikilinks whether it is a member of the Category:American actors – "We chose this particular category because it is one of the largest on Wikipedia (almost 25,000 pages)".
 * Wikipedia vs. library catalogue: An article in Library and Information Research titled "Searching where for what: A comparison of use of the library catalogue, Google and Wikipedia" analyzed search queries from users of Google (using Hitwise data), Wikipedia, and a state library in Australia. Unsurprisingly, it found that the library catalogue is used much less frequently than the other two, but it posits that the "fact that popular culture queries accounted for [a very] substantial proportion of Google and Wikipedia queries and almost no [library] catalogue queries indicates that, indeed, people do turn to different information resources for different subjects."
 * "Lexical clues" predict article quality: A paper presented at the 3rd Symposium on Web Society (SWS) last October sought to predict article quality based on eight different ratios derived from counting the number of sentences, words, diverse words, nouns, verbs, diverse nouns, diverse verbs and copulas in the article text. The authors trained a decision tree on a sample of 200 start-class and 200 featured articles (truncating each of the latter to 800 to 1000 words to arrive at a typical start-class article length) and then tested it on a different sample of 100 start-class and 100 featured articles, achieving precision and recall of more than 83% each.
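For readers unfamiliar with the terms in the last item, the sketch below shows what count-based ratios and the precision and recall figures look like in practice. The two ratios are illustrative guesses at the kind of feature described, since the paper's eight ratios are not listed here, and the texts and labels are invented toy data.

```python
import re


def lexical_ratios(text):
    """Two illustrative ratios; the paper's eight ratios are not reproduced."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    return {
        "words_per_sentence": len(words) / len(sentences),
        "diverse_word_ratio": len(set(words)) / len(words),
    }


def precision_recall(actual, predicted, positive="featured"):
    """Precision and recall for one class, from parallel label lists."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy inputs (invented) to exercise both helpers.
ratios = lexical_ratios("The cat sat. The cat ran away.")
actual = ["featured", "featured", "featured", "start", "start"]
predicted = ["featured", "featured", "start", "featured", "start"]
precision, recall = precision_recall(actual, predicted)
```

Precision is the share of articles predicted as featured that really are featured; recall is the share of featured articles that the classifier found.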