Wikipedia:Wikipedia Signpost/2015-02-25/Recent research

"First Women, Second Sex: Gender Bias in Wikipedia"

 * by Maximilianklein (talk)

The problem of the Gender Gap in Wikipedia can mean several things: a gap in editors, a gap in the content, and of course the relationship between the two. An arXiv preprint titled "First Women, Second Sex: Gender Bias in Wikipedia" addresses the gap on the content side, supporting its argument with many quotes from Simone de Beauvoir. The authors use an ensemble of three methods (DBPedia metadata, language modelling, and network theory) to show not just inequality in encyclopedia inclusion, but degrees of sexism in how biographies are included. For instance, how different genders meet notability is quantifiably different, as is the centrality of biographies in their link structure.

The initial metadata technique is an inspection of DBPedia data mashed up with a separate dataset from previous research based on pronoun counting techniques. This method is a bit shaky as it relies on the combination of two derived datasets, especially in an era when Wikidata can deliver data closer to the source. Nevertheless the researchers find that 15.5% of the biographies in their final dataset are of women. Digging further, biographies are separated by subclass: athletes, politicians, military personnel, and all others are more heavily male; only artists and royalty are female-biased. Another finding from this type of infobox scraping is that female biographies are much more likely to have the spouse parameter filled.

Moving into the natural language realm, the paper inspects bigrams of the biographies' text. The top terms associated with men are "played", "football" and "league"; for women, the top are "actress", "women's" and "her husband". This already starts to hint at the notion that men are notable for what they do, rather than only for their static characteristics. To investigate further, Linguistic Inquiry and Word Count (LIWC) and two measures, frequency and burstiness, are employed for semantic classification. The semantic category where male biographies score significantly higher is cognitive processes, which encompasses words like "became", "known", and "made"; meanwhile female biographies have significantly more sexual words like "love", "passion", and "sex".
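To make the idea of gender-associated bigrams concrete, here is a minimal sketch. It is not the authors' pipeline: the example sentences are invented toy data, and the scoring rule (difference in relative frequency between the two groups) is an assumption chosen for simplicity, since the association measure used in the paper is not described here.

```python
from collections import Counter
import re

# Invented toy "biographies"; NOT data from the paper.
MEN = ["He played football in the league.", "He played football for years."]
WOMEN = ["Her husband was an actor.", "She met her husband in 1950."]

def bigrams(text):
    """Lowercase the text, keep alphabetic tokens, return adjacent pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return list(zip(tokens, tokens[1:]))

def relative_freq(texts):
    """Relative frequency of each bigram across a group of texts."""
    counts = Counter()
    for text in texts:
        counts.update(bigrams(text))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def distinctive_bigrams(texts_a, texts_b, top=3):
    """Bigrams over-represented in group A relative to group B.

    Assumed score: freq_A - freq_B. Ties are broken alphabetically.
    """
    fa, fb = relative_freq(texts_a), relative_freq(texts_b)
    scores = {bg: fa.get(bg, 0.0) - fb.get(bg, 0.0) for bg in set(fa) | set(fb)}
    return sorted(scores, key=lambda bg: (-scores[bg], bg))[:top]

if __name__ == "__main__":
    print("men:", distinctive_bigrams(MEN, WOMEN, top=2))
    print("women:", distinctive_bigrams(WOMEN, MEN, top=1))
```

On a real corpus one would also need to filter rare bigrams and control for article length, which this sketch ignores.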

The last domain explored is network structure. Each biography links to and is linked from other biographies, forming a directed graph. The first interesting thing to note is that in chi-squared testing between 4 link types (female–female, female–male, male–male, male–female), only female–female links occur more often than expected. Next, the biographies are ranked by PageRank, which measures the importance or "centrality" of each biography in the graph. Whenever the lowest-ranked articles are removed, the share of women in the remaining subset falls below the overall figure.
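The two network measurements can be illustrated with a small sketch. This is not the authors' code: the five-node graph and its gender labels are hypothetical and were constructed so that the effect is visible, the expected link counts are the usual independence baseline behind a chi-squared test, and the PageRank routine is a plain power iteration with an assumed damping factor of 0.85.

```python
from collections import Counter

# Hypothetical toy data: biography -> gender, and directed links (source, target).
GENDER = {"A": "m", "B": "m", "C": "m", "D": "f", "E": "f"}
LINKS = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "A"),
         ("C", "B"), ("D", "A"), ("D", "E"), ("E", "D")]

def link_type_table(links, gender):
    """Observed link counts per (source gender, target gender) and the
    counts expected if source and target gender were independent."""
    observed = Counter((gender[s], gender[t]) for s, t in links)
    n = len(links)
    src = Counter(gender[s] for s, _ in links)
    dst = Counter(gender[t] for _, t in links)
    expected = {(a, b): src[a] * dst[b] / n for a in src for b in dst}
    return observed, expected

def pagerank(links, nodes, damping=0.85, iterations=100):
    """PageRank by power iteration; dangling mass is spread uniformly."""
    n = len(nodes)
    out = {v: [] for v in nodes}
    for s, t in links:
        out[s].append(t)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for s in nodes:
            for t in out[s]:
                new[t] += damping * rank[s] / len(out[s])
        rank = new
    return rank

def female_share_top(rank, gender, k):
    """Share of women among the k highest-ranked biographies."""
    top = sorted(rank, key=lambda v: (-rank[v], v))[:k]
    return sum(gender[v] == "f" for v in top) / k

if __name__ == "__main__":
    observed, expected = link_type_table(LINKS, GENDER)
    for pair in sorted(expected):
        print(pair, "observed", observed[pair], "expected", round(expected[pair], 2))
    rank = pagerank(LINKS, list(GENDER))
    for k in (5, 3):
        print("top", k, "female share", female_share_top(rank, GENDER, k))
```

In this toy graph the female–female cell exceeds its expected count, and dropping the two lowest-ranked biographies lowers the female share, mirroring the pattern described above. A real analysis would follow the expected counts with a significance test over the full table.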

The authors wrap up their conclusions within the context of feminist theory. They argue the notion of gender roles is evident in Wikipedia in the way that metadata shows that men are more often known to be sportspeople, and women to be artists, royalty or spouses of someone else. Likewise the language of biographies is biased. That "her husband" and "first woman" are top terms in female articles indicates a failure of the Finkbeiner test. Furthermore the authors claim this exhibits "objectification" in light of the evidence that the "cognitive processes" of men were shown to be more significant than those of women, and that the "sexual" category is the only one in which women are more frequently described than men. Finally, as viewed from the network structure results, female biographies are less central to the encyclopedia. This is said to be because of historical philosophy and today's notability guidelines, reflecting the view that "reason and objectivity are gendered male", a feminist metaphysical view. The tendency of female articles to link to other female articles more than expected is, the authors imagine, due to women-led efforts to address the gender gap.

Overall this article provides a wide variety of methods to measure the gender gap, which proves a high-level point from many perspectives. It is situated in feminist thought, but multiple returns to Beauvoir make the final analysis seem superficial and generic. Additionally, the simplifying assumptions of English-only and derived datasets leave open the criticism that the larger points cannot be disentangled from a few extra biases introduced by language- and processing-inherited lenses. The authors admit as much in their limitations section, where they also acknowledge not questioning the gender binary. What we have here though is an increment to a growing pile of methods and techniques proving the gender gap which, ideologically, does not need, but can always benefit from, additional statistical legitimacy.

Wikipedia’s SOPA Strike considered as international political movement

 * Review by NeilK (talk)

A paper in Current Sociology written by prolific Wikipedian (and contributor to this research newsletter) Piotr Konieczny revisits the SOPA Strike. This was a 24-hour blackout of the English Wikipedia in 2012 to protest against proposed American copyright legislation, accompanied by tools for citizens to contact their representatives on the issue. The author argues this event demonstrates a new political opportunity structure for international movements, such as the free culture movement, to influence national policies.

A chronology of the events leading up to the SOPA Strike on Wikipedia is presented. The author then analyzes Wikipedia’s forums debating whether and how to restrict access to the site for a day. Debate participants are classified by such characteristics as national origin, history of editing Wikipedia, and stated arguments for and against. Simple quantitative analyses of population percentages and relative contribution are performed. Konieczny then tests various hypotheses about the nature of the protest, to see which one fits the facts.

Konieczny shows that experienced Wikipedians were generally supportive of a protest but were more likely to express misgivings about losing neutrality. Americans also participated in a greater proportion than their prevalence on the English Wikipedia. However the process also allowed non-US citizens and free culture idealists to have significant leverage over the debate on Wikipedia, and thus on American national politics. Konieczny tries to show that Wikipedia is thus an international social movement in the broader free culture movement. Konieczny ends the paper with a speculation that the many pro-blackout single-purpose accounts may reflect a new political consciousness among the young and internet-savvy.

Konieczny's analysis gives us a very detailed, fascinating picture of what arguments were made in public on Wikipedia forums during a crucial few weeks. However, this may omit some of the most influential discussions, by insiders, taking place person-to-person and in chat rooms. The paper also omits discussion of the influence of the Wikimedia Foundation, as an American institution responding to an American legal threat.

When Konieczny asserts the existence of a rising transnational "Net Generation", he presents very little evidence. A skeptical or quietist Wikipedian might still conclude that the encyclopedia wasn't acting as an organ of democracy, but was briefly overrun by a Twitter trending topic. If Konieczny is right, we may see other internet-based communities also being pressed into service, or more permanent institutions being developed to serve this new community.

Full disclosure: I (NeilK) was intimately involved with the SOPA Strike movement on Wikipedia, as a technologist on the WMF staff, and as a concerned Wikipedian who weighed in on the very forums analyzed in this paper, in favor of a blackout.

Assignment designed to convince students of Wikipedia's "fundamental untrustworthiness" achieves the opposite
An article in Communications in Information Literacy reports on the outcome of a senior-level course at Duquesne University where students "created or modified a Wikipedia entry and tracked the modifications made by others to the entry, while they also explored the concept of the ‘wisdom of crowds’ in contrast to the ‘wisdom of experts’ through the course readings and discussions". The class also wrote a new article collectively (Paramount Film Exchange (Pittsburgh)), and engaged in various breaching experiments. E.g. "the instructor inserted a defamatory falsehood into the page of Luke Ravenstahl, the mayor of Pittsburgh at the time, and asked students to see how long it took the falsehood to disappear. Within five minutes, it was gone." One student created an article that "seeks to promote a specific company, Accord Curtains, and it is purposefully manipulative." Another student vandalized an article about an NFL player and "Not even 5 seconds later, I had a message from a Wikipedia policeman informing me about the repercussions of doing such a thing to a Wikipage... It really opened my eyes as to how incredible and powerful the internet is to society."

Students subsequently wrote papers answering the question "What are Wikipedia pages good for?". Two and a half years after the class, participants were asked what they had learned about Wikipedia from the assignment for their post-college life. Five of them responded (a rather small sample, a limitation admitted by the authors), largely sticking to the judgment they had expressed in their original papers, reporting that "they came into the class convinced that Wikipedia was an unreliable source but that learning about the creation and community editing of Wikipedia pages made the site more reliable to them."

In the paper's conclusion, the authors comment:
 * "The instructor came into the unit assuming that he would be ushering students into an epiphany: Wikipedia, a source they loved and relied upon and rarely questioned, was actually rife with junk information because anyone—even they—could change anything at will. ... How this failed! The students took away the pragmatic lesson that Wikipedia was generally reliable, almost always useful, and that its self-policing mechanisms were mostly effective, particularly when it came to popular or especially controversial pages."

Similar findings are reported in an unrelated case study, titled "Attitude Changes When Using Wikipedia in Higher Education", which involved 23 students at Williams College, evaluating their "attitudes before and after participating in collaborative wiki assignments. Results from the study showed a statistically significant positive shift in attitudes [about Wikipedia and wikis in general] before and after using the wiki."

Reasons for contributing: Ego vs. social norms in the US and South Korea
This study, roughly, asks why people are uploading (contributing) content to Wikipedia, comparing respondents from two culturally different countries, namely collectivist South Korea and the individualistic United States. It uses the usual convenience sample of college students (reached through an online survey). In a 2012 survey involving only Korean students (previous coverage: "Do social norms influence participation in Wikipedia?"), the authors had found that users might be motivated by the fact that "uploading content on Wikipedia is a socially desirable act".

In the present study, the authors test whether a number of factors are positively correlated with intent to upload content on Wikipedia, based on psychological theories such as the theory of planned behavior and the situational theory of problem solving, and on the roles of ego involvement (which represents the self-concept of individuals), subjective norm (a person’s perception of the social pressure to perform or not to perform the behavior in question), and descriptive norm (beliefs about what is actually done by the majority of one’s social circles).

In total, the authors present nine hypotheses. Ego involvement is found to be highly significant, but does not differentiate between the two cultures, which the authors interpret as an indicator that globalization and the Internet are bridging the cultural gap, an interesting conclusion that deserves further discussion. The norms are found to be mostly irrelevant (only the descriptive norm is significant for the American sample group and, contrary to prior studies on Korean Internet users with regard to the subjective norm, neither is for the Korean one), as is the attitude toward uploading behavior. Another possible explanation offered by the authors for the small difference between the two cultures concerns the individualistic values embedded in, or self-oriented nature of, Web 2.0 applications and social media, and the authors repeat their proposition that it is likely due to globalizing factors (suggesting that the young Korean generation, despite living in a collectivist culture, is significantly affected by individualistic global media). Overall, the authors conclude that cultural differences play a relatively small role in explaining the differences in American and Korean attitudes towards uploading content to Wikipedia.

The study also reports on the interestingly low popularity of Wikipedia in South Korea: only about 50% of Korean students used Wikipedia, whereas almost 99% of American students did. The authors did propose some interesting explanations for this finding (such as a hypothesis that uploading content on Wikipedia might be regarded as a challenge to the established authority of traditional encyclopedias), but unfortunately they are not backed up with any significant evidence. Given South Korea's popular image as one of the most advanced countries when it comes to Internet use, the issue of Wikipedia's poor popularity there—as the authors note themselves—is one that is worth investigating in future studies.

Undergraduates confused by references in Wikipedia articles
It is no surprise that students like to use Wikipedia. A paper in New Library World adds to the debate on the perceptions, motivations, and attitudes of students who use this site by asking the following research question: "How do undergraduates actually use Wikipedia and how does this resource influence their subsequent information-gathering?" The study used the usual convenience sample of 30 American undergraduates, who were given a topic (Internet privacy), directed to the corresponding page, and asked to draft a paper on that topic, using Wikipedia as their starting point. Of particular interest to us are the author's comments on Wikipedia's references. First, there's the (unfortunately, short and unjustified) comment that "it is common for Wikipedia articles to have two or more “Notes” and “References” sections, which [is] confusing". Second, that "following Wikipedia references were least preferred as next steps in the research process", about as likely as "going to the library catalog", and less so than "going to Google for more information," "accessing the library’s databases", or simply "returning to Wikipedia". When asked which Wikipedia references they would follow if they were to do so, there was a significant preference for the references cited first, regardless of their quality. A number of respondents expressed an opinion that first references are somehow "better", not realizing that Wikipedia footnotes are ordered simply by the order they appear in the article. Regarding their use of Wikipedia itself, "respondents overwhelmingly indicated that they used Wikipedia because it was easy to access" (similar to Google), thus displaying a marked preference for convenience, visibility and accessibility over authority and quality of the source or their bibliographies. 
The author also notes that while the students understand that, in theory, scholarly sources are the best (and better than Wikipedia), they are more interested in "reasonably good" than "accurate" information, either because of difficulties in accessing or interpreting the "most credible" sources, or perhaps because of their skepticism towards authority.

The author concludes that one of the best solutions is to involve students in the process of creation and editing of Wikipedia pages, though she sees that as a method to educate students about Wikipedia's imperfections, rather than as a way to improve Wikipedia's quality, a task she seems to regard as better suited for faculty and librarians. She also offers some worthwhile suggestions to "Wikipedia developers" regarding the goal of pursuing collaboration with academic libraries, by noting that "it may be worth for Wikipedia to develop a visualized ranking mechanism for its references"—an idea that is certainly worth discussing further.

Briefly
 * ClueBot as a rebel among conquerors, followers and cowboys: There are four archetypes of Wikipedians on featured articles: Conqueror, Follower, Rebel, and Cowboy, according to the article "Measuring Creativity of Wikipedia Editors". The study investigated the quantity and rate of change of edits among editors over time, paying attention to their relative positions. The article describes the four personas of editors on the article Boston. A Conqueror shows strong bursts of activity, sustains high volume over time, and is a first mover. A Follower is low volume, but still sustained, and positively correlated with a Conqueror. A Rebel (which, hilariously, they found ClueBot, the software, to be) is low volume, sustained, but negatively correlated with a Conqueror. Lastly, a Cowboy is erratic, with spiky contributions, and uncorrelated with other users. This study is not very broad in terms of number or types of articles in question; only 79 articles were considered. And given the naming of their archetypes, clearly the authors aren't aware that Wikipedians have already transcended into classifying themselves by an entire ecosystem of WikiFauna.
 * Using Wikipedia to correct public misconceptions about Africa: An article titled "Wikipedia for Africanists", coauthored by Hans Muller, a Wikipedian in Residence at the African Studies Centre in Leiden (Netherlands), describes the usefulness of Wikipedia for that academic discipline: "Using Wikipedia, Africanists can benefit in two ways: as readers they can quickly obtain a sourced but non-academic outline of topics of interest, and as outreach writers, they can inform the public worldwide about recent insights and attempt to solve (the many) misunderstandings on African topics with unprecedented efficiency."
 * Geographic distribution of Wikimedia traffic: The Wikimedia Foundation's Oliver Keyes published "a highly-aggregated dataset of readership data" of Wikipedia (representing an additional effort to exclude non-human traffic compared to previous data). Work is ongoing to create data visualizations.

Other recent publications
A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.
 * "Ranking Wikipedia article's data quality by learning dimension distributions"; summary at kurzweilai.net: "Using Bayesian statistics to rank Wikipedia entries: Algorithm outperforms a human user by up to 23 percent in correctly classifying quality rank of articles, say researchers"
 * "A visual editor for the Wiki Object Model" (German bachelor thesis adapting Wikipedia's VisualEditor to other wikis)
 * "Use and Perception of Wikipedia among Medical Students in a Nigerian University" From the abstract: "[In a survey with 60 respondents,] 91.7% of the medical students have used Wikipedia;... 50.9% of the students use Wikipedia to complement lecture notes, 43.6% for research project as well as to complete class assignment, 14% of them use it to modify content of articles; ... the challenges faced by the students are scantiness of information of some articles, unavailability of/inability to obtain articles on some topics from the site, and inaccuracy/unreliability of content of articles."
 * "Where Non-Science Majors Get Information about Science and How They Rate that Information" From the abstract: "We report on a study of 400 undergraduate non-majors students enrolled in introductory astronomy courses at the University of Arizona ... Overall, students reported getting information from a variety of online sources when looking up a topic for their own knowledge, including internet searches (71%), Wikipedia (46%), and online science sites (e.g. NASA) (45%). When asked where they got information for course assignments, most reported from assigned readings (82%) but a large percentage still reported getting information from online sources such as internet searches (60%), Wikipedia (30%) and online science sites (e.g. NASA) (20%). Overall, students rated professors/teachers and textbooks at the most reliable sources of scientific information and rated social media sites, blogs and Wikipedia as the least reliable sources of scientific information."
 * "Integration of multiple network views in Wikipedia" From the abstract: "[We analyze] the networks of editors interacting on Wikipedia pages. We propose the prediction of article quality as a task that allows us to quantify the informativeness of alternative network views. We present three fundamentally different views on the data that attempt to capture structural and temporal aspects of the edit networks."
 * "Experimental evaluation of learning performance for exploring the shortest paths in hyperlink network of Wikipedia" From the abstract: "...in three separate learning sessions of 20 minutes students read series of 62 sentences built by using 22 unique hyperlinks that form the eleven shortest paths and answered pre-test and post-test multiple-choice questionnaires about recall of sentences ... For experiment group (n=24) 62 sentences were chained in such an ordering that corresponds to traversing cumulatively a series of associative trails leading from concept Tourism in Malta to concept Euro coins of Malta along alternative parallel shortest paths in hyperlink network of Wikipedia category Malta. For control group (n=10) same sentences had randomized ordering. For both unique hyperlinks and consecutive pairs of hyperlinks experiment group reached higher degrees of recall than control group". (See also Wiki Game)
 * "Educational exploration based on conceptual networks generated by students and Wikipedia linkage" (by the same author)
 * "Citations to Wikipedia in Canadian Law Journal and Law Review Articles"
 * "Advances in Wikipedia-based Interaction with Robots"
 * "Mining corpora of computer-mediated communication: Analysis of linguistic features in Wikipedia talk pages using machine learning methods"
 * "Identifying Featured Articles in Spanish Wikipedia" From the abstract: "...the first study to automatically assess information quality in Spanish Wikipedia, where Featured Articles identification is evaluated as a binary classification task. Two popular classification approaches like Naive Bayes and Support Vector Machine (SVM) are evaluated ..."
 * "Predicting the Popularity of Trending Articles in the Arabic Wikipedia Using Data Mining Techniques"
 * "Revision history: Translation trends in Wikipedia" From the abstract: "This paper uses Mossop's taxonomy of editing and revising procedures to explore a corpus of translated Wikipedia articles to determine how often transfer and language/style problems are present in these translations and assess how these problems are addressed."