Wikipedia:Wikipedia Signpost/2019-06-30/Recent research

"Trajectories of Blocked Community Members"

 * Reviewed by FULBERT

In Chang and Danescu-Niculescu-Mizil’s study "Trajectories of Blocked Community Members: Redemption, Recidivism and Departure", presented at The Web Conference last month, the authors explored the effectiveness of temporary editor blocks imposed as acts of moderation for inappropriate conduct on Wikipedia. The researchers wanted to understand how well these blocks work as a tool for reforming editors whose conduct was problematic enough to warrant discipline, but not so egregious as to warrant an indefinite block, i.e. cases where there was hope that the block might help reform editors who had shown evidence that they could be productive ongoing contributors to Wikipedia.

The researchers limited their study to the first block received by users, beginning with a total sample of 104,245 blocks over the history of English Wikipedia. They focused on four types of blocks that enforced community norms (personal attacks and incivility, harassment, edit warring, and disruptive editing), arriving at a dataset comprising 6,026 blocked users. They were particularly interested in editor departure, which they defined as the time of the last comment a user makes on community talk pages, in part because they focused their attention on community engagement. They also explored recidivism, where a previously blocked user is blocked again for a further breach of community rules, in contrast to users who reform and are never blocked again.

They found that "users are much more likely to depart if they were blocked before, especially in the first months of their life in the community" (p. 4). Indeed, users are more likely to depart during the month of their block, and this is compounded by the finding that users who were blocked a first time had an increased probability of being blocked again. These findings depended on how long a user had been an active member of the community: the longer one's involvement, the less likely blocks were to occur, especially repeat blocks. The authors then explored user activity level and activity spread, finding that recidivist users interact with the community in more depth than breadth, engaging with a smaller set of users over more prolonged exchanges. Finally, the researchers explored blocked users' perception of fairness, as expressed in block appeals. Those who apologized for their infractions were less likely to be blocked again, while those who directly questioned their blocks or claimed a lack of fairness instead showed a moderate increase in recidivism.

The authors concluded that a "more nuanced approach to moderation that more broadly accounts for the tradeoffs between possible outcomes and that considers how the affected individuals might perceive the moderators and their actions" (p. 10) should be considered when moderators (administrators) weigh blocking users for transgressions of community standards.

Conferences and events
See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines, and the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.''
 * Compiled by Tilman Bayer

Hilbert and Newton top the "Wikipedia Network of Mathematicians"
From the abstract:  "We look at the network of mathematicians defined by the hyperlinks between their biographies on Wikipedia. [...] We illustrate how such Wikipedia data can be used by performing a centrality analysis. These measures show that Hilbert and Newton are the most important mathematicians."

Oxford, Cambridge and Harvard top the "Wikipedia Ranking of World Universities"
From the abstract and paper: "We present Wikipedia Ranking of World Universities (WRWU) based on analysis of networks of 24 Wikipedia editions collected in May 2017. With PageRank and CheiRank algorithms we determine ranking of universities averaged over cultural views of these editions. The comparison with the Shanghai ranking gives overlap of 60% for top 100 universities showing that WRWU gives more significance to their historical development. [... Our] approach also determines the influence of specific universities on world countries. We also compare different cultural views of Wikipedia editions on significance and influence of universities." "The top 3 places of WRWU are occupied by Oxford, Cambridge and Harvard while for ARWU [the Shanghai ranking] it is Harvard, Stanford and Cambridge." An earlier, related preprint by some of the same authors was covered by Technology Review under the headline "Wikipedia-Mining Algorithm Reveals World’s Most Influential Universities".
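CheiRank is PageRank computed on the same network with every link reversed, so it ranks articles by their outgoing rather than incoming links. The following is a minimal sketch on a toy graph, not the authors' implementation: the node names, the damping factor of 0.85, the iteration count and the function names are all illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict {node: [link targets]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no outgoing links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def cheirank(links, damping=0.85, iterations=100):
    """CheiRank: PageRank of the network with all link directions inverted."""
    reversed_links = {}
    for source, targets in links.items():
        reversed_links.setdefault(source, [])
        for target in targets:
            reversed_links.setdefault(target, []).append(source)
    return pagerank(reversed_links, damping, iterations)


# Hypothetical toy network: "hub" links out a lot, "authority" is linked to a lot.
TOY_LINKS = {
    "hub": ["a", "b", "authority"],
    "a": ["authority"],
    "b": ["authority"],
    "authority": [],
}

pr = pagerank(TOY_LINKS)
cr = cheirank(TOY_LINKS)
print(max(pr, key=pr.get), max(cr, key=cr.get))
```

On this toy network the heavily linked-to node leads the PageRank ordering, while the node with the most outgoing links leads the CheiRank ordering.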

C is the top programming language and Cambridge the top university for computer science, according to (some) Wikipedia network ranking
From the abstract and paper: "We construct 10 different networks from Wikipedia entries (articles) related to the chosen domain. The goal of the experiment is to extract domain knowledge in terms of identifying entries that are centrally positioned and entries that are strongly related. [...] the first task is the construction of a web scraping program which would extract hyperlinks from a Wikipedia entry’s text. The hyperlinks are extracted using a Python package for HTML parsing called Beautiful Soup which parses the HTML structure of a given HTML document into a parse tree. By navigating the tree we locate the tag ID which corresponds to article content ("mw-content-text") and proceed to extract the hyperlinks which themselves are found within paragraph tags and finally inside link (<a>) tags in that section of the page." The various resulting rankings place e.g. "University of Cambridge" 2nd (after "human") by degree centrality in the article network for "computer science", and "C (programming language)" 2nd by closeness centrality in the article network for "programming language" (after "programming language" itself).
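The extraction step quoted above can be sketched as follows. The paper uses Beautiful Soup; to stay dependency-free this sketch swaps in Python's standard-library `html.parser`, so it should be read as an approximation of the described approach rather than the authors' code. The class name, function name and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ContentLinkParser(HTMLParser):
    """Collect hrefs of links inside paragraphs of the article content element."""

    def __init__(self, content_id="mw-content-text"):
        super().__init__()
        self.content_id = content_id
        self.content_tag = None    # tag name of the content container, once entered
        self.depth = 0             # nesting depth of that tag name inside the container
        self.in_paragraph = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.content_tag is None:
            # Outside the article content: only look for the container itself.
            if attrs.get("id") == self.content_id:
                self.content_tag, self.depth = tag, 1
            return
        if tag == self.content_tag:
            self.depth += 1
        if tag == "p":
            self.in_paragraph = True
        elif tag == "a" and self.in_paragraph and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.content_tag is None:
            return
        if tag == "p":
            self.in_paragraph = False
        if tag == self.content_tag:
            self.depth -= 1
            if self.depth == 0:
                # Left the content container; ignore everything after it.
                self.content_tag = None
                self.in_paragraph = False


def extract_links(html):
    parser = ContentLinkParser()
    parser.feed(html)
    return parser.links


SAMPLE_HTML = """
<html><body>
<div id="sidebar"><p><a href="/wiki/Main_Page">Main page</a></p></div>
<div id="mw-content-text">
  <div class="hatnote"><a href="/wiki/C_(disambiguation)">other uses</a></div>
  <p><b>C</b> is a <a href="/wiki/Programming_language">programming language</a>
     created by <a href="/wiki/Dennis_Ritchie">Dennis Ritchie</a>.</p>
  <table><tr><td><a href="/wiki/Infobox">infobox link</a></td></tr></table>
</div>
<div id="footer"><p><a href="/wiki/About">About</a></p></div>
</body></html>
"""

print(extract_links(SAMPLE_HTML))
# → ['/wiki/Programming_language', '/wiki/Dennis_Ritchie']
```

Only links that sit inside a paragraph within the content element are kept; the sidebar, hatnote, table and footer links in the sample are skipped.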

"Top 100 historical figures of Wikipedia"
From the abstract:  "The top 100 historical figures of Wikipedia are determined on the basis of statistical methods and mathematical algorithms like PageRank, CheiRank and 2DRank applied to networks of Wikipedia in up to 24 languages. ... This short popular note presents overview of results and methods of different [research] groups..."

"Wikipedia network analysis of cancer interactions and world influence"
From the abstract: "We apply the Google matrix algorithms for analysis of interactions and influence of 37 cancer types, 203 cancer drugs and 195 world countries using the network of 5 416 537 English Wikipedia articles with all their directed hyperlinks. The PageRank algorithm provides the importance order of cancers which has 60% and 70% overlaps with the top 10 cancers extracted from World Health Organization GLOBOCAN 2018 and Global Burden of Diseases Study 2017, respectively. [...] We argue that this analysis of knowledge accumulated in Wikipedia provides useful complementary global information about interdependencies between cancers, drugs and world countries." The resulting ranking is led by lung cancer, breast cancer and leukemia.

See also earlier coverage of a related paper by the same authors: "World Influence of Infectious Diseases from Wikipedia Network Analysis"

Facebook AI researchers construct the "Wizard of Wikipedia: Knowledge-Powered Conversational Agents"
From the abstract: "We collect and release a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. We then design architectures capable of retrieving knowledge, reading and conditioning on it, and finally generating natural responses." See also the OpenReview discussion.

"The dynamics of Wikipedia article revisions: an analysis of revision activities and patterns"
From the abstract:  "We identify four revision patterns: 1) revision actions at the sentence and link levels appear in similar paces; 2) the numbers of revision actions at sentence and link levels comparatively evenly grow with the article's age prior to the last time period; 3) the paces of media and reference-level actions tend to be lagged behind sentence and link-level actions; 4) before being promoted to the GA or FA rank, articles nominated to the GA or FA rank exhibit a significant rising pattern in amounts of revisions and revision actions."

"Interactive Quality Analytics of User-generated Content: An Integrated Toolkit for the Case of Wikipedia"
From the abstract: "Great efforts have been invested in algorithmic methods for automatic classification of Wikipedia articles (as featured or non-featured) and for quality flaw detection. Instead, our contribution is an interactive tool that combines automatic classification methods and human interaction in a toolkit, whereby experts can experiment with new quality metrics and share them with authors that need to identify weaknesses to improve a particular article. A design study shows that experts are able to effectively create complex quality metrics in a visual analytics environment. In turn, a user study evidences that regular users can identify flaws, as well as high-quality content based on the inspection of automatic quality scores."

"The Impact of Topic Characteristics and Threat on Willingness to Engage with Wikipedia Articles: Insights from Laboratory Experiments"
From the abstract and paper:  "We presented the introduction parts of 20 Wikipedia articles and asked participants to rate each article with respect to familiarity and controversiality. In addition, we experimentally manipulated participants’ level of mortality salience in terms of the amount of threat they experienced when reading the article. Participants also indicated their willingness to engage with a particular article. The results revealed that willingness to engage with a Wikipedia article was predicted by both topic familiarity and controversiality of a given article. Although mortality salience increased accessibility of death-related thoughts, it did not result in any changes in people’s willingness to work with the articles. [...] Sixty-one participants reported that they read Wikipedia articles at least once a week. They indicated an average reading time of M = 1.57 h per week (SD = 1.11)."

"With Few Eyes, All Hoaxes Are Deep"
From the abstract: "we describe a long-standing set of inefficiencies that have plagued new page patrolling by drawing a contrast to the more efficient, distributed processes for counter-vandalism. Further, to address this issue, we demonstrate an effective automated topic model based on a labeling strategy that leverages a folksonomy developed by subject specific working groups in Wikipedia (WikiProject tags) and a flexible ontology (WikiProjects Directory) to arrive at a hierarchical and uniform label set. We are able to attain very high fitness measures [...] and real-time performance using word2vec-based features. Finally, we present a proposal for how incorporating this model into current tools will shift the dynamics of new article review positively."

"Finding Prerequisite Relations using the Wikipedia Clickstream"
From the abstract:  "Where should the learner start? What should the learner know before tackling a new course? [...] we present a new method for identifying prerequisite relations based on naturally occurring data, namely the navigation patterns of users on the Wikipedia online encyclopedia. Our supervised learning approach shows that the navigation network structure can be used to identify dependencies among concepts in several domains."

"Query for Architecture, Click through Military: Comparing the Roles of Search and Navigation on Wikipedia"
From the abstract:  "...we study large-scale article access data of the English Wikipedia in order to compare articles with respect to the two main paradigms of information seeking, i.e., search by formulating a query, and navigation by following hyperlinks. To this end, we propose and employ two main metrics, namely (i) searchshare -- the relative amount of views an article received by search --, and (ii) resistance -- the ability of an article to relay traffic to other Wikipedia articles -- to characterize articles. We demonstrate how articles in distinct topical categories differ substantially in terms of these properties. For example, architecture-related articles are often accessed through search and are simultaneously a 'dead end' for traffic, whereas historical articles about military events are mainly navigated."

"How Article Topic, Quality and Dwell Time Predict Banner Donation on Wikipedia"
From the abstract:  "In this work, using a dataset of aggregated donation information from Wikimedia’s 2015 fund-raising campaign, representing nearly 1 million pages from English and French language versions of Wikipedia, we explore the relationship between the properties of contents of a page and the number of donations on this page. Our results suggest the existence of a reciprocity mechanism, meaning that articles that provide more utility value attract a higher rate of donation. We discuss these and other findings focusing on the impact they may have on the design of banner-based fundraising campaigns."