Wikipedia:Wikipedia Signpost/2019-12-27/Recent research

"Using Wikipedia to promote acoustics knowledge for the International Year of Sound 2020"

 * Reviewed by 

This online presentation, hosted by the Journal of the Acoustical Society of America, discusses Wiki4YearOfSound2020 (at meta-wiki), co-sponsored by the Acoustical Society of America, the Centers for Disease Control and Prevention, the National Institute for Occupational Safety and Health (NIOSH) and others. The initiative is an attempt "to make content in acoustics one of the better-developed areas within Wikipedia". The presentation lauds Wikipedia's scope and reach, comparing NIOSH's 150,000 web pages and 8 million visits per year to English Wikipedia's 5.5 million articles and 260 million views per day. It references the results of an engagement for World Hearing Day 2019 in which hearing-related topics garnered tens to hundreds of thousands of views, with the article tinnitus receiving over 340,000 views. It also discusses NIOSH's many other engagements with Wiki Ed since 2015, in which students expanded occupational safety and health content.

Wiki Workshop 2019

 * Overview by Tilman Bayer

The fifth annual "Wiki Workshop", a section of The Web Conference, was held in San Francisco in May this year. Wiki Workshop 2020 will take place in Taipei, Taiwan in April. The call for papers closes on January 17.

Papers presented at Wiki Workshop 2019 included:

"Thanks" feature has "strong positive effect on short-term editor activity"
From the abstract:  "The Thanks feature on Wikipedia [...] is a tool with which editors can quickly and easily send one other positive feedback [....] We study the motivational impacts of “Thanks” because maintaining editor engagement is a central problem for crowdsourced repositories of knowledge such as Wikimedia. Our main findings are that most editors have not been exposed to the Thanks feature (meaning they have never given nor received a thank), thanks are typically sent upwards (from less experienced to more experienced editors), and receiving a thank is correlated with having high levels of editor engagement. [...] We empirically demonstrate that receiving a thank has a strong positive effect on short-term editor activity across the board and provide preliminary evidence that thanks could compound to have long-term effects as well." See also the research project page on Meta-wiki.

"What’s in the Content of Wikipedia’s Article for Deletion Discussions? Towards a Visual Analytic Approach"
From the paper:  "we retrieved about 40,000 deletion discussion content from Wikipedia web pages, cleaned and stored the content of 39,177 discussions into a structured discussion database. With this cleaned and structured dataset, the automatic processing and analysis of the discussion content becomes more manageable to the researchers. With this database, we developed interactive visualizations that offer insights on how the outcomes of the articles are related to different aspects of the discussions, including the types of votes, the mentioned policies, and the categories of the articles (e.g., art, people, sport, etc.). [...]

According to Wikipedia's AfD policy, the decision regarding a proposed article should be based on the participants' rationales in their votes, as opposed to simply following the majority vote based on the total number of different opinions in the discussion. Curious to find out whether the actual decision-making of AfD discussions follows this policy, we visualized the AfD discussion outcomes and the number of keep/delete opinions in the discussions through an interactive heat map. [...] This interactive visualization is accessible at http://www.mandanemedia.com/afd/view/diagram1.php. [... The resulting figure] indicates that the outcome of an AfD decision is in general consistent with the majority vote rule: the more keep votes than delete votes a discussion has, the more likely its outcome is to keep the article. [... However, part of the diagram] illustrates that the decision is not made simply by majority vote in the discussion. [...]

To better understand the distribution of the AfD discussions according to the number of keep and delete votes, we developed another interactive heat map that paints the color of the cell according to the total number of the discussions in the cell (http://www.mandanemedia.com/afd/view/diagram2.php ). [...]

We developed an interactive visualization to offer an overall understanding about the policies that the participants mentioned during the AfD discussion (http://www.mandanemedia.com/afd/view/diagram3.php ). Specifically, we used a sunburst diagram to represent the policies mentioned in the discussion content and their relative frequencies. [...]

Interested in the distribution of various categories in the proposed articles, we developed another sunburst diagram that shows the categories of the articles proposed for delete (http://www.mandanemedia.com/afd/view/diagram4.php ) [...]

We developed an interactive visualization to offer an overall mapping between the articles' categories and the policies mentioned in their AfD discussions (http://www.mandanemedia.com/afd/view/diagram5.php )."

See also the video of a related presentation by one of the authors at the July 2019 Wikimedia Research Showcase.
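The two heat maps described in the quoted passage rest on a simple aggregation: group discussions by their number of keep and delete votes, then record per cell either the share of articles kept or the number of discussions. The following is a minimal sketch of that aggregation, not the authors' code. The record fields (`keep`, `delete`, `outcome`) and the toy data are hypothetical and are not taken from the paper's database.

```python
from collections import defaultdict


def outcome_grid(discussions):
    """Group discussions by (keep votes, delete votes).

    Returns a dict mapping each cell to (share of discussions closed
    as keep, number of discussions in the cell).
    """
    totals = defaultdict(int)
    kept = defaultdict(int)
    for d in discussions:
        cell = (d["keep"], d["delete"])
        totals[cell] += 1
        if d["outcome"] == "keep":
            kept[cell] += 1
    return {cell: (kept[cell] / totals[cell], totals[cell]) for cell in totals}


# Hypothetical toy records, for illustration only.
sample = [
    {"keep": 3, "delete": 1, "outcome": "keep"},
    {"keep": 3, "delete": 1, "outcome": "delete"},
    {"keep": 0, "delete": 4, "outcome": "delete"},
]
grid = outcome_grid(sample)
```

A cell whose keep share differs from what a pure majority count would predict, such as the (3, 1) cell in the toy data, is the kind of case the authors point to when noting that decisions are not made simply by majority vote.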



An awareness campaign in India did not affect Wikipedia pageviews, but a new software feature did
From the abstract:  "Understanding how various external campaigns or events affect readership on Wikipedia is important to efforts aimed at improving awareness and access to its content. In this paper, we consider how to build time-series models aimed at predicting page views on Wikipedia with the goal of detecting whether there are significant changes to the existing trends. We test these models on two different events: a video campaign aimed at increasing awareness of Hindi Wikipedia in India and the page preview feature roll-out—a means of accessing Wikipedia content without actually visiting the pages—on English and German Wikipedia. Our models effectively estimate the impact of page preview roll-out [as independently determined via A/B tests], but do not detect a significant change following the video campaign in India."
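The general idea of checking whether pageviews depart from an existing trend after an event can be illustrated with a deliberately simple baseline. The sketch below is an assumption-laden stand-in, not the models used in the paper: it fits a weekday-average baseline on the pre-event period and computes a z-score for the mean deviation afterwards. The function names, the threshold of 3 and the synthetic view counts are all hypothetical.

```python
from statistics import mean, stdev


def weekly_baseline(pre_views):
    """Average views for each position in the 7-day cycle (index mod 7)."""
    return [mean(pre_views[d::7]) for d in range(7)]


def detect_shift(pre_views, post_views, z_threshold=3.0):
    """Return (z-score, flag) for the mean post-event deviation from the baseline.

    Assumes daily counts, at least one full week of pre-event data,
    and non-constant pre-event residuals.
    """
    baseline = weekly_baseline(pre_views)
    pre_resid = [v - baseline[i % 7] for i, v in enumerate(pre_views)]
    offset = len(pre_views)  # keep weekday alignment across the event date
    post_resid = [v - baseline[(offset + i) % 7] for i, v in enumerate(post_views)]
    # Standard error of the mean of the post-event residuals.
    std_err = stdev(pre_resid) / len(post_views) ** 0.5
    z = mean(post_resid) / std_err
    return z, abs(z) > z_threshold


# Synthetic data: a weekly pattern with a small alternating jitter.
pattern = [100, 110, 120, 115, 105, 90, 80]


def synthetic(day):
    return pattern[day % 7] + (2 if day % 2 == 0 else -2)


pre = [synthetic(day) for day in range(28)]
post_same = [synthetic(day) for day in range(28, 42)]
post_drop = [synthetic(day) - 10 for day in range(28, 42)]

z_same, flag_same = detect_shift(pre, post_same)
z_drop, flag_drop = detect_shift(pre, post_drop)
```

In this toy setup the unchanged series is not flagged, while the series with a sustained drop of 10 views per day is. Real pageview data would also need handling for trend, holidays and autocorrelation, which this sketch ignores.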

"How Partisanship and Perceived Political Bias Affect Wikipedia Entries of News Sources"
From the abstract:  "Whether political bias affects journalism standards appears to be a debated topic with no clear consensus. Meanwhile, labels such as “far-left” or “alt-right” are highly contested and may become cause for prolonged edit wars on the Wikipedia pages of some news sources. In this paper, we try to shine a light into this phenomenon and its extent, in order to start a conversation within the Wikipedia community about transparent processes for assigning political orientation and journalistic reliability labels to news sources, especially to unfamiliar ones, which users would be more likely to verify by looking them up." See also our coverage of a related recent paper involving one of the authors: "The Importance of Wikipedia in Assessing News Source Credibility"

"A Graph-Structured Dataset for Wikipedia Research"
From the abstract:  "While [Wikipedia is] a scientific treasure, the large size of the dataset hinders pre-processing and may be a challenging obstacle for potential new studies. This issue is particularly acute in scientific domains where researchers may not be technically and data processing savvy. On one hand, the size of Wikipedia dumps is large. It makes the parsing and extraction of relevant information cumbersome. On the other hand, the API is straightforward to use but restricted to a relatively small number of requests. The middle ground is at the mesoscopic scale, when researchers need a subset of Wikipedia ranging from thousands to hundreds of thousands of pages but there exists no efficient solution at this scale.

In this work, we propose an efficient data structure to make requests and access subnetworks of Wikipedia pages and categories. We provide convenient tools for accessing and filtering viewership statistics or “pagecounts” of Wikipedia web pages. [...] The dataset and deployment guidelines are available on the LTS2 website https://lts2.epfl.ch/Datasets/Wikipedia/ " See also coverage of an earlier paper by the same authors.
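To make the notion of accessing a "subnetwork" of pages concrete, the sketch below collects every page within a fixed number of link hops of a starting page. It is an illustration only: a plain in-memory dictionary stands in for whatever storage the authors' dataset uses, and the function name, the link data and the hop limit are hypothetical.

```python
from collections import deque


def neighborhood(links, start, max_hops):
    """Breadth-first search over outgoing links.

    Returns a dict mapping each reachable page to its hop distance
    from `start`, limited to `max_hops`.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if seen[page] == max_hops:
            continue  # do not expand pages at the hop limit
        for target in links.get(page, ()):
            if target not in seen:
                seen[target] = seen[page] + 1
                queue.append(target)
    return seen


# Hypothetical link structure, for illustration only.
links = {
    "Acoustics": ["Sound", "Tinnitus"],
    "Sound": ["Hearing"],
    "Hearing": ["Ear"],
}
result = neighborhood(links, "Acoustics", 2)
```

With a limit of two hops, "Ear" is excluded because it is three links away from "Acoustics".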

"Searching News Articles Using an Event Knowledge Graph Leveraged by Wikidata"
From the abstract:  "News agencies produce thousands of multimedia stories describing events happening in the world that are either scheduled such as sports competitions, political summits and elections, or breaking events such as military conflicts, terrorist attacks, natural disasters, etc. When writing up those stories, journalists refer to contextual background and to compare with past similar events. However, searching for precise facts described in stories is hard. In this paper, we propose a general method that leverages the Wikidata knowledge base to produce semantic annotations of news articles."

"Inferring Advertiser Sentiment in Online Articles using Wikipedia Footnotes"
This paper presents a method to automatically identify online articles (e.g. in news media) that report negatively on a brand, starting from the dataset of citations in the "Criticism" section of the Wikipedia article about that brand (e.g. Uber). The stated purpose is to use "an online user’s history of viewed articles" (which, as the authors observe, is typically accessible to "online advertising platforms [which are] in partnerships with media companies") to target ads to those users who have been exposed to negative sentiment about the brand.

"Learning to Map Wikidata Entities To Predefined Topics"
From the abstract:  "Given a piece of text, [an entity disambiguation and linking system (EDL)] links words and phrases to entities in a knowledge base, where each entity defines a specific concept. Although extracted entities are informative, they are often too specific to be used directly by many applications. These applications usually require text content to be represented with a smaller set of predefined concepts or topics, belonging to a topical taxonomy, that matches their exact needs. In this study, we aim to build a system that maps Wikidata entities to such predefined topics. [...] we propose an ensemble system that effectively combines individual methods and yields much better performance, comparable with human annotators." The resulting tool is part of a "system for information extraction from noisy user generated text such as that available on social media" by the company Lithium Technologies (now Khoros), the authors' employer at the time the research was done.

English Wikipedia's medical articles are of higher quality than those of Portuguese Wikipedia
From the abstract:  "... we evaluate and compare the quality of medicine-related articles in the English and Portuguese Wikipedia. For that we use metrics such as authority, completeness, complexity, informativeness, consistency, currency and volatility, and domain-specific measurements [...]. We were able to conclude that the English articles score better across most metrics than the Portuguese articles."

"Understanding Travel from Web Queries Using Domain Knowledge from Wikipedia"
From the abstract:  "... we are interested in extracting specific Wikipedia entities associated with hospitality and travel along with relevant metadata. [...] For a hotel, that would mean, we extract the parent company, say ‘Wyndham’, and then the list of establishments owned by the parent such as ‘Days Inn’, ‘La Quinta’, ‘Ramada’, ‘Super 8’ and ‘Wyndham Grand’. For each of these, we extract their tier of service such as ‘upscale’, ‘mid scale’, ‘boutique’, and ‘economy’. [...] We start with the Wikipedia page: ‘List of lists of lists’, which points to several other lists. [...] In addition to the lists, we also used the Category pages such as ‘Category:Vehicle rental companies’ [sic] and ‘Category:Travel and tourism templates’ as secondary starting points [...]. These pages yield the company names and the brand names. Next, we derive info such as the tier of service and the locations, where appropriate, using the sections, info boxes and subcategories within the pages for each of the brands we extracted. [...] Remarkably, the travel activities annotated using Wikipedia extractions agreed with [human] editorial review over 65% of the time."

"Open Personalized Navigation on the Sandbox of Wiki Pages"
From the abstract:  "we present a proof-of-concept of a visual navigation tool for a personalized “sandbox” of Wiki pages. The navigation tool considers multiple groups of algorithmic parameters and adapts to user activity via graphical user interfaces. The output is a 2D map of a subset of [the] Wikipedia pages network which provides a different and broader visual representation – a map – in the neighborhood (according to some metric) of the pages around the page currently displayed in a browser."

See also previous coverage of other papers from Wiki Workshop 2019:
 * "Understanding the Signature of Controversial Wikipedia Articles through Motifs in Editor Revision Networks"
 * "Finding Prerequisite Relations using the Wikipedia Clickstream"

Briefly

 * The Wikimedia Foundation's Research team has published an overview of its activities during the past six months, intended to become the first in a series of biannual reports published every June and December.
 * Researchers' concerns about masking or abolishing IP edits: A recent proposal by Wikimedia Foundation staff to hide the IP addresses of users who choose to contribute without an account (known as "IP editors" or "anonymous editors") has been met with community concerns that such a change will harm abuse mitigation, and even with some calls to abandon the option to edit without an account altogether. In turn, a group of eight academic researchers around Benjamin Mako Hill published an open letter presenting evidence for their belief "that the negative consequences of blocking edits from non-accountholders would far outweigh the benefits." Separately, there are also concerns about the effect of IP masking on research that uses IP edit data (e.g. to geolocate anonymous edits).
 * See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
 * Compiled by Tilman Bayer

In its war coverage, "Wikipedia shows greater conciliatory potential" than museums
From the "Conclusion" section:  "This article has found that the Wikipedia entries on the war between China and Japan in the 1930s and 1940s differ substantially from the narratives in traditional technologies of memory, such as museums, in that they do not focus to the same extent on particular experiences or episodes. In this sense, Wikipedia shows greater conciliatory potential as it apparently fulfils the first of the two steps necessary for reconciliation, namely that the two sides remember more or less the same events and episodes."

"The Global Popularity of William Shakespeare in 303 Wikipedias"
From the abstract:  "... for a plurality of Wikipedias, almost fifty, Romeo and Juliet is number one in pageviews, while in many, but fewer others, it is Hamlet. In seven Wikipedias, on the other hand, Macbeth is number one, while Julius Caesar is first in still several others. Othello, King Lear, A Midsummer Night’s Dream, As You Like It, and Antony and Cleopatra are the only other rare leaders in specific Wikipedias. In short, this article will present the basic popular global reception information about all of Shakespeare’s works, filling a lacuna in critical research ..."

"Mapping the backbone of the Humanities through the eyes of Wikipedia"
From the "Highlights" section:  "We analyze how scientific knowledge is established in the field of the Humanities. ... The citation average to Humanities articles in Wikipedia is lower than the general. ... Of the 25 most cited journals on Wikipedia, none is open access."