Wikipedia:Wikipedia Signpost/2016-08-04/Recent research



Making it easier to navigate within article networks via better wikilinks

 * Reviewed by Jonathan Morgan and Tilman Bayer

A paper to be presented at the 2016 OpenSym conference titled "Evaluating and Improving Navigability of Wikipedia: A Comparative Study of Eight Language Editions" attempts to determine which links should appear at or near the top of a Wikipedia article. Using the Wikipedia clickstream data, the four researchers found that on the English Wikipedia, an average of 30% of article traffic comes via links from other Wikipedia articles. They cite previous research (coauthored by some of the same authors, as was a third related paper) which has shown that most readers focus their attention on the content of an article that appears "above the fold", usually just the lede section and the top of the infobox. This suggests that readers could browse related content more easily if links to the most relevant related articles appeared near the top of an article, which would improve the overall navigability of Wikipedia. The researchers' goal is to make it easier for readers to move between related pages by determining which articles are most closely related to a given article and should therefore be linked in the top section, where readers are most likely to see them.

The researchers determined the relationships between articles without using keywords or categories. Instead, they generated a type of directed network graph called a bow tie model, which determines how closely related a linked article is to the article that links to it, based on the relationships between other articles that link to and from both of them. By looking at the links present within different 'views' of a large set of articles (e.g. the first lede paragraph, the whole lede section, the infobox, or the entire page) across different wikis, the researchers could quantify how much related content is accessible from the part of an article that readers are most likely to see. They concluded by describing how their research findings could be used to create a system for recommending links that should be included in article ledes and infoboxes.
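To make the bow tie idea more concrete, the following is a minimal sketch of the standard bow-tie decomposition of a directed link graph: a core (the largest strongly connected component), an "in" part that can reach the core, an "out" part reachable from the core, and everything else. It is not the authors' code, and the review above does not describe their implementation; the function names `reachable` and `bow_tie` and the toy edge list are assumptions made purely for illustration.

```python
from collections import defaultdict


def reachable(adj, sources):
    """Return every node reachable from `sources` by following edges in `adj`."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(edges):
    """Split a directed graph into core / in / out / other (hypothetical helper)."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    nodes = set()
    for src, dst in edges:
        fwd[src].add(dst)
        bwd[dst].add(src)
        nodes.update((src, dst))

    # Core: the largest strongly connected component. A node's component is
    # the set of nodes it can reach that can also reach it. This is simple
    # rather than fast, which is acceptable for a small illustration.
    core, unassigned = set(), set(nodes)
    while unassigned:
        node = next(iter(unassigned))
        scc = reachable(fwd, [node]) & reachable(bwd, [node])
        unassigned -= scc
        if len(scc) > len(core):
            core = scc

    out_part = reachable(fwd, core) - core  # reachable from the core
    in_part = reachable(bwd, core) - core   # can reach the core
    other = nodes - core - in_part - out_part
    return {"core": core, "in": in_part, "out": out_part, "other": other}


# Invented toy link graph (not real Wikipedia data): (source, target) pairs.
toy_edges = [
    ("A", "B"), ("B", "C"), ("C", "A"),  # A, B, C link in a loop: the core
    ("X", "A"),                          # X links into the core
    ("C", "Y"),                          # the core links out to Y
    ("P", "Q"),                          # disconnected from the core
]

parts = bow_tie(toy_edges)
for name in ("core", "in", "out", "other"):
    print(name, sorted(parts[name]))
```

Under these assumptions, running the same decomposition once on the links of the whole page and once on only the links found in the lede or infobox would give one rough way to compare how much of the network stays navigable in each view.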

The paper also sheds new light on the widely discussed phenomenon that by clicking on the first link in a Wikipedia article and repeating the process, one will eventually arrive at the article on philosophy:
 * "For the English Wikipedia we indeed find that the vast majority of articles (97%) leads to the cycle containing the philosophy article. This finding also holds true for a large majority of articles in the German, French and Russian Wikipedias. For the Spanish and Italian Wikipedias, the dominant cycle contains the article on psychology, while for Dutch, the dominant cycle consists of the articles on knowledge and know-how."
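The first-link experiment can be sketched as a walk over a mapping from each article to the first link in its text, stopping as soon as a page repeats. This is only an illustration under stated assumptions: the helper names `follow_first_links` and `share_reaching` are hypothetical, the toy mapping is invented rather than taken from any Wikipedia dump, and the paper's actual procedure may differ.

```python
def follow_first_links(first_link, start):
    """Follow first links from `start` until a page repeats or the chain ends.

    Returns (path, cycle): `path` lists the pages visited in order, and
    `cycle` lists the pages of the loop the walk falls into (empty if the
    walk stops at a page that has no first link).
    """
    path = []
    position = {}  # page -> index in `path`
    page = start
    while page is not None and page not in position:
        position[page] = len(path)
        path.append(page)
        page = first_link.get(page)
    cycle = path[position[page]:] if page is not None else []
    return path, cycle


def share_reaching(first_link, target):
    """Fraction of pages whose walk ends in a cycle that contains `target`."""
    pages = list(first_link)
    hits = sum(
        1 for page in pages
        if target in follow_first_links(first_link, page)[1]
    )
    return hits / len(pages)


# Invented toy mapping: page -> first link in its text (None = no links).
toy_first_link = {
    "Page 1": "Page 2",
    "Page 2": "Page 3",
    "Page 3": "Knowledge",
    "Page 4": "Page 3",
    "Knowledge": "Philosophy",
    "Philosophy": "Knowledge",
    "Page 5": None,
    "Page 6": "Page 5",
}

path, cycle = follow_first_links(toy_first_link, "Page 1")
print(path)   # pages visited, in order
print(cycle)  # the loop the walk ends in
print(share_reaching(toy_first_link, "Philosophy"))
```

In this toy mapping, six of the eight pages end in the loop containing "Philosophy", while "Page 5" and "Page 6" dead-end. The percentages quoted above come from the paper's own analysis, not from a sketch like this one.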

New research journal about Wikipedia and higher education
A new journal called Wiki Studies is being launched. As explained by founding editor Bob Cummings (a professor of Writing and Rhetoric at the University of Mississippi and author of a 2009 book titled "Lazy virtues: teaching writing in the age of Wikipedia"):
 * "Wiki Studies is an interdisciplinary, open-access, peer-reviewed journal focusing on the intersection of Wikipedia and higher education. We are interested in most all of the same topics hosted on the research listserv and the newsletter, including articles about pedagogical practices, epistemology, bias, mission, and reliability. We will not charge for submission or publication, and will offer open access to readers. We will host on Open Journal Systems."

The submission deadline for the first annual volume, envisaged to appear in March 2017, is 31 December 2016.

Conferences and events
See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines.

Other recent publications
A list of other recent publications that could not be covered in time for this issue—contributions are always welcome for reviewing or summarizing newly published research.
 * "Wikipedia's semantics of openness: Ideas and implementations of the Internet's extended participation potentials in the context of collaborative knowledge production" (sociology dissertation in German, original title: "Die Offenheitssemantik der Wikipedia. Ideen und Verwirklichungen der erweiterten Beteiligungspotentiale des Internets im Kontext kollaborativer Wissensproduktion").
 * "Towards a (de)centralization-based typology of peer production" From the paper: "This paper proposes a typology of peer-production platforms, based on the centralisation/decentralisation levels of several of their design features. Wikipedia [presented as an example of a platform with "centralised architecture" but "decentralised governance"] is a case of governance being semi-distributed among several nodes structured in semi-centralised clusters (administrators and editors control what is accepted or rejected in case of conflict) or decentralised local networks (chapters for choosing projects and thematics on which to focus, e.g. the Wikicheese project of Wikimedia France)."
 * "Developing an annotator for Latin texts using Wikipedia" From the abstract: "Although Wikipedia is an excellent resource from which to extract many kinds of information (morphological, syntactic and semantic) to be used in NLP tasks on modern languages, it was rarely applied to perform NLP tasks for the Latin language. The work presents the first steps of the developement of a POS Tagger based on the Latin version of Wiktionary and a Wikipedia-based semantic annotator."
 * "Candidate searching and key coreference resolution for Wikification" From the abstract: "Wikification is the task to link textual mentions in a document to articles in Wikipedia. It comprises three main steps, namely, mention recognition, candidate generation, and entity linking. For candidate generation, existing methods use hyperlinks in Wikipedia or match a mention of discourse to Wikipedia article titles. They may miss the correct target entity and thus fail to link the mention to Wikipedia. In this paper, we propose to use a mention as a query and Wikipedia [sic] own search engine to look for additional candidate articles. [...] our proposed method outperforms or achieves competitive results in comparison to some state-of-the-art systems, but is simpler and uses less features."
 * "Generating article placeholders from Wikidata for Wikipedia—increasing access to free and open knowledge" From the abstract: "The major objective of this thesis is to increase the access to open and free knowledge in Wikipedia by developing a MediaWiki extension called ArticlePlaceholder. ArticlePlaceholders are content pages in Wikipedia auto-generated from information provided by Wikidata. [...] This thesis [...] includes the personas, scenarios, user-stories, non-functional and functional requirements for the requirement analysis. The analysis was done in order to implement the features needed to achieve the goal of providing more information for under-resourced languages. The implementation of these requirements is the main part of the following thesis."
 * "Analysing the Usage of Wikipedia on Twitter: Understanding Inter-Language Links" From the abstract: "In this paper, we analyse links within tweets referring to a Wikipedia of a language different from the tweet's language. [...] We find that the main cause for inter-language links is the non-existence of the article in the tweet's language. Furthermore, we observe that the quality of the tweeted articles is constantly higher in comparison to their counterparts, suggesting that users choose the article of higher quality even when tweeting in another language. Moreover, we find that English is the most dominant target for inter-language links." (See also presentation slides and our coverage of a preceding paper: "Wikipedia and Twitter".)

Wiki Workshop 2016 at International World Wide Web Conference (WWW)

 * "Wikipedia Tools for Google Spreadsheets" From the abstract: "With the Wikipedia Tools for Google Spreadsheets, we have created a toolkit that facilitates working with Wikipedia data from within a spreadsheet context. We make these tools available as open-source on GitHub, released under the permissive Apache 2.0 license." (See also meta:Wikipedia Tools for Google Spreadsheets)
 * "Assessing the Quality of Wikipedia Editors through Crowdsourcing" From the abstract: "...we propose a method for assessing the quality of Wikipedia editors. By effectively determining whether the text meaning persists over time, we can determine the actual contribution by editors. This is used in this paper to detect vandal. However, the meaning of text does not always change if a term in the text is added or removed. Therefore, we cannot capture the changes of text meaning automatically, so we cannot detect whether the meaning of text survives or not. To solve this problem, we use crowdsourcing to manually detect changes of text meaning. In our experiment, we confirmed that our proposed method improves the accuracy of detecting vandals by about 5%."
 * "Finding Structure in Wikipedia Edit Activity: An Information Cascade Approach" From the abstract: "This paper documents a study of the real-time Wikipedia edit stream containing over 6 million edits on 1.5 million English Wikipedia articles, during 2015.[...] Our findings show that by constructing information cascades between Wikipedia articles using editing activity, we are able to construct an alternative linking structure in comparison to the embedded links within a Wikipedia page. This alternative article hyperlink structure was found to be relevant in topic, and timely in relation to external global events (e.g., political activity)."
 * "With a Little Help from my Neighbors: Person Name Linking Using the Wikipedia Social Network" From the abstract: "In this paper, we present a novel approach to person name disambiguation and linking that uses a large-scale social network extracted from the English Wikipedia."
 * "Cleansing Wikipedia Categories using Centrality" From the abstract: "We propose a novel general technique aimed at pruning and cleansing the Wikipedia category hierarchy, with a tunable level of aggregation. Our approach is endogenous, since it does not use any information coming from Wikipedia articles, but it is based solely on the user-generated (noisy) Wikipedia category folksonomy itself." See also https://github.com/corradomonti/wikipedia-categories
 * "Learning Web Queries For Retrieval of Relevant Information About an Entity in a Wikipedia Category" From the abstract: "... we present a novel method to obtain a set of most appropriate queries for retrieval of relevant information about an entity from the Web. Using the body text of existing articles in a Wikipedia category, we generate a set of queries capable of fetching the most relevant content for any entity belonging to that category. We find the common topics discussed in the articles of a category using Latent Semantic Analysis (LSA) and use them to formulate the queries. Using Long Short-Term Memory (LSTM) neural network, we reduce the number of queries by removing the less sensible ones and then select the best ones out of them."
 * "On the Retrieval of Wikipedia Articles Containing Claims on Controversial Topics" From the abstract: "This work presents a novel claim-oriented document retrieval task. For a given controversial topic, relevant articles containing claims that support or contest the topic are retrieved from a Wikipedia corpus."
 * "Automatic Discovery of Emerging Trends using Cluster Name Synthesis on User Consumption Data" From the abstract: "Technically it is possible for the telecommunication companies to recommend suitable advertisements if they can classify the web sites browsed by their customers into classes like sports, e-commerce, social networking, streaming media etc. Another problem is to classify a new website when it doesn't belong to any of the existing clusters. In this paper, the authors are going to propose a method to automatically classify the websites and synthesize the cluster names in case it doesn't belong to any of the predefined clusters. [...] This proposed system uses the Wikipedia data [from articles about such websites] to construct the document for the websites browsed by the customers."
 * "Applying a Multi-Level Modeling Theory to Assess Taxonomic Hierarchies in Wikidata" From the abstract: "In this paper, we address the quality of taxonomic hierarchies in Wikidata. We focus on taxonomic hierarchies with entities at different classification levels (particular individuals, types of individuals, types of types of individuals, etc.). We use an axiomatic theory for multi-level modeling to analyze current Wikidata content, and identify a significant number of problematic classification and taxonomic statements. The problems seem to arise from an inadequate use of instantiation and subclassing in certain Wikidata hierarchies."