Wikipedia:Wikipedia Signpost/2020-05-31/Recent research

Automatic detection of undisclosed paid editing
In a paper published in the proceedings of last month's (virtual) The Web Conference, four researchers from Boise State University (collaborating with an English Wikipedia administrator) present a machine learning framework for "automatically detecting Wikipedia undisclosed paid contributions, so that they can be quickly identified and flagged for removal."

Their approach is based on constructing two datasets, of articles and editors, each consisting of undisclosed paid editing (UPE; as previously confirmed by Wikipedia administrators) and a control group of articles/users assumed to be "benign" (i.e., not the result of, or engaged in, UPE). In more detail, the authors started from a previously published dataset that had collected the results of 23 past sockpuppet investigations, yielding 1,006 known UPE accounts, and added 98 manually determined UPE accounts. A sample of articles newly created in March 2019 (limited to those created by users with fewer than 200 edits who were manually assessed as not being engaged in paid editing) was used to construct the benign parts of the two datasets.

For both articles and editors, the authors tested three different classification algorithms (logistic regression, support vector machine, and random forest) on a relatively simple set of features (e.g., for articles, the number of categories, or for editors, the average time between two consecutive edits made by the user). Still, the resulting method appears quite effective for detecting undisclosed paid articles:  "when we combine both article and user-based features, we improve our classification results upon each group of features individually: AUROC of 0.983 and average precision of 0.913. This means that both article content and information about the account that created the article are important for detecting undisclosed paid articles."
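For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how AUROC and average precision can be computed from a classifier's scores. The labels and scores are made-up toy values, not data from the paper, and the function names are purely illustrative; the authors presumably relied on a standard machine learning library rather than code like this.

```python
def auroc(labels, scores):
    """Probability that a randomly chosen positive is scored higher than
    a randomly chosen negative (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k taken over every rank k that holds a positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


# Hypothetical toy example: 1 = undisclosed paid article, 0 = benign.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

print(round(auroc(toy_labels, toy_scores), 3))
print(round(average_precision(toy_labels, toy_scores), 3))
```

Both metrics depend only on how the classifier ranks the examples, which is why they are commonly reported for imbalanced detection tasks such as this one.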

Among the most effective features was "the percentage of edits made by a user that are less than 10 bytes. Undisclosed paid editors try to become autoconfirmed users; thus they typically make around 10 minor edits before creating a promotional article."
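As a rough illustration of the two editor features mentioned above (the share of very small edits and the average time between consecutive edits), here is a minimal sketch. The input format, the function names and the reading of "less than 10 bytes" as the absolute size change of an edit are all assumptions made for this example; the paper's exact definitions may differ.

```python
from datetime import datetime


def small_edit_share(size_deltas, threshold=10):
    """Fraction of edits whose absolute size change is below `threshold` bytes.
    Treating 'size' as the absolute byte difference is an assumption."""
    if not size_deltas:
        return 0.0
    small = sum(1 for delta in size_deltas if abs(delta) < threshold)
    return small / len(size_deltas)


def mean_gap_seconds(timestamps):
    """Average time in seconds between consecutive edits of one user."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return 0.0
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical edit history: four tiny edits, then one large article creation.
toy_deltas = [3, -2, 5, 8, 1200]
toy_times = [
    datetime(2019, 3, 1, 12, 0, 0),
    datetime(2019, 3, 1, 12, 1, 0),
    datetime(2019, 3, 1, 12, 3, 0),
]

print(small_edit_share(toy_deltas))
print(mean_gap_seconds(toy_times))
```

An account that makes a burst of trivial edits shortly before creating a long article would score high on the first feature and low on the second, which matches the pattern the authors describe.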

Overall, the results appear to hold high promise for a practical application that could be of significant assistance to the editing community in combating the abuse of Wikipedia for promotional purposes, which is an ongoing and pervasive problem (compare e.g. this month's Signpost coverage of a recent investigation on the French Wikipedia). Obviously, any output of such an algorithm would need to be vetted manually, considering the relatively small but (in absolute terms) still considerable number of false positives. The paper contains little discussion of possible limitations of the sockpuppet investigations dataset used (e.g., how representative it might be of UPE efforts overall, as opposed to focused on the activities of some specific PR agencies), leaving open the possibility of overfitting.

The paper also includes an analysis of the network of the articles in the dataset, with two articles connected by an edge if the same user had edited both (see figure), but its results do not appear to have been used in the detection method. Among the findings: "there is less user collaboration among positive articles [as measured by local clustering coefficient and PageRank]. UPEs only work on a limited number of Wikipedia titles that they are interested in promoting, whereas genuine users edit more pages related to their field of expertise."
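The network construction described above can be sketched as follows: articles are nodes, two articles share an edge when at least one user edited both, and the local clustering coefficient of an article is the share of its neighbour pairs that are themselves connected. The article and user names are invented toy data, and this is only an illustration of the general technique, not the authors' code.

```python
from itertools import combinations


def build_article_graph(editors_by_article):
    """Connect two articles when at least one user has edited both."""
    graph = {article: set() for article in editors_by_article}
    for a, b in combinations(editors_by_article, 2):
        if editors_by_article[a] & editors_by_article[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def local_clustering(graph, node):
    """Share of a node's neighbour pairs that are connected to each other."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2 * links / (k * (k - 1))


# Hypothetical toy data: article title -> set of users who edited it.
toy_editors = {
    "A": {"u1", "u2"},
    "B": {"u1"},
    "C": {"u2", "u3"},
    "D": {"u3"},
    "E": {"u1", "u2"},
}

toy_graph = build_article_graph(toy_editors)
for article in sorted(toy_graph):
    print(article, round(local_clustering(toy_graph, article), 3))
```

In this toy graph, article D has a single neighbour and therefore a coefficient of zero, the kind of weakly embedded article that the quoted finding associates with UPE.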

The authors highlight the importance of sockpuppets, observing that "undisclosed paid editors typically act as a group of sockpuppet accounts" and basing most of their ground truth dataset on sockpuppet cases. A brief literature review covers previous research on the automatic detection of sockpuppets on Wikipedia, including a paper from the 2016 Web Conference presenting a method able "to detect 99% of fake accounts," and an earlier stylometric method (cf. our 2013 coverage: "Sockpuppet evidence from automated writing style analysis" / "New sockpuppet corpus"). An ongoing research project by the Wikimedia Foundation (presented at last year's Wikimania) concerns the practical implementation of such a tool.

Wiki Workshop 2020
As part of The Web Conference, the annual Wiki Workshop "[brought] together researchers exploring all aspects of Wikipedia, Wikidata, and other Wikimedia projects", this year held as a one-day Zoom meeting with over 100 participants. Among the papers (see also proceedings):

"A Deeper Investigation of the Importance of Wikipedia Links to the Success of Search Engines"
From the abstract: "We find that Wikipedia links are extremely common in important search contexts, appearing in 67–84% of all SERPs [search engine results pages] for common and trending queries, but less often for medical queries. Furthermore, we observe that Wikipedia links often appear in 'Knowledge Panel' SERP elements and are in positions visible to users without scrolling, although Wikipedia appears less in prominent positions on mobile devices. Our findings reinforce the complementary notions that (1) Wikipedia content and research has major impact outside of the Wikipedia domain and (2) powerful technologies like search engines are highly reliant on free content created by volunteers." See also slides

"Layered Graph Embedding for Entity Recommendation using Wikipedia in the Yahoo! Knowledge Graph"
From the abstract:  "we describe an embedding-based entity recommendation framework for Wikipedia that organizes Wikipedia into a collection of graphs layered on top of each others, learns complementary entity representations from their topology and content, and combines them with a lightweight learning-to-rank approach to recommend related entities on Wikipedia. [...]. Balancing simplicity and quality, this framework provides default entity recommendations for English and other languages in the Yahoo! Knowledge Graph, which Wikipedia is a core subset of." See also slides

"WikiHist.html: English Wikipedia's Full Revision History in HTML Format"
From the abstract:  "researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is publicly available exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions. We solve these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and release WikiHist.html, English Wikipedia’s full revision history in HTML format. We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia’s hyperlinks, showing that over half of the wiki links present in HTML are missing from raw wikitext, and that the missing links are important for user navigation." See also slides and the underlying 7 terabyte dataset with code

"Collaboration of Open Content News in Wikipedia: The Role and Impact of Gatekeepers"
From the abstract:  "In the current proposed study, I aim to understand this new model of content generation process through the lens of gatekeepers in social media platforms such as Wikipedia. Specifically, I aim to discover ways to identify gatekeepers and assess their impact on information quality and content polarization." See also slides

"Domain-Specific Automatic Scholar Profiling Based on Wikipedia"
From the abstract:  "to extract some properties of a given scholar, structured data, like infobox in Wikipedia, are often used as training datasets. But it may lead to serious mis-labeling problems, such as institutions and alma maters, and a Fine-Grained Entity Typing method is expected. Thus, a novel Relation Embedding method based on local context is proposed to enhance the typing performance. Also, to highlight critical concepts in selective bibliographies of scholars, a novel Keyword Extraction method based on Learning to Rank is proposed to bridge the gap that conventional supervised methods fail to provide junior scholars with relative importance of keywords." See also slides

"Matching Ukrainian Wikipedia Red Links with English Wikipedia’s Articles"
From the abstract:  "we propose a way to match red links in one Wikipedia edition to existent pages in another edition. We define the task as a Named Entity Linking problem because red link titles are mostly named entities. We solve it in a context of Ukrainian red links and English existing pages. We created a dataset of 3171 most frequent Ukrainian red links and a dataset of almost 3 million pairs of red links and the most probable candidates for the correspondent pages in English Wikipedia." See also slides

"Beyond Performing Arts: Network Composition and Collaboration Patterns"
From the abstract:  "we propose the reconstruction and analysis of the collaboration networks of performing artists registered in Wikidata. Our results suggest that different performing arts share similar collaboration patterns, as well as a mechanism of community formation that is consistent with observed social behaviors." See also slides

"Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia"
From the abstract:  "We leverage a large scale natural experiment to study how exogenous content contributions to Wikipedia articles affect the attention they attract and how that attention spills over to other articles in the network. Results reveal that exogenously added content leads to significant, substantial and long-term increases in both content consumption and subsequent contributions. Furthermore, we find significant attention spillover to downstream hyperlinked articles." See also slides

"The Positioning Matters: Estimating Geographical Bias in the Multilingual Record of Biographies on Wikipedia"
From the abstract:  "This article proposes that an appropriate assessment of the geographical bias in multilingual Wikipedia's content should consider not only the number of articles linked to places, but also their internal positioning –i.e. their location in different languages and their centrality in the network of references between articles. This idea is studied empirically, systematically evaluating the geographic concentration in the biographical coverage of globally recognized individuals (those whose biographies are found in more than 25 language versions of Wikipedia). Considering the internal positioning levels of these biographies, only 5 countries account for more than 62% of Wikipedia's biographical coverage. In turn, the inequality in coverage between countries reaches very high levels, estimated with a Gini coefficient of .84 and a Palma ratio of 207."
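To give a sense of what the two inequality measures in this abstract express, here is a minimal sketch using their standard textbook definitions: the Gini coefficient ranges from 0 (perfect equality) towards 1 (everything held by one unit), and the Palma ratio divides the total held by the top 10% of units by the total held by the bottom 40%. The per-country counts are invented toy numbers, and how exactly the paper weights biographies by their "internal positioning" is not reproduced here.

```python
def gini(values):
    """Gini coefficient via the rank-weighted sum over sorted values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def palma_ratio(values):
    """Total held by the top 10% of units divided by the bottom 40%'s total."""
    xs = sorted(values)
    n = len(xs)
    if n < 10:
        raise ValueError("need at least 10 units to form a top decile")
    bottom = sum(xs[: (4 * n) // 10])
    top = sum(xs[n - n // 10:])
    return top / bottom


# Hypothetical biography counts for ten countries (not from the paper).
toy_counts = [1, 1, 1, 1, 2, 2, 3, 4, 5, 80]

print(round(gini(toy_counts), 2))   # 0.76
print(palma_ratio(toy_counts))      # 20.0
```

Even this small example, in which one country holds 80% of the biographies, yields a lower Gini coefficient and a far lower Palma ratio than the values of .84 and 207 reported in the abstract.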

"Citation Detective: a Public Dataset to Improve and Quantify Wikipedia Citation Quality at Scale"
From the abstract:  "We present Citation Detective, a system designed to periodically run Citation Need models on a large number of articles in English Wikipedia, and release public, usable, monthly data dumps exposing sentences classified as missing citations. [...] We provide an example of a research direction enabled by Citation Detective, by conducting a large-scale analysis of citation quality in Wikipedia, showing that article citation quality is positively correlated with article quality, and that articles in Medicine and Biology are the most well sourced in English Wikipedia." See also code and blog post.

For coverage of some other papers from Wiki Workshop 2020, see last month's issue ("What is trending on (which) Wikipedia?"), and upcoming issues. This blog post about the event covers several non-paper aspects of the schedule, including the keynote by Jess Wade.

Briefly

 * See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
 * Submission deadlines for the call for papers for OpenSym 2020 (to be held online from 25 to 27 August 2020) have been extended to June 15.