Wikipedia:Wikipedia Signpost/2020-12-28/Recent research

Predicting the next move in Wikipedia discussions, based on six million threads
A conference paper titled "Modeling Deliberative Argumentation Strategies on Wikipedia" presents the result of analyzing "the entire set of about six million discussions from all English Wikipedia talk pages".

The authors parsed each thread "to identify its structural components, such as turns [i.e. individual comments], users, and time stamps." The six million talk page sections are found to contain 20 million turns. Each turn is labeled with "four types of metadata [...]: the user tag, a shortcut, an in-line template, and links." The result has been published as the "Webis-WikiDiscussions-18" corpus. Interestingly, only "half of the turns are written by registered users" (i.e. anonymous users have contributed the other half of talk page comments).
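
To make the parsing step above concrete, here is a minimal sketch of how a thread's wikitext might be split into turns with a user and a time stamp. It is not the authors' parser. The signature pattern, the function name `split_turns`, and the one-turn-per-signed-line simplification are all assumptions made for illustration, and signatures of anonymous (IP) users are not handled.

```python
import re

# Assumed signature shape: a [[User:Name...]] link followed, later on the same
# line, by a time stamp such as "10:15, 3 March 2018 (UTC)".
# Real signatures vary widely; this pattern is illustrative only.
SIGNATURE = re.compile(
    r"\[\[User:(?P<user>[^|\]]+)[^\]]*\]\].*?"
    r"(?P<timestamp>\d{2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\))"
)


def split_turns(thread_wikitext):
    """Return one dict per signed line: text, user, timestamp and indent."""
    turns = []
    for line in thread_wikitext.splitlines():
        match = SIGNATURE.search(line)
        if match is None:
            continue  # unsigned lines are skipped in this simplified sketch
        # Leading colons are the usual talk page indentation markers.
        indent = len(line) - len(line.lstrip(":"))
        turns.append({
            "text": line[indent:match.start()].strip(),
            "user": match.group("user"),
            "timestamp": match.group("timestamp"),
            "indent": indent,
        })
    return turns
```

The `indent` field is recorded only because indentation is the usual hint about which turn a comment replies to; whether the paper uses it is not stated here.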

By "user tag", the authors mean a bolded word that some comments start with, such as "Disagree", "Support", or "Conclusion". Such tags are far from universal: they were found in only 100,000 of the 20 million turns. The tags fall into 32 clusters, which "can be grouped into six categories that we see as ‘discourse acts’": Socializing, Providing evidence, Enhancing the understanding, Recommending an act, Asking a question, and Finalizing the discussion.
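
As a rough illustration of the user tags described above, the following sketch pulls a leading bolded word out of a turn's wikitext and counts how often each tag occurs. The regular expression and the function names `extract_user_tag` and `count_tags` are assumptions for this example; the paper's own extraction and clustering procedure may differ.

```python
import re
from collections import Counter

# A turn counts as tagged here if, after optional indentation or list markers,
# it opens with bold wikitext ('''...'''). This rule is an assumption.
USER_TAG = re.compile(r"^[:*#\s]*'''(?P<tag>[^']+)'''")


def extract_user_tag(turn_wikitext):
    """Return the lower-cased leading bold word(s), or None if there is none."""
    match = USER_TAG.match(turn_wikitext)
    return match.group("tag").strip().lower() if match else None


def count_tags(turns):
    """Count user tags over an iterable of turn texts, ignoring untagged turns."""
    tags = (extract_user_tag(turn) for turn in turns)
    return Counter(tag for tag in tags if tag is not None)
```

Bold text in the middle of a comment is deliberately ignored, since only a tag at the start of a turn is treated as a label for the whole turn.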

Separately, the authors "identified three further categories based on the user tags, which we see as relevant to ‘argumentation theory’. Each represents a relation between the turn and the topic of discussion or between the turn and another turn." These are Support, Attack (e.g. a comment that starts with a bolded "Disagree ...") and Neutral. It is not entirely clear to this reviewer whether the study tries to use indentation levels to identify the turn that is being responded to (adjacency pairs). A previous study had found that wrongly indented comments are very common on Wikipedia talk pages.

"Shortcut" means a shortened link to e.g. a policy or guideline (say WP:V for Verifiability). Around 7000 different shortcuts were found in the around 400,000 (of 20 million) instances that used them. The authors derive four categories from them, which they relate to framing theory: Writing quality, Verifiability and factual accuracy, Neutral point of view, and Dialogue management.

The paper explains these categories with several real-life examples (Figure 1, several of them adapted from this discussion on whether to merge the two articles Natural language processing and Computational Linguistics). E.g. the comment "Thanks for your answer" is classified as a Socializing act, with a Neutral relation, and in the Dialogue management frame.

The researchers proceed to generate a corpus of around 200,000 turns with corresponding categories (published as the "Webis-WikiDebate-18" corpus). They use it to train classifiers that try to predict a comment's categories based on its text (via commonly used text features such as word or n-gram frequencies, or "the number of characters, syllables, tokens, phrases, and sentences in a turn").
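
For readers unfamiliar with the kind of text features mentioned above, here is a small standard-library sketch that computes a few of them for a single turn. The tokenizer, the sentence splitter and the function names are simplifying assumptions, not the authors' feature extraction; the classifier itself is omitted, and syllable and phrase counts are left out because they need more linguistic tooling.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def turn_features(text):
    """Compute length counts plus unigram and bigram frequencies for a turn."""
    tokens = tokenize(text)
    # Naive sentence split on runs of ., ! or ?; abbreviations will be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "num_chars": len(text),
        "num_tokens": len(tokens),
        "num_sentences": len(sentences),
        "unigrams": Counter(tokens),
        "bigrams": Counter(ngrams(tokens, 2)),
    }
```

A dictionary like this could then be turned into a numeric vector and fed to any standard classifier.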

As "the ultimate goal of our research", the authors envisage a "tool [that] recommend[s] the best possible moves according to an effective strategy". They note that this will require further work to "study how to distinguish effective from ineffective discussions based on our model as well as how to learn from the strategies used in successful discussions, in order to predict the best next deliberative move in an ongoing discussion."

The authors argue that previous models of Wikipedia talk page discussions (by Ferschke et al., cf. our previous coverage: "Understanding collaboration-related dialog in Simple English Wikipedia", and by Viegas et al.) "obtain low coverage and/or are over-abstracted". Still, "the three classifiers [constructed in the present paper] achieved results that are comparable to the results of previous methods".

The paper also provides data on the frequency of categories as labeled in their corpus, which presumably is indicative (if not perfectly representative) of their prevalence in the entire six million Wikipedia discussions. For example, the most widely used frame is Verifiability and factual accuracy, followed by Neutral point of view and Dialogue management, with Writing quality being the least frequent one. (However, the precision of the labeling, as evaluated by an expert, varied, e.g. reaching only 0.51 for the "Writing quality" category but 0.89 for "Verifiability and factual accuracy".) (excerpt from Table 3 in the paper)

Briefly

 * See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.''

"Wikipedia: A Challenger's Best Friend? Utilising Information-seeking Behaviour Patterns to Predict US Congressional Elections"
From the abstract:  "past [efforts to use website traffic data to predict elections] have often overlooked the interaction between conventional election variables and information-seeking behaviour patterns. In this work, we aim to unify traditional and novel methodology by considering how information retrieval differs between incumbent and challenger campaigns, as well as the effect of perceived candidate viability and media coverage on Wikipedia pageviews predictive ability. In order to test our hypotheses, we use election data from United States Congressional (Senate and House) elections between 2016 and 2018. We demonstrate that Wikipedia data, as a proxy for information-seeking behaviour patterns, is particularly useful for predicting the success of well-funded challengers who are relatively less covered in the media." In an accompanying blog post, published on the day of the 2020 United States Senate elections, the authors use their model to forecast these elections: "Of the 35 seats up for re-election, we predict 16 Democrats and 19 Republican candidates to win".

"Using Wikipedia to Predict Election Outcomes: Online Behavior as a Predictor of Voting"
From the abstract: "... because traditional poll-based predictions are inherently undermined by self-reporting biases and the intention-behavior disconnect, we can expect that information-seeking trends on widely used social media [...] can help correct for some of this error and explain unique, additional variance in election results. [... We use] Wikipedia pageviews along with polling data in a synthesized model based on the results of the 2008, 2010, and 2012 US Senate general elections. Results show that Wikipedia pageviews data significantly add to the ability of poll- and fundamentals-based projections to predict election results up to 28 weeks prior to Election Day, and benefit predictions most at those early points, when poll-based predictions are weakest."

"Predicting 2016 US Presidential Election Polls with Online and Media Variables"
From the abstract:  "This chapter aims to determine if social media, Internet traffic [including Wikipedia pageviews], and traditional media data can be used to predict elections by searching for patterns between the data and poll numbers for 2016 US Republican and Democratic primaries. The results suggest that machine learning models with linear regression can produce quite accurate predictions ..."

"Extracting N-ary Facts from Wikipedia Table Clusters"
From the abstract:  "Tables in Wikipedia articles contain a wealth of knowledge that would be useful for many applications if it were structured in a more coherent, queryable form. An important problem is that many of such tables contain the same type of knowledge, but have different layouts and/or schemata. Moreover, some tables refer to entities that we can link to Knowledge Bases (KBs), while others do not. Finally, some tables express entity-attribute relations, while others [e.g. the chart positions table at Commodores_(album)] contain more complex n-ary relations. We propose a novel knowledge extraction technique that tackles these problems. [...] Our experiments over 1.5M Wikipedia tables show that our clustering can group many semantically similar tables. This leads to the extraction of many novel n-ary relations." The authors have published the code of their "takco" system, and note that "you can use it to extend Wikidata with information from Wikipedia tables."

"Neural Relation Extraction on Wikipedia Tables for Augmenting Knowledge Graphs"
From the abstract of this short conference paper (which shares its title with the lead author's master's thesis): "Knowledge Graph Augmentation is the task of adding missing facts to an incomplete knowledge graph to improve its effectiveness in applications such as web search and question answering. State-of-the-art methods rely on information extraction from running text, leaving rich sources of facts such as tables behind. We help close this gap with a neural method that uses contextual information surrounding a table in a Wikipedia article to extract relations between entities appearing in the same row of a table or between the entity of said article and entities appearing in the table." Related blog post by one of the authors: "Neural Relation Extraction on Wikipedia Tables"

"What if we had no Wikipedia? Domain-independent Term Extraction from a Large News Corpus"
From the abstract of this preprint by four researchers from IBM Research's AI division:  "what makes a term worthy of entering this edifice of knowledge, and having a page of its own in Wikipedia? To what extent is this a natural product of on-going human discourse and discussion rather than an idiosyncratic choice of Wikipedia editors? Specifically, we aim to identify such 'wiki-worthy' terms in a massive news corpus, and see if this can be done with no, or minimal, dependency on actual Wikipedia entries. We suggest a five-step pipeline for doing so, providing baseline results for all five, and the relevant datasets for benchmarking them."

"Notable Site Recognition using Deep Learning on Mobile and Crowd-sourced Imagery"
From the abstract (see also a related non-paywalled preprint): "... we design a mobile system that can automatically recognise sites of interest and project relevant information to a user that navigates the city [somewhat similar to Google Lens]. We build a collection of notable sites using Wikipedia and then exploit online services such as Google Images and Flickr to collect large collections of crowd-sourced imagery describing those sites. These images are then used to train minimal deep learning architectures that can be effectively deployed to dedicated applications on mobile devices. [...] We show how curating the training data through the application of a class-specific image de-noising method and the incorporation of information such as user location, orientation and attention patterns can allow for significant improvement in classification accuracy." The authors have published an iOS implementation of their app ("Aurama") on GitHub; sadly without releasing the code under a free license.

"VideoCutTool - Online Video Editor Tool for Wikimedia Commons"
From the abstract:  "The image CropTool allows users to crop images present on Wikimedia Commons without leaving the Wikimedia family of sites within a web environment. We implemented the same workflow for videos with the VideoCutTool. [...] This paper talks about the features of VideoCutTool and its implementation." See also a Google Summer of Code report about the same project.

"Indigenizing Wikipedia: Student Accountability to Native American Authors on the World's Largest Encyclopedia"
From this paper about a 2013 course project that involved e.g. the creation of the article Trace DeMeyer:  "Wikipedia changes to reflect not only changing facts, like shifting national borders; it has the potential, at least, to reflect shifting intellectual paradigms. In this respect, wikis are not unlike oral traditions, which in Native communities still carry enormous weight, even—interestingly—when it comes to preserving and transmitting literary history. There are writers who are revered within their tribes and beyond [...] and yet they have yet to attract attention from university-based scholars or mainstream publishers. Wikipedia offers one space in which writers with the skills, access and time can mediate between Native authors and powerful editors to improve the representation of Native culture and history. When I call this an exercise in student 'accountability,' [...] I mean our accountability to indigenous people’s own ideas of 'notability' and value: that we vet projects with them beforehand, that we consult actively with them as we try to represent their point of view, and perhaps even [...] that we decline to publish if the work doesn’t meet with their approval."