User:Tzusheng/sandbox/Validation Study

The validation study of Wikibench.

Background
This validation study is part of broader research that aims to help Wikipedians evaluate AI systems, such as ORES and LiftWing, by curating an evaluation dataset that Wikipedians can compare against the AI's predictions.

To support the curation of the dataset in the context of edit quality, we developed Wikibench, a wiki plug-in that enables Wikipedians to label whether an edit is damaging and whether it was saved in good or bad faith while patrolling recent changes. Wikibench also has several features that facilitate discussion when Wikipedians disagree on a label. During our pilot testing, Wikibench received overwhelmingly positive feedback from Wikipedians.

The purpose of this validation study
The last mile of this pilot testing is to validate whether the data and labels curated using Wikibench indeed reflect community consensus. Therefore, as shown in the table below, we sample a set of edits from Wikibench's dataset and invite experienced Wikipedians to collectively discuss and decide what the "correct" label of each edit should be. We'll compare the labels Wikipedians enter in this table to Wikibench's labels to see how many are identical and how many differ, and so gauge the effectiveness of Wikibench in terms of "correctness".
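As a rough illustration of this comparison, the minimal Python sketch below counts identical and differing labels for one label type. The edit IDs, the label values in the example, and the function name `agreement_summary` are hypothetical and are not taken from Wikibench or from the study's actual analysis.

```python
def agreement_summary(consensus_labels, wikibench_labels):
    """Compare two {edit_id: label} mappings on the edits they share."""
    # Only edits present in both mappings can be compared.
    shared = sorted(set(consensus_labels) & set(wikibench_labels))
    differing = [e for e in shared if consensus_labels[e] != wikibench_labels[e]]
    identical = len(shared) - len(differing)
    rate = identical / len(shared) if shared else 0.0
    return {
        "compared": len(shared),
        "identical": identical,
        "different": differing,  # edit IDs where the two labels disagree
        "agreement_rate": rate,
    }


# Hypothetical example data (edit damage labels only).
consensus = {101: "damaging", 102: "not damaging", 103: "damaging", 104: "not damaging"}
wikibench = {101: "damaging", 102: "not damaging", 103: "not damaging", 104: "not damaging"}

summary = agreement_summary(consensus, wikibench)
print(summary)
```

The same function could be run separately for edit damage and for user intent, since the two label types are recorded independently.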

How to participate
If you are an invited Wikipedian, please enter your label in the column under your username. For edit damage, the label is either damaging or not damaging. For user intent, the label is either good faith or bad faith.

After labeling all the edits, please be bold and enter or update the consensus in the consensus column, based on your own and others' labels. Whenever there are disagreements, please discuss them on the talk page. Note that you don't have to wait for others' labels before entering the consensus.

The study completes when the consensus columns are filled with labels that are reflective of the consensus of the five participating experienced Wikipedians. Note that consensus is not necessarily the majority label.

Study duration
This validation study lasts about a week, from July 14 to 22, 2023. It might be extended for another week if Wikipedians find the extension helpful for building consensus.

Compensation
We will provide US$90 to compensate you for your time, experience, and contribution. As discussed at the village pump and concluded here, this research is not the kind of paid editing that Wikipedians and the WMF are concerned with. However, you may still consider adding the following userbox for disclosure if you decide to accept our compensation.