Wikipedia:WikiProject Articles for creation/AfC Process Improvement May 2018/Copyvio solutions comparison report

We (as in the Growth Team) wanted to compare two approaches to detecting copyright violations ("copyvio"). One approach uses Google searches and is available as Earwig’s Copyvio Detector on Toolforge. The other approach uses iThenticate's API for plagiarism detection. Though the service is commonly referred to as Turnitin, this report uses the term iThenticate, as that is the formal name of the API. iThenticate is used on several Wikipedias through User:EranBot and the corresponding CopyPatrol interface.

Caveats
This investigation sought to help inform the Growth Team's decision on which of the two approaches to adopt for their development. It was both time- and resource-constrained, and aimed to discover potential outliers and/or obvious issues with either approach, rather than to undertake a detailed investigation into the performance of copyvio detection in the context of the English Wikipedia.

Conclusions

 * 1) We do not have evidence suggesting that one approach is superior to the other in its ability to predict copyvio.
 * 2) Instead, the two approaches might complement each other. We know from conversations with the community and from CopyPatrol usage data that both approaches are being used to identify and correct copyvio. Training and evaluating a hybrid approach is outside the scope of this project.
 * 3) iThenticate is much faster, on average.
 * 4) Copyvio detection requires a large amount of human judgement to determine whether a copyright violation took place and whether content needs to be deleted. False positives are frequent, but perhaps a necessity, as this appears to be a task that favours high recall (identifying all potential cases) over high precision (copyvio actually being present in every page where it is predicted).
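The recall-versus-precision trade-off in point 4 can be made concrete with a small sketch. The confusion counts below are hypothetical, for illustration only; they are not from the report's data:

```python
# Hypothetical confusion counts for a copyvio detector (illustration only):
# tp = flagged pages that truly contain copyvio, fp = flagged but clean,
# fn = copyvio pages the detector missed.
tp, fp, fn = 90, 60, 10

precision = tp / (tp + fp)  # how often a flag is correct
recall = tp / (tp + fn)     # how many real violations get flagged

# A high-recall setup tolerates many false positives (lower precision)
# so that human reviewers see nearly every potential violation.
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

With these numbers the detector catches 90% of real violations, at the cost of 40% of its flags being false alarms, which is the kind of trade-off the conclusion describes.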

Datasets
Our primary analysis uses 165 pages, consisting of 91 pages from the NewPages feed and 74 pages submitted to Articles for Creation. This dataset was gathered on or around 2018-07-15. For this dataset we gathered information on the overall scores from both approaches, the time taken to score the page, whether the page was deleted and if so for what reason, and whether the page had revisions deleted and if so for what reason. Information about page/revision deletions was gathered on 2018-08-17, thus identifying whether action was taken within roughly a month after our initial data gathering. We will refer to this as the "New Pages" dataset.

We also gathered a dataset of the most recent 250 cases on the English Wikipedia that used iThenticate from CopyPatrol's database on 2018-08-13. This dataset was limited to cases that had been checked by a human and flagged as either a true or false positive. This dataset contains several columns of data such as what revision was scored, the timestamp of the scoring, percentage of content copied and from what source, as well as the overall percentage of copied content. We will refer to this as the "iThenticate 250" dataset.

Findings


In 30% of the cases (49 of the 165 pages in our New Pages dataset), both approaches agreed that there was 0% copyvio in a page. These cases can be seen as a single dot in the bottom left corner of Figure 1.

When at least one of the approaches scored a page above 0%, the correlation between the two is relatively low ($$r_s=0.43$$, $$r_k=0.33$$). We used rank correlation because, while both approaches report scores on a 0–100% scale, the distributions of those scores are unknown, which means a traditional (Pearson) correlation coefficient carries little to no meaning. Figure 1 also visualises this lack of correlation: there are no clear patterns of agreement in the graph. This lack of correlation suggests that the two approaches might complement each other, but further study is needed to determine that.
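The rank-correlation computation can be sketched with SciPy. The two score lists below are invented for illustration; they are not the report's data:

```python
# Sketch of the rank-correlation computation, assuming two parallel lists
# of overall scores (values here are made up, not from the report's data).
from scipy.stats import spearmanr, kendalltau

earwig =      [0, 12, 35, 80,  5, 60, 22, 90]
ithenticate = [0, 40, 10, 75, 30, 55, 15, 95]

r_s, p_s = spearmanr(earwig, ithenticate)   # Spearman's rho
r_k, p_k = kendalltau(earwig, ithenticate)  # Kendall's tau

# Rank correlations only use the ordering of the scores, so they make no
# assumption about how the 0-100% scales are distributed.
print(f"r_s = {r_s:.2f}, r_k = {r_k:.2f}")
```

Because only the ranks matter, these coefficients are robust to the unknown (and likely very different) shapes of the two score distributions.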



Figure 2 visualises the cases in our iThenticate 250 dataset by showing the count of cases grouped by overall score into buckets with intervals of 5%. A key part of this graph is the large number of cases towards the right side of the graph, cases with a high overall score. Here we find most of the cases (137, or 54.8%), and we can also see that a large number of these get fixed (100, or 73%).
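The bucketing used in Figure 2 can be sketched as follows. The scores below are invented for illustration:

```python
# Sketch of grouping overall scores into 5%-wide buckets, as in Figure 2
# (the scores below are invented for illustration).
from collections import Counter

scores = [2, 4, 7, 13, 48, 52, 88, 91, 93, 97, 99]

# Bucket i covers [5*i, 5*i + 5); a score of exactly 100 falls in the
# last bucket, index 19, which covers [95, 100].
buckets = Counter(min(int(s // 5), 19) for s in scores)

counts = {f"{5*b}-{5*b+5}%": n for b, n in sorted(buckets.items())}
print(counts)
```

Plotting these per-bucket counts as a bar chart gives the kind of histogram shown in Figure 2, where the mass at the right-hand end corresponds to the high-scoring cases.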



Figure 3 visualises the iThenticate 250 dataset in a different way, showing the False Positive Rate (FPR) for each of the buckets. Here we see that false positives are relatively common, but also that the FPR is considerably lower towards the right end of the graph, which, as previously mentioned, is where most of the cases are found. FPR is not an issue if copyvio is the type of task where one prefers high recall (identifying all potential cases) so that copyvio can be fixed, rather than a lower FPR (in other words, higher precision) at the cost of missing some potential cases.
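The per-bucket FPR shown in Figure 3 can be sketched like this, assuming each case is a (score, flagged-as-false-positive) pair from the human review. The cases below are hypothetical:

```python
# Sketch of a per-bucket False Positive Rate computation; each case is a
# (score, is_false_positive) pair as flagged by human reviewers
# (the cases below are hypothetical, not the report's data).
cases = [(12, True), (14, False), (55, True), (58, False),
         (92, False), (94, False), (96, True), (99, False)]

def fpr_by_bucket(cases, width=5):
    """FPR = false positives / all cases, within each score bucket."""
    totals, fps = {}, {}
    for score, is_fp in cases:
        b = min(int(score // width), (100 // width) - 1)
        totals[b] = totals.get(b, 0) + 1
        fps[b] = fps.get(b, 0) + is_fp
    return {b: fps[b] / totals[b] for b in sorted(totals)}

print(fpr_by_bucket(cases))
```

A bucket's FPR dropping towards the high end, as in Figure 3, means that high-scoring cases are more likely to be genuine copyvio.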

Lastly, we did a statistical analysis of the response times of both approaches using the timing data in our New Pages dataset. Here we found that iThenticate is significantly faster: its median response time is 4.83s, while that of Earwig's Copyvio Detector is 11.95s. We confirmed this to be a statistically significant difference using a Mann-Whitney U test, which accounts for skewness and differences in distributions ($$W = 24714, p \ll 0.001$$).
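The response-time comparison can be sketched with SciPy's Mann-Whitney U implementation. The timing values below are invented for illustration; the report's actual data gave $$W = 24714, p \ll 0.001$$:

```python
# Sketch of the response-time comparison, assuming two lists of response
# times in seconds (the values are invented, not the report's data).
from scipy.stats import mannwhitneyu

ithenticate_times = [3.9, 4.5, 4.8, 5.0, 5.3, 6.1]
earwig_times = [9.8, 11.2, 11.9, 12.4, 14.0, 15.5]

# Mann-Whitney U compares the two samples by ranks, without assuming
# normality, which suits skewed response-time distributions.
stat, p = mannwhitneyu(ithenticate_times, earwig_times,
                       alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
```

Since every invented iThenticate time is below every Earwig time, the U statistic here is 0, the most extreme possible value for samples of this size.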