Wikipedia:Wikipedia Signpost/2019-03-31/In focus

A new project to find unreliable sources cited by Wikipedia
A few years back, while working on WikiProject Academic Journals' Journals Cited by Wikipedia (JCW) compilation, I realized we could harness the power of bots to identify a variety of unreliable sources that are cited by Wikipedia. I've dubbed the project The Wikipedia SourceWatch (or just The SourceWatch), as it aims to identify and combat unreliable sourcing, much as Quackwatch aims to identify and combat medical quackery, and Retraction Watch reports retracted research in scientific journals.

For context, the JCW compilation takes the various journal parameters of citation templates found in articles and compiles them into various lists. For example, in a citation whose journal parameter is set to Nature, a bot would find Nature and then report it at WP:JCW/N7. The compilation is organized in many ways (alphabetically, by citation count, and so on) and is typically updated a few days after the 1st and 20th of each month, when database dumps are generated. Those who want a bit of history and technical details can check the main JCW page or this talk I gave in Montreal for Wikimania 2017.
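
As a rough illustration of what such a compilation involves, the sketch below pulls the `journal` parameter out of citation templates in a piece of wikitext and tallies the values. It is not the bot's actual code: the regular expression, the function name `count_journals`, and the sample wikitext are all assumptions made for this example, and real template parsing has to deal with nested templates and parameter aliases that this ignores.

```python
import re
from collections import Counter

# Hypothetical pattern: matches "|journal=Value" inside a template and
# stops at the next pipe or closing brace. Nested templates and piped
# wikilinks inside the value are deliberately not handled.
JOURNAL_PARAM = re.compile(r"\|\s*journal\s*=\s*([^|}]+)")

def count_journals(wikitext: str) -> Counter:
    """Tally the journal= values found in a blob of wikitext."""
    names = (m.group(1).strip() for m in JOURNAL_PARAM.finditer(wikitext))
    return Counter(name for name in names if name)

# Made-up sample wikitext, for illustration only.
sample = (
    "{{cite journal |last=Doe |title=An example |journal=Nature |year=2001}} "
    "{{cite journal |title=Another example |journal = Nature}} "
    "{{cite journal |title=Third example |journal=Science |year=2005}}"
)
counts = count_journals(sample)
print(counts.most_common())
```

Sorting the resulting counter alphabetically or by count corresponds loosely to the different orderings of the compilation mentioned above.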

The idea of using the JCW compilation to fight unreliable sourcing stewed in my mind for a while, until I finally decided to take action in August 2018. I contacted JLaTondre, who runs the bot, and together we began laying down the first bricks of The SourceWatch. The bot would look for the various journal parameters of citation templates and cross-check them against Beall's List, a list maintained by librarian Jeffrey Beall to identify predatory journals and publishers until it was taken down in 2017. Beall's List is not perfect by any means, especially if you want a list that only identifies journals that are definitely predatory, rather than journals that range from questionable to definitely predatory, but it was a good start. Since there are other efforts beyond Beall's List to identify unreliable sources in general, I expanded The SourceWatch to draw from a variety of additional sources, including circular references to Wikipedia, deprecated or generally unreliable sources, journals lying about being included in the Directory of Open Access Journals, Quackwatch's list of non-recommended periodicals, self-published sources and vanity publications, and sources from notoriously unreliable fields (which are broadly speaking the subcategories of Category:Pseudo-scholarship and a few others). While journals from Cabell's blacklist could not be included as of writing due to the exorbitant paywall, they might get included in the future.

Two main ways of using The SourceWatch exist:
 * 1) Browsing WP:SOURCEWATCH directly. If 5 or fewer articles cite a specific publication, the links to these articles will be given. If more than 5 articles cite it, you will have to search Wikipedia to find where it is cited. This is useful to find articles which need to be updated with reliable sources, or where unreliable sources need to be removed.
 * 2) Using Special:WhatLinksHere on an article and looking for links from WikiProject Academic Journals/Journals cited by Wikipedia/Questionable1 (or .../Questionable2, .../Questionable3, ...). This won't directly tell you which potentially unreliable publication is cited, but it will let you know that some potentially unreliable publication is cited. This is useful when you edit an article and want to make sure you are not citing bad sources. However, this method only works if 5 or fewer articles cite a specific publication.

For example, as of writing, the article on Heinrich Albert cites Deutsche Allgemeine Zeitung, a German newspaper published from 1861 to 1945, which is categorized in Category:Propaganda > Category:Nazi propaganda > Category:Nazi newspapers. This does not mean that citing Deutsche Allgemeine Zeitung is necessarily inappropriate (the newspaper did not exclusively publish Nazi propaganda over the 84 years of its existence), but it is good to verify that we are not citing Nazi propaganda inappropriately. This citation can be found either by browsing WP:SOURCEWATCH, which features Deutsche Allgemeine Zeitung under the 'Propaganda' category, or through Special:WhatLinksHere/Heinrich Albert, which shows a link from WikiProject Academic Journals/Journals cited by Wikipedia/Questionable1.
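
The five-article threshold behind both methods can be sketched as follows. This is only a model of the rule as described here, not the bot's implementation: the function names, the data layout, and the publication and article names are hypothetical.

```python
MAX_LISTED = 5  # threshold stated in the text

def listed_articles(citing_articles):
    """Return the articles to link for one publication, or [] if too many cite it."""
    unique = sorted(set(citing_articles))
    return unique if len(unique) <= MAX_LISTED else []

def flagged_publications(article, citations_by_publication):
    """Publications whose listing links to `article` (what method 2 relies on)."""
    return sorted(
        pub
        for pub, articles in citations_by_publication.items()
        if article in listed_articles(articles)
    )

# Hypothetical data: publication name -> articles citing it.
citations = {
    "Rarely Cited Journal": ["Article A", "Article B"],
    "Widely Cited Journal": [f"Article {n}" for n in range(1, 8)] + ["Article A"],
}
print(flagged_publications("Article A", citations))
```

Because the widely cited publication has more than five citing articles, no links are generated for it, which is why a WhatLinksHere check on "Article A" would surface only the rarely cited one.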

Of course, due to the inherently subjective nature of what constitutes an unreliable source, The SourceWatch includes sources that range from questionable to definitely unreliable, but it also has a few false positives. For the questionable we have, for example, journals and publishers which may merely engage in questionable practices such as sending spam emails to researchers, but which nonetheless remain committed to scientific and academic standards. For the definitely unreliable, we have journals that literally accept anything, even SCIgen papers, if you pay them. For false positives, we have hijacked journals, which are fraudulent publications designed to have identical or similar names to established publications. Other false positives can include members of categories such as Category:Paranormal magazines, which may set out to debunk hoaxes and nonsensical claims, rather than perpetuate them. Yet another cause of false positives is that the algorithm used to find those unreliable sources is not perfect. It is designed to find typos and similar names (Journal of Science vs Journal of Sciences), but will sometimes pick up journals that are obviously (to humans) unrelated (African Journal of ... vs American Journal of ...). However, false positives can be manually identified, and the compilation will be updated accordingly in future bot runs. And lastly, The SourceWatch is heavily based on third party lists and will to an extent reflect the opinion of those lists' compilers, which could be inaccurate or outdated in certain cases.
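
To see why name matching produces this kind of false positive, here is a minimal sketch using Python's standard `difflib` as a stand-in. The article does not say which matching technique the bot uses, so the similarity measure, the 0.9 cutoff, and the "Widgets" journal titles are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # assumed cutoff, chosen for this example only

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two titles (0.0 to 1.0)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def looks_like(cited: str, listed: str, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """True if a cited title is close enough to a listed title to be flagged."""
    return similarity(cited, listed) >= threshold

# A plausible typo or variant: flagging this is the intended behaviour.
typo_pair = ("Journal of Science", "Journal of Sciences")
# Made-up titles that differ by a few letters but are unrelated to a human reader.
unrelated_pair = ("African Journal of Widgets", "American Journal of Widgets")

for cited, listed in (typo_pair, unrelated_pair):
    print(f"{cited!r} vs {listed!r}: flagged={looks_like(cited, listed)}")
```

Both pairs clear the cutoff under this measure, so a purely string-based matcher cannot tell the typo from the unrelated title. That is the gap the manual identification of false positives fills.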

I want to emphasize here just how much work JLaTondre has done on this and JCW over the nearly 10 years of the compilation. The original JCW compilation and The SourceWatch may be my ideas, but JLaTondre is the one responsible for the heavy lifting and making them a reality since 2011. I must also acknowledge the contributions of several people: Ronhjones for their help managing the configuration pages, Tokenzero for their help with the creation of several redirects useful to The SourceWatch, as well as the many people at Village Pump (technical) who have helped over the years with various matters, Galobtter in particular. Hundreds of citations were cleaned up using The SourceWatch during development, but it was only known to a handful of people due to its unpolished state. The compilation was at times plagued with a staggering number of false positives and a poor presentation structure. Now, after several iterations, The SourceWatch is something that should be usable by the community at large. While there likely is still room for improvements and debates on what should or should not be listed, one no longer needs to be familiar with the intricate workings of the bot to make sense of The SourceWatch lists, or spend months playing Whac-A-Mole against false positives.

The SourceWatch does not definitively answer whether a source is unreliable. Even if a source is unreliable, it does not definitively answer whether it is appropriate to cite it either. However, The SourceWatch is a good starting point to find unreliable sources, at least those which make use of citation templates. Once they are found, the community can then critically evaluate whether or not they should be cited, leading to a better, more reliable Wikipedia. Whether a source should be cited can be discussed at the reliable sources noticeboard, or alternatively at a relevant WikiProject's talk page, such as WikiProject Medicine for medically dubious sources, or WikiProject Physics for sources claiming to have proven aether theories.

Suggestions on how to improve The Wikipedia SourceWatch can be made at WT:SOURCEWATCH. Particularly welcome are suggestions for additional sources that The SourceWatch could draw from, such as lists of journals lying about being indexed by reputable databases. Other efforts to identify and prevent unreliable sourcing can be found in the "other efforts" section of the WP:JCW navbox.
