User:SuperHamster/CiteUnseen

Cite Unseen is a user script that adds categorical icons to Wikipedia citations, giving readers and editors a quick initial evaluation of citations at a glance. The icons indicate the nature and reliability of sources and help identify sources that may be problematic or should be used with caution (the key word is "may"; see the usage guide below).

Cite Unseen's categorization dataset currently holds over 3,400 domains in 20 categories. These categories include:
 * Perennial sources list statuses (generally reliable; marginally reliable; generally unreliable; deprecated; blacklisted)
 * Advocacy groups; books; blogs; user-generated news; editable sites; state media; news; opinion pieces; press releases; satire; social media sites; sponsored articles; tabloids; and TV and radio programs
 * Predatory journals listed on the predatory source list

Cite Unseen was first built at CredCon in November 2018 and was jointly developed by Kevin Payravi (SuperHamster) and Josh Lim (Sky Harbor), with support from the Credibility Coalition and the Knowledge Graph Working Group. Development continued at Wikimedia Hackathon 2019.

Installation
The script is located at User:SuperHamster/CiteUnseen.js and can be enabled by logged-in users. Once logged in, you can add the script to your Wikipedia browsing experience by editing your common.js file and adding the following line:
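
A minimal sketch of such a line, assuming the legacy `importScript` helper that MediaWiki makes available on personal script pages. The choice of loader call is an assumption; the script path is the one given above.

```javascript
// Assumption: importScript() is the MediaWiki helper for loading user scripts.
// It only exists inside a wiki page, not in a standalone JavaScript runtime.
importScript('User:SuperHamster/CiteUnseen.js');
```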

Cite Unseen will automatically run whenever you open a Wikipedia page.

Before using, please read the usage guidelines below. It's particularly important to keep in mind that while Cite Unseen is here to guide you, it does not evaluate context and should not justify editing decisions.

Configuration
You can configure Cite Unseen by copying the configuration code into your CiteUnseen-Rules.js page and adjusting it as needed.

The category rules specify which icons should be displayed. By default, all icon types are shown except the one for sources considered generally reliable per the perennial sources list; that icon is hidden to reduce clutter.

The cite_unseen_domain_ignore rules let you remove a domain from a category. Domains should be formatted as "example.com". As an example: while Cite Unseen categorizes ResearchGate links as generally unreliable per WP:RSP, some users may wish to disable this, as ResearchGate links are often used as valid open access links for reliable journals. To do this, you can use:

cite_unseen_domain_ignore = { "rspGenerallyUnreliable": ["researchgate.net"] }

The cite_unseen_additional_domains rules let you add domains by category. Domains should be formatted as "example.com". For example, if you wanted to categorize Wikipedia as a social media site for some reason, you could do the following:

cite_unseen_additional_domains = { "social": ["wikipedia.org"] }

The cite_unseen_additional_strings rules let you add plain strings by category. For example, if you wanted to assume that all URLs that contain "/w/" are a wiki, you could do the following:

cite_unseen_additional_strings = { "editable": ["/w/"] }
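
To make the effect of these rules concrete, here is a rough sketch of how ignore and addition rules could be combined with a category map. The three rule objects are the examples from the text above; `applyRules` and `baseDomains` are hypothetical names invented for illustration, and this is not necessarily how Cite Unseen itself merges its rules.

```javascript
// Rule objects as described above, using the example domains from the text.
const cite_unseen_domain_ignore = { rspGenerallyUnreliable: ["researchgate.net"] };
const cite_unseen_additional_domains = { social: ["wikipedia.org"] };
const cite_unseen_additional_strings = { editable: ["/w/"] };

// Hypothetical helper (not part of Cite Unseen): applies ignore and addition
// rules to a { category: [entries] } map and returns a new map.
function applyRules(base, ignore, additions) {
  const result = {};
  const categories = new Set([...Object.keys(base), ...Object.keys(additions)]);
  for (const category of categories) {
    const ignored = new Set(ignore[category] || []);
    const kept = (base[category] || []).filter((entry) => !ignored.has(entry));
    // A Set drops duplicates when an added entry is already present.
    result[category] = [...new Set([...kept, ...(additions[category] || [])])];
  }
  return result;
}

// Toy dataset for illustration only; the real dataset is far larger.
const baseDomains = {
  rspGenerallyUnreliable: ["researchgate.net", "unreliable.example"],
  social: ["social.example"],
};

const effectiveDomains = applyRules(
  baseDomains,
  cite_unseen_domain_ignore,
  cite_unseen_additional_domains
);
console.log(effectiveDomains);
```

In this sketch, researchgate.net is dropped from the generally unreliable list and wikipedia.org is appended to the social list. The same helper could be applied to string rules such as `cite_unseen_additional_strings`.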

Usage
Once installed, Cite Unseen will automatically analyze and annotate references you come across. When it finds a match in its categorization dataset, it will add a categorical icon (refer to the chart below). You can hover over an icon to get more details about the categorization.

Important points to keep in mind while using Cite Unseen:
 * Context matters. Sources that are considered generally unreliable can still have valid use. For example, while we typically avoid citing social media, social media posts may still be used for uncontroversial self-descriptions. And while we typically try to avoid self-published blogs and other user-generated content, they may still be acceptable when authored by established subject-matter experts (see WP:SPS for more).
 * Evaluate. The point of Cite Unseen is to highlight the nature of sources, and to prompt you to think about potential concerns with a source. Just because a source has a concerning mark does not automatically mean it is being used inappropriately. You should never justify removing or adding a source solely because of information that Cite Unseen provides; you need to do your own homework as well.
 * It does not cover everything. There is an endless trove of resources out there, and we can't categorize all of them. You'll find many citations that Cite Unseen won't mark up; this only means that the source either (a) does not fit in an existing category or (b) more commonly, simply hasn't been categorized yet.
 * It is not always right. Cite Unseen looks at citation types and does string-matching against URLs. While generally successful, it's possible for Cite Unseen to misidentify a source.
 * Sometimes reliable sources are hosted on an unreliable site. For example, editors citing a book may link to its listing on Amazon.com, which is classified as generally unreliable. This will cause the citation to be marked as generally unreliable even if the book itself is fine. Situations like these are something to keep in mind while investigating the usage of a source.
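
The misidentification risk from string matching can be illustrated with a small sketch. The rule and both URLs are hypothetical, and `matchesAnyString` is an illustrative helper rather than Cite Unseen's actual function.

```javascript
// Hypothetical string rule: treat any URL containing "/w/" as an editable site.
const editableStrings = ["/w/"];

// Returns true when the URL contains at least one of the given substrings.
function matchesAnyString(url, strings) {
  return strings.some((s) => url.includes(s));
}

const wikiUrl = "https://wiki.example.org/w/index.php?title=Foo";
const newsUrl = "https://news.example.com/w/2019/report.html";

console.log(matchesAnyString(wikiUrl, editableStrings)); // true (intended match)
console.log(matchesAnyString(newsUrl, editableStrings)); // true (false positive)
```

Both URLs match, although only the first is a wiki. This is why an icon should be read as a prompt to look closer, not as a verdict.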

Classifications
Cite Unseen classifies sources into 20 categories.

Contributing
We're always looking to expand and tune our categorizations. Please place any questions or ideas on the talk page.

If you're interested in touching the code itself:
 * GitHub Repository
 * categorized-domains.json (for categorized domains)
 * categorized-strings.json (for categorized non-domain strings)
 * Phabricator Board

Next steps
Some of the next big goals for the project:
 * Expand our lists of identified domains, in particular news sites and government-controlled sources
 * WikiProject Albums/Sources (WP:A/S) (music-related sources)
 * WikiProject Video games/Sources (WP:VG/S) (video game sources)
 * The Wikipedia CiteWatch (WP:CITEWATCH)
 * New page patrol source guide (WP:NPPSG) (sources encountered by new page patrollers)
 * Support multiple language versions of Wikipedia

Technical implementation
Cite Unseen performs string matching on URLs and checks which type of citation template is used, in order to identify the kind of work and any potential ideological leanings.

Cite Unseen is implemented in JavaScript. When Cite Unseen is run, it does the following:
 * Iterates through every citation in a given Wikipedia article and pulls URLs.
 * Checks each URL against a pre-defined list of domains and strings that are categorized by nature (biased, press, news, opinion piece, etc.).
 * Injects icons next to citations accordingly.
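
The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Cite Unseen source: the dataset contents, `hostMatchesDomain`, and `categorizeUrl` are hypothetical, and the array of URLs stands in for the citations that the real script would read from the page before injecting icons.

```javascript
// Toy data; category keys and entries are illustrative, not the real dataset.
const categorizedDomains = {
  social: ["social.example"],
  news: ["news.example"],
};
const categorizedStrings = {
  editable: ["/w/"],
};

// True when the hostname equals the domain or is a subdomain of it.
function hostMatchesDomain(hostname, domain) {
  return hostname === domain || hostname.endsWith("." + domain);
}

// Returns the list of categories whose domains or strings match the URL.
function categorizeUrl(url) {
  let hostname;
  try {
    hostname = new URL(url).hostname.toLowerCase();
  } catch (err) {
    return []; // not a parseable absolute URL
  }
  const found = new Set();
  for (const [category, domains] of Object.entries(categorizedDomains)) {
    if (domains.some((d) => hostMatchesDomain(hostname, d))) found.add(category);
  }
  for (const [category, strings] of Object.entries(categorizedStrings)) {
    if (strings.some((s) => url.includes(s))) found.add(category);
  }
  return [...found];
}

// Stand-in for the citation URLs pulled from an article.
const citationUrls = [
  "https://www.news.example/story",
  "https://social.example/w/profile",
  "https://unlisted.example/page",
];

// Stand-in for icon injection: pair each URL with its matched categories.
const annotations = citationUrls.map((url) => ({
  url,
  categories: categorizeUrl(url),
}));
console.log(annotations);
```

Matching on the parsed hostname rather than the raw URL keeps a domain rule from firing on a path or query string that merely mentions the domain; plain string rules remain substring checks on the whole URL.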