User:Rjwilmsi/CiteCompletion

=Summary=

What CiteCompletion is
CiteCompletion is a script that completes fields within citations to common English-language news sites on the English Wikipedia. It works by taking the news article URL from the Wikipedia article page, looking up the news page and extracting the missing details of the news article based on per-site rules.

It is written by Rjwilmsi and normally run under the account of RjwilmsiBot as a bot task.

It operates only on sites that it has been specifically configured to work on; see the supported sites list below.

It can complete the following fields in citation templates such as cite news, cite web and citation:
 * title
 * date
 * author (using last, first etc.)
 * location
 * accessdate
 * work

It will also tag dead links with a dead link tag (attributed to RjwilmsiBot) if they are not already tagged, and set the dead-link parameter to yes for those within citation templates. This too applies only to sites on the supported sites list below.

What CiteCompletion is not

 * It does not modify or update fields where they are already set.
 * It does not handle non-English news sites, nor sites not listed in the supported sites list below.
 * It does not modify non-templated manually formatted citations (because it cannot interpret the existing data so may overwrite user-set data).
 * It has only been designed for use on the English Wikipedia; it may not work anywhere else.

Compatibility

 * CiteCompletion is fully compatible with the Harvard referencing system.
 * Authors/titles with accented characters are supported.

Availability
CiteCompletion is a Custom module for AWB written in C#. In the future it may be made generally available as a Plugin for AWB.

=Detail of functionality=

Supported citation types

 * Citation templates referencing a URL (e.g. cite news, cite web, citation and cite journal).
 * Bare URLs when within ref tags.
 * Bare URLs with a bot-generated title when within ref tags.

Supported template fields
CiteCompletion can complete the following fields:
 * title
 * date
 * author (using last, first etc.)
 * location
 * accessdate
 * work

Assess citations
Each of the Supported citation types on the Wikipedia article is assessed for a URL matching one of the Supported sites. If a match is found a check is made to see if one or more of the title, date or author fields is not specified. If one or more of the fields are missing, the HTML source of the URL is fetched.
 * Where the citation matches but it is not templated, it is converted to use cite news.
 * Where the citation uses cite web it is converted to use cite news.
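
The assessment step could be sketched roughly as follows in Python. CiteCompletion itself is a C# module for AWB, so this is an illustration rather than its actual code. The names SUPPORTED_SITES, is_supported, parse_params and needs_fetch are hypothetical, the site list is a small subset, and the parser only copes with flat templates (no nested templates or piped links).

```python
from urllib.parse import urlparse

# Hypothetical subset of the supported sites list.
SUPPORTED_SITES = {"news.bbc.co.uk", "nytimes.com", "guardian.co.uk"}


def is_supported(url):
    """Return True if the URL's host is a supported site or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == site or host.endswith("." + site) for site in SUPPORTED_SITES)


def parse_params(template):
    """Rough parser: split a flat {{cite ...}} template into a dict of parameters."""
    inner = template.strip()[2:-2]
    params = {}
    for part in inner.split("|")[1:]:
        name, sep, value = part.partition("=")
        if sep:
            params[name.strip()] = value.strip()
    return params


def needs_fetch(template):
    """True if the citation is for a supported site and lacks title, date or author."""
    params = parse_params(template)
    if not is_supported(params.get("url", "")):
        return False
    has_author = bool(params.get("author") or params.get("last") or params.get("last1"))
    return not params.get("title") or not params.get("date") or not has_author
```

Only when needs_fetch returns True would the page's HTML be requested, which keeps network traffic down for citations that are already complete.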

Parse HTML source
The HTML source is then parsed using the per-site rules. Supported parsing methods are:
 * HTML meta tag content.
 * HTML script numbered property (s.prop).
 * HTML div id/span class/p class.
 * Custom regex (matching a span, heading or script value etc.).
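
As an illustration of the first method (meta tag content), the following Python sketch collects meta tags with the standard library's html.parser. The class MetaCollector and the function meta_value are hypothetical names, and this is not CiteCompletion's own C# code.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or attr.get("property")
        if name and attr.get("content") is not None:
            # Keep the first occurrence of each name; names are compared case-insensitively.
            self.meta.setdefault(name.lower(), attr["content"])


def meta_value(html_source, name):
    """Return the content of the named meta tag, or None if it is absent."""
    collector = MetaCollector()
    collector.feed(html_source)
    return collector.meta.get(name.lower())
```

For example, meta_value(page, "OriginalPublicationDate") would return the raw date string for a page that carries that meta tag, ready for the tidy-up step.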

Insert parameter values
When a match is found the source match is tidied up:
 * HTML-escaped characters are converted to Unicode.
 * Quotes are trimmed from titles (not quotes within the title).
 * Smart quotes are converted to straight quotes.
 * All UPPERCASE or lowercase titles and author names are converted to Title Case.
 * Newlines are replaced with spaces.
 * Locations and job titles are removed from author names.
 * Authors are split to "Lastname, Firstname" format.
 * Publication dates are stripped of timestamps and days of the week and converted to the predominant format used in the Wikipedia article (International, American or ISO, falling back to ISO if there is no predominant format).
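
Several of the title clean-up rules above can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the tool's C# implementation: tidy_title is a hypothetical name, only one pair of wrapping quotes is trimmed, and the case conversion simply capitalises each whitespace-separated word.

```python
import html
import re
import string

# Map curly ("smart") quotes to their straight equivalents.
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def tidy_title(raw):
    text = html.unescape(raw)                 # HTML-escaped characters to Unicode
    text = text.translate(SMART_QUOTES)       # smart quotes to straight quotes
    text = re.sub(r"\s+", " ", text).strip()  # newlines and whitespace runs to single spaces
    # Trim one pair of quotes wrapping the whole title, leaving inner quotes alone.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1].strip()
    # Convert all-uppercase or all-lowercase titles to Title Case.
    if text.isupper() or text.islower():
        text = string.capwords(text)
    return text
```

string.capwords is used rather than str.title because str.title would also capitalise the letter after an apostrophe.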

The tidied-up value is then appended to the citation. Existing values are never updated; values are only added where the field is missing:
 * title is set as found.
 * date is set as found.
 * author is set using last, first or last1 and last2 etc. for multiple authors.
 * location is set from the XML settings if relevant.
 * accessdate is set to the current date.
 * work is set from the XML settings if relevant (first checks that publisher etc. is not set).
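
The add-only behaviour can be illustrated with a short Python sketch. The function names are hypothetical, and it assumes a flat template with no nested templates. A parameter that is present but empty is treated as missing here, and the new value is simply appended rather than filled in place.

```python
def existing_params(template):
    """Return the names of parameters that already have a non-empty value."""
    inner = template.strip()[2:-2]
    names = set()
    for part in inner.split("|")[1:]:
        name, sep, value = part.partition("=")
        if sep and value.strip():
            names.add(name.strip())
    return names


def add_missing(template, found):
    """Append found values to the template without overwriting fields already set.

    Returns the new template text and the list of field names that were added.
    """
    present = existing_params(template)
    body = template.strip()[:-2].rstrip()
    added = []
    for name, value in found.items():
        if name not in present and value:
            body += " |" + name + "=" + value
            added.append(name)
    return body + "}}", added
```

The returned list of added field names is what a caller could use to build the edit summary.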

Date format
The date format used for inserted dates (both date and accessdate) is one of "2011-01-15", "15 January 2011" or "January 15, 2011". The decision is:
 * Follow a use dmy dates or use mdy dates template if one is present.
 * Otherwise, count existing date usage in the article and use the majority format.
 * Otherwise, if there is no majority, default to the "2011-01-15" format (this avoids any accusation of American/International bias).
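
The decision above can be sketched in Python as follows. This is an assumption-laden illustration, not the tool's code: choose_format and format_date are hypothetical names, "majority" is interpreted as more than half of all dates found, and dates are detected with simple regular expressions over the article text.

```python
import re
from datetime import date

MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
MONTH_NAMES = MONTHS.split("|")

PATTERNS = {
    "dmy": re.compile(r"\b\d{1,2} (?:%s) \d{4}\b" % MONTHS),
    "mdy": re.compile(r"\b(?:%s) \d{1,2}, \d{4}\b" % MONTHS),
    "iso": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def choose_format(wikitext):
    """Pick 'dmy', 'mdy' or 'iso' for dates inserted into this article."""
    lowered = wikitext.lower()
    if "{{use dmy dates" in lowered:
        return "dmy"
    if "{{use mdy dates" in lowered:
        return "mdy"
    counts = {name: len(rx.findall(wikitext)) for name, rx in PATTERNS.items()}
    total = sum(counts.values())
    for name, n in counts.items():
        if n * 2 > total:  # strict majority of all dates found
            return name
    return "iso"  # no dates, or no majority


def format_date(d, style):
    """Render a date in the chosen style, using English month names."""
    month = MONTH_NAMES[d.month - 1]
    if style == "dmy":
        return "%d %s %d" % (d.day, month, d.year)
    if style == "mdy":
        return "%s %d, %d" % (month, d.day, d.year)
    return d.isoformat()
```

Month names come from a fixed table rather than strftime so that the output does not depend on the system locale.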

Completion

 * An edit summary is generated with counts of how many fields were completed.
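
A summary with per-field counts might be built along these lines. The wording of the summary and the function name edit_summary are assumptions for illustration only; the actual summary text used by the bot is not documented here.

```python
from collections import Counter


def edit_summary(added_fields):
    """Build a summary such as 'CiteCompletion: completed date (1), title (2)'."""
    counts = Counter(added_fields)
    if not counts:
        return ""  # nothing was completed, so there is nothing to summarise
    parts = ["%s (%d)" % (name, counts[name]) for name in sorted(counts)]
    return "CiteCompletion: completed " + ", ".join(parts)
```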

Per-site rules
For each supported site a set of rules is available in an XML settings file. The rules determine how to extract the template fields for each supported news site (e.g. for news.bbc.co.uk the date is stored under the OriginalPublicationDate meta value).

Supported sites

 * news.bbc.co.uk
 * nytimes.com
 * time.com
 * guardian.co.uk
 * timesonline.co.uk
 * independent.co.uk
 * telegraph.co.uk
 * thestar.com
 * washingtonpost.com
 * cnn.com
 * usatoday.com
 * latimes.com
 * newsbank.com
 * variety.com
 * news.com.au
 * smh.com.au
 * reuters.com
 * findarticles.com


 * sfgate.com
 * theage.com.au
 * pqarchiver.com
 * boston.com
 * accessmylibrary.com
 * post-gazette.com
 * cbc.ca
 * seattletimes.nwsource.com
 * wsj.com
 * foxnews.com
 * chicagotribune.com
 * dailymail.co.uk
 * cbsnews.com
 * thesun.co.uk
 * economist.com
 * indiatimes.com
 * hindu.com


 * bizjournals.com
 * forbes.com
 * denverpost.com
 * theglobeandmail.com
 * scotsman.com
 * huffingtonpost.com
 * nzherald.co.nz
 * independent.ie
 * irishtimes.com
 * hollywoodreporter.com
 * rte.ie
 * oregonlive.com
 * seattlepi.com
 * ew.com
 * wired.com
 * pcmag.com

Others will be added over time.

Settings file
CiteCompletion uses an XML settings file of per-site rules. This file is loaded into memory once per session. Notes on the file format:
 * Where there are multiple derivations for the same field, these are separated by commas.
 * Where the derivation is a custom regular expression, the derivation starts with character '@'.
 * Not all sites have rules for all fields (e.g. news.bbc.co.uk does not specify the article authors).
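
To make the notes above concrete, here is a Python sketch that loads a settings file of this general kind. The XML layout shown (sites, site, domain and the per-field elements) is entirely hypothetical, since the real file format is not reproduced on this page. Only the comma separation, the '@' prefix for regular expressions and the OriginalPublicationDate example come from the description.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical settings file; element and attribute names are assumptions.
SETTINGS_XML = """
<sites>
  <site domain="news.bbc.co.uk">
    <date>OriginalPublicationDate,@&lt;span class="date"&gt;(.*?)&lt;/span&gt;</date>
  </site>
</sites>
"""


def parse_derivation(text):
    """A derivation starting with '@' is a custom regex; anything else is a named value."""
    text = text.strip()
    if text.startswith("@"):
        return ("regex", re.compile(text[1:]))
    return ("named", text)


def load_rules(xml_text):
    """Return {domain: {field: [derivation, ...]}} from the settings XML."""
    rules = {}
    for site in ET.fromstring(xml_text).findall("site"):
        fields = {}
        for field in site:
            # Multiple derivations for one field are separated by commas.
            fields[field.tag] = [parse_derivation(d) for d in (field.text or "").split(",")]
        rules[site.get("domain")] = fields
    return rules
```

Splitting on commas means a custom regex in this sketch cannot itself contain a comma; a real implementation would need some escaping rule for that case.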

=Issues & limitations=

Frequent

 * Not all fields are found from all supported sites. CiteCompletion will be improved over time to correctly extract more data.
 * Only the Supported sites are supported. CiteCompletion will be improved over time to support more sites.

Infrequent

 * Authors with multiple first names or multiple surnames are not supported: for 'Name Anothername Surname' the script cannot determine whether Anothername belongs to the first name or the last name. Currently such authors are ignored; solutions for including them are under investigation.
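
The limitation can be seen in a minimal Python sketch of author splitting. The function names are hypothetical, and the sketch assumes that an ambiguous author is skipped individually while any other authors of the same article are still used; whether the tool behaves that way is not stated here.

```python
def split_author(name):
    """Split 'Firstname Lastname' into (last, first); return None if ambiguous."""
    parts = name.split()
    if len(parts) != 2:
        return None  # e.g. 'Name Anothername Surname' cannot be split reliably
    first, last = parts
    return last, first


def author_params(names):
    """Build last/first (one author) or last1/first1, last2/first2 ... parameters."""
    split = [p for p in (split_author(n) for n in names) if p is not None]
    if len(split) == 1:
        return {"last": split[0][0], "first": split[0][1]}
    params = {}
    for i, (last, first) in enumerate(split, start=1):
        params["last%d" % i] = last
        params["first%d" % i] = first
    return params
```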

=Possible future improvements=

The following are ideas that may or may not be implemented in CiteCompletion at some point in the future:
 * Release CiteCompletion as an AWB plugin.
 * Set the agency field where relevant.
 * Identify and flag news articles where registration/paid access is required.
 * [Not yet known if this is feasible or actually desirable] Allow community maintenance of XML settings file.

=Alternative & related tools=
 * WP:REFLINKS – a citation insertion script that supports all sites in a generic way. It is an alternative to CiteCompletion: CiteCompletion handles its supported sites more thoroughly and can complete existing citations, whereas REFLINKS supports all sites generically (normally without detecting authors etc.) but only for bare URLs (no completion of existing templated citations).
 * User:Citation bot – a citation completion script for scientific journal citations (cite journal). It is specialised for journal citations, so it is not an alternative to CiteCompletion as such.
 * WikiCite Builder – generates citations for The New York Times etc.
 * Ubiquity citation tool – a similar tool that scrapes popular sites with jQuery using Ubiquity.