User:Monkbot/task 14: repair improper use of publisher params in cs1 templates

The next version of the Module:Citation/CS1 suite will emit error messages when 'periodical' templates do not have a 'periodical' parameter. The next version will also emit error messages when italic markup is found in publisher:

Similarly, error messages will be emitted when italic markup is found in a periodical parameter:

The purpose of task 14 is to preemptively repair the most easily repairable of these types of cs1 templates before the module-suite change goes live.

description
Wiki markup is not allowed in cs1|2 parameters that are made part of a citation's metadata; documentation is typical (an exception is made for title, where italic markup is allowed for proper title rendering – species names in a journal article title, for example). Further, information in publisher is not included in the metadata created for cs1 periodical templates; this is a limitation of the underlying metadata standard and not of Module:Citation/CS1. For readers who consume cs1 citations through the template's metadata, whatever information (corrupted or not) is held in publisher is not available, even though that information is visibly rendered.

Task 14 has several sub-tasks, defined below, that attempt to correct malformed cs1 (and in some cases cs2) templates that have wiki markup in publisher and <periodical alias> parameter values.

Task 14 skips pages that include.

definitions
For the purposes of this task, these definitions apply:
 * periodical template: any one of the following templates and redirects:
 * periodical alias: any one of the following parameters:
 * dictionary, encyclopedia, journal, magazine, newspaper, website

periodical list
Task 14 maintains a list of periodical names that it consults when making repairs. The list of periodical names is manually assembled according to these loose criteria:
 * 1) must be an online and / or print periodical
 * 2) must have an en.wiki article or be identified in an en.wiki article as a synonym, redirect, etc
 * 3) the en.wiki article must identify the periodical as a newspaper, magazine, etc
 * 4) those periodicals that share the name between periodical types (The Courier can be either newspaper or magazine) are excluded
 * 5) should not have the form of a domain name (there are some exceptions: 'dictionary.com' for example, and see below for templates with domain names)
 * 6) television sources are explicitly excluded
 * 7) corporate names are excluded (Bloomberg, BBC, ESPN) but the names of their publications are not (Bloomberg Businessweek, BBC News, ESPNscrum) – the outcome of this rfc may change this criterion
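
Criterion 4 can be illustrated with a small sketch. This is not the task's actual code (the real list is assembled by hand); the function name `build_periodical_list` and the 'Example Gazette' entry are hypothetical. It shows one way that names shared between periodical types could be kept out of a lookup table.

```python
from collections import defaultdict


def build_periodical_list(candidates):
    """Map each unambiguous periodical name to its periodical type.

    candidates: iterable of (name, periodical_type) pairs (hypothetical input).
    Names that appear with more than one type are dropped (criterion 4).
    """
    types_by_name = defaultdict(set)
    for name, periodical_type in candidates:
        types_by_name[name].add(periodical_type)
    return {
        name: next(iter(types))
        for name, types in types_by_name.items()
        if len(types) == 1
    }


candidates = [
    ("The Courier", "newspaper"),
    ("The Courier", "magazine"),       # same name, different type: excluded
    ("Example Gazette", "newspaper"),  # made-up entry for illustration
]
periodicals = build_periodical_list(candidates)
```

Keeping the type alongside the name lets a later repair pick the matching periodical alias without a second lookup.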

sub-task 1: periodical templates with italicized publisher
This sub-task operates only on the periodical templates.

Task 14 extracts <periodical name> from |publisher=''<periodical name>'' and then consults the list of known periodicals for a match. When there is an exact match, and when there is no already-existing periodical alias in the template, task 14 replaces the publisher parameter with an appropriate periodical alias and renames the template to match.

cs2 is excluded here because it isn't always possible to know if the citation refers to a periodical or to a book. The error message emitted by Module:Citation/CS1 will notify editors that these cs2 citations need repair.

When deciding to make a fix, this sub-task will make fixes only when the entire value assigned to publisher is inside proper italic wiki markup (leading '' must be balanced with trailing '').

When fixes are made, task 14 reports the number of fixes applied, the number of 'periodical' names that it does not recognize, and / or the number of templates that already have a periodical alias (conflict).
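
The decision logic of this sub-task can be sketched as follows. This is a minimal illustration and not the bot's own code: the template is modelled as a name plus a dict of parameters, `KNOWN_PERIODICALS` is a two-entry stand-in for the real periodical list, and the function name and the order of the conflict and lookup checks are assumptions.

```python
import re

# Hypothetical stand-in for the manually curated periodical list:
# periodical name -> (template name, periodical alias).
KNOWN_PERIODICALS = {
    "The New York Times": ("cite news", "newspaper"),
    "Nature": ("cite journal", "journal"),
}

PERIODICAL_ALIASES = (
    "dictionary", "encyclopedia", "journal", "magazine", "newspaper", "website",
)

# The whole value must be wrapped in exactly two apostrophes on each side.
BALANCED_ITALIC = re.compile(r"^''(?!')(.+?)(?<!')''$")


def fix_italic_publisher(template_name, params):
    """Return (template_name, params, outcome).

    outcome is 'fixed', 'conflict', 'unrecognized', or 'skipped'.
    """
    match = BALANCED_ITALIC.match(params.get("publisher", "").strip())
    if match is None or "''" in match.group(1):
        return template_name, params, "skipped"
    name = match.group(1)
    if any(params.get(alias, "").strip() for alias in PERIODICAL_ALIASES):
        return template_name, params, "conflict"
    if name not in KNOWN_PERIODICALS:
        return template_name, params, "unrecognized"
    new_template, alias = KNOWN_PERIODICALS[name]
    fixed = {}
    for key, value in params.items():
        if key == "publisher":
            fixed[alias] = name  # replace publisher in place, markup removed
        else:
            fixed[key] = value
    return new_template, fixed, "fixed"
```

The three non-'skipped' outcomes correspond to the three counts reported for this sub-task.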

sub-task 2: periodical templates with unbalanced italicized publisher
This sub-task operates only on the periodical templates.

Essentially the same as sub-task 1 except that this sub-task catches the relatively rare case where publisher has the form:
 * ''<periodical name> – no trailing or closing '' markup

When fixes are made, task 14 reports the number of fixes applied, the number of 'periodical' names that it does not recognize, and / or the number of templates that already have a periodical alias (conflict), all of these using the same counters as sub-task 1. During development, this sub-task maintains an unbalanced counter that may or may not remain in the code.
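
A minimal sketch of how the unbalanced form might be detected, assuming the value is available as a plain string; the function name `unbalanced_italic_name` is hypothetical. The extracted name would then go through the same periodical-list lookup as in sub-task 1.

```python
def unbalanced_italic_name(value):
    """Return the periodical name when value has opening '' but no closing ''.

    Returns None for balanced italics, bold markup, or plain text.
    """
    v = value.strip()
    if v.startswith("''") and not v.startswith("'''") and "''" not in v[2:]:
        return v[2:].strip() or None
    return None
```

Single apostrophes inside the name (as in a possessive) do not count as markup, so they are left alone.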

sub-task 3: cite web with italicized domain name in publisher
This sub-task operates only on cite web.

For cite web only, task 14 will repair |publisher=''<domain name>'' where <domain name> is any combination of lowercase letters, digits, 'dot', and hyphens followed by a 'dot' and a two- or three-letter (lowercase) top-level domain (validity is not assessed). When this form is encountered, and when there is no already-existing website alias, task 14 removes the wiki markup and replaces publisher with website.

When website fixes are made, task 14 reports the number of fixes applied and / or the number of templates that already have a website alias parameter (conflict).
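
The domain-name pattern described above translates fairly directly into a regular expression. The sketch below is an illustration only: the function name is hypothetical, the template is modelled as a dict of parameters, and treating every periodical alias as a conflict is an assumption based on the edit-summary description.

```python
import re

# Lowercase letters, digits, dots and hyphens, then a dot and a two- or
# three-letter lowercase top-level domain, all inside balanced italic markup.
ITALIC_DOMAIN = re.compile(r"^''([a-z0-9.-]+\.[a-z]{2,3})''$")

PERIODICAL_ALIASES = (
    "dictionary", "encyclopedia", "journal", "magazine", "newspaper", "website",
)


def fix_italic_domain(params):
    """Return (params, outcome); outcome is 'fixed', 'conflict', or 'skipped'."""
    match = ITALIC_DOMAIN.match(params.get("publisher", "").strip())
    if match is None:
        return params, "skipped"
    if any(params.get(alias, "").strip() for alias in PERIODICAL_ALIASES):
        return params, "conflict"
    fixed = {}
    for key, value in params.items():
        if key == "publisher":
            fixed["website"] = match.group(1)  # markup removed, parameter renamed
        else:
            fixed[key] = value
    return fixed, "fixed"
```

As in the description, the pattern checks only the shape of the value; it does not test whether the top-level domain exists.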

sub-task 4: periodical templates with upright publisher
This sub-task operates only on the periodical templates.

Task 14 extracts <periodical name> from |publisher=<periodical name> and then consults the list of known periodicals for a match. When there is an exact match, and when there are no already-existing periodical aliases in the template, task 14 replaces the publisher parameter with an appropriate periodical alias and renames the template to match.

When deciding to make a fix, task 14 will only make fixes when the entire value assigned to publisher is found in the list of known periodicals.

When fixes are made, task 14 reports the number of fixes applied and / or the number of templates that already have a periodical alias (conflict).

sub-task 5: periodical templates with italicized 'work'
This sub-task operates only on the periodical templates.

Task 14 extracts <periodical name> from |<periodical alias>=''<periodical name>'' and then consults the list of known periodicals for a match. When there is an exact match, task 14 removes the italic wiki markup. Task 14 may, if appropriate, rename either or both of the template and the <periodical alias> parameter.

When deciding to make a fix, task 14 will only make fixes when the entire value assigned to <periodical alias> is inside proper italic wiki markup (leading '' must be balanced with trailing ''). Repair of <periodical alias> with unbalanced wiki markup (''<periodical name>) is not currently implemented; it may be added in a subsequent sub-task or a revised version of this task.

When fixes are made, task 14 reports the number of fixes applied.

sub-task 6: all cs1|2 templates
This sub-task operates on all cs1|2 templates.

As the last of these sub-tasks, task 14 will remove italic and bold markup from <name> parameters in all cs1|2 templates; this sub-task does not query the periodical list nor does it replace template or parameter names. In early runs of task 14 as a bot, this sub-task will be disabled so that the operator has the opportunity to refine the periodical list.

When these fixes are made, task 14 reports the number of fixes applied.
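
Removing italic and bold markup amounts to deleting runs of two or more apostrophes. The sketch below assumes the template is a dict of parameters and that the caller supplies the parameter names to clean; the function name is hypothetical, and which parameters the task actually touches is not stated here beyond the 'work aliases' mentioned in the edit summaries.

```python
import re

# Two or more consecutive apostrophes are italic (''), bold ('''),
# or bold-italic (''''') markup; a single apostrophe is ordinary text.
QUOTE_MARKUP = re.compile(r"'{2,}")


def strip_italic_bold(params, names):
    """Return (new_params, count) with markup removed from the named parameters."""
    fixed = dict(params)
    count = 0
    for name in names:
        value = fixed.get(name)
        if value and QUOTE_MARKUP.search(value):
            fixed[name] = QUOTE_MARKUP.sub("", value).strip()
            count += 1
    return fixed, count
```

Because title is not passed in `names`, italic markup that is legitimately allowed there is left untouched. One limitation of this simple pattern is that a possessive directly after closing markup (''Times'''s) loses its apostrophe.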

edit summaries
Task 14 writes several short edit summary messages depending on the work it accomplished. Here's a brief description of what those messages mean:
 * cs1 template fixes : misused |publisher= (n×/n×);:always part of the edit summary, this message indicates the number of cs1 periodical templates that were modified; the first number is the count for italicized |publisher=''<periodical name>'', the second number is the count for upright |publisher=<periodical name>; numbers may be zero when none were modified [sub-task 1] [sub-task 4]
 * unbalanced (n×);:this message reports the number of publisher parameters with unbalanced italic markup (opening '' without closing '' to match) [sub-task 2]
 * skipped : :this text is inserted when one or both of the following messages are added to the edit summary [sub-task 1]
 * unrecognized periodical(s) (n×);:identifies the number of cs1 templates where <periodical name> in |publisher=''<periodical name>'' is not recognized (not listed in the task's list of periodicals); <periodical name> may not be recognized because it hasn't been added to the list (the list is manually curated), is in the list but is not an exact match, is not a periodical, or is not unique to a single journal, magazine, newspaper, website, etc. [sub-task 1] Nota bene: the number stated in this summary will include cs1 templates later fixed by sub-task 3 so may be perceived as misleading.
 * conflicting periodical(s) (n×);:task 14 does not attempt to repair cs1 templates that have both |publisher=''<periodical name>'' and a periodical alias with an assigned value; a conflict exists because cs1|2 templates are not allowed to have more than one periodical alias [sub-task 1]


 * fixed web sites (n×);:indicates the number of templates that were modified from |publisher=''<domain name>'' to |website=<domain name> [sub-task 3]
 * skipped conflicting website(s) (n×);:task 14 does not attempt to repair templates that have both |publisher=''<domain name>'' and a periodical alias with an assigned value; a conflict exists because cite web is not allowed to have more than one periodical alias [sub-task 3]
 * fixed work aliases (n×);:indicates the number of cs1 periodical templates that were modified from |<periodical alias>=''<periodical name>'' to |<periodical alias>=<periodical name> when <periodical name> is known to task 14 [sub-task 5]
 * removed markup from cs1|2 work aliases (n×);:indicates the number of cs1|2 templates that were modified from |<periodical alias>=''<periodical name>'' to |<periodical alias>=<periodical name> [sub-task 6]
 * book/cs2 skip (n×);:this message reports the number of book and cs2 templates with italic markup in publisher that task 14 has not repaired
 * ext text skip (n×);:this message reports the number of publisher parameters with 'extraneous text' that task 14 has not repaired, for example:
 * – dates, volume and issue numbers, other descriptive text; this is the most common form
 * – italic markup inside the label section of a wikilink
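
How the first few messages combine can be sketched as below. This is an illustration under assumptions: the counter names, the function name, and the exact spacing between messages are hypothetical, and only the sub-task 1, 2, and 4 messages are covered.

```python
def build_summary(counts):
    """Assemble an edit summary from a dict of counters (hypothetical keys)."""
    parts = [
        "cs1 template fixes : misused |publisher= "
        f"({counts.get('italic', 0)}×/{counts.get('upright', 0)}×);"
    ]
    if counts.get("unbalanced"):
        parts.append(f"unbalanced ({counts['unbalanced']}×);")
    skipped = []
    if counts.get("unrecognized"):
        skipped.append(f"unrecognized periodical(s) ({counts['unrecognized']}×);")
    if counts.get("conflict"):
        skipped.append(f"conflicting periodical(s) ({counts['conflict']}×);")
    if skipped:
        # 'skipped : ' appears only when at least one skip message follows it.
        parts.append("skipped : " + " ".join(skipped))
    return " ".join(parts)
```

The first message is always present, so a page where nothing was repaired still yields a summary with zero counts.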

ancillary tasks
Deletes all empty parameters from templates that are repaired.
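
With a template modelled as a dict of parameters (an assumption made for illustration; the function name is hypothetical), deleting empty parameters is a simple filter:

```python
def drop_empty_params(params):
    """Return a copy of params without parameters whose value is empty or whitespace."""
    return {name: value for name, value in params.items() if value.strip()}
```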

This task does not do awb general fixes.