User:The wub/Web2Cit

Citoid failure scenarios

 * Worst case = Breaks URL :( Privacy interstitials and the like
 * Bad = Breaks title
 * Source date - really helpful to have, but Citoid fails to pick it up on a lot of websites
 * Author - is a missing author better than an incorrect one (especially with some of the common patterns we see)? A missing author certainly looks better to readers. I would say the author isn't that important most of the time, at least for news articles; the website is more important
 * Website - ideally we could link to the relevant wiki article. But how would that work multilingually? How about Wikidata? Probably outside the scope of the project, though

Common patterns
 * Take a full ISO datetime and extract just the date: Match transformation with \d{4}-\d{2}-\d{2}
 * Split an author name into first and last
 * Decap allcaps authors
 * Multiple authors (e.g. separated by commas)
 * Convert US style (MM-DD-YYYY) date into ISO - see Anti-Defamation League for an example: regex match the parts, then range (3,1,2) to reorder them as year, month, day, then join with -
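
A rough sketch of the patterns above in plain Python, just to pin down the intended behaviour. This is not Web2Cit's transformation syntax. All function names are hypothetical, and accepting "/" as a date separator is an extra assumption on top of the MM-DD-YYYY form noted above.

```python
import re

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
# Assumption: US dates may use "-" or "/" and may omit leading zeros.
US_DATE = re.compile(r"(\d{1,2})[-/](\d{1,2})[-/](\d{4})")


def extract_iso_date(value):
    """Return the YYYY-MM-DD part of a full ISO datetime, or None."""
    match = ISO_DATE.search(value)
    return match.group(0) if match else None


def us_date_to_iso(value):
    """Turn MM-DD-YYYY into YYYY-MM-DD: match parts, reorder, join."""
    match = US_DATE.search(value)
    if match is None:
        return None
    month, day, year = match.groups()
    # Same idea as picking parts (3, 1, 2) and joining with "-".
    return "-".join([year, month.zfill(2), day.zfill(2)])


def split_authors(value):
    """Split a comma-separated byline into individual names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def decap_author(name):
    """Title-case a name only when it is written entirely in capitals."""
    return name.title() if name.isupper() else name


def split_name(name):
    """Naive split into (first, last) at the final space."""
    first, _, last = name.rpartition(" ")
    return (first, last)
```

For example, extract_iso_date("2021-03-04T10:15:00Z") gives "2021-03-04" and us_date_to_iso("03-04-2021") gives the same. The author helpers are deliberately naive: title-casing will mangle names like "McDONALD", and splitting at the last space will get multi-word surnames wrong, which is part of why a missing author may be preferable to an incorrect one.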