User:LuisVilla/sandbox/citoidtest

List of sites tested
List of top enwiki sources, drawn from the overlap of WebEmpires data and an ad-hoc internal query:


 * 1) books.google.com
 * 2) web.archive.org
 * 3) www.imdb.com
 * 4) news.bbc.co.uk
 * 5) youtube.com
 * 6) nytimes.com
 * 7) guardian.co.uk
 * 8) webcitation.org

And a few others from just one of the two searches, to capture some sources that are widely used in particular groups or otherwise might be useful to capture:
 * 1) allmusic.com
 * 2) sports-reference.com
 * 3) bbc.co.uk (in case there is a better parser for one of the sites from news.bbc.co.uk)
 * 4) amazon.com

Some sites that were in a top list, but that I didn't test:
 * 1) viaf (appears to be mostly from templates; tested anyway but both tools do basically nothing)
 * 2) toolserver.org (no standardized layout/metadata)
 * 3) stat.gov.pl (first 10-20 links I found were all 404, so hard to know what is typical)
 * 4) insee.fr (hard to know what "typical" page is on this site, but in first test page I found Citoid had a bug)
 * 5) iucnredlist.org (appears to be mostly from templates; Citoid is very minimally better in quick testing)

Next batch of news sites to test (from WebEmpires to 20-30):
 * 1) time.com
 * 2) independent.co.uk
 * 3) washingtonpost.com
 * 4) billboard.com (also on internal query list)

Other sites from SQL query (probably from templates, so not tested in first round):
 * 1) worldcat.org (600K links)
 * 2) http://amigo.geneontology.org/amigo (160K links)
 * 3) www.ncbi.nlm.nih.gov (145K links)
 * 4) nrhp.focus.nps.gov (145k links)
 * 5) isni-url.oclc.nl (140K links)
 * 6) scholar.google.com (126k)
 * 7) www.nlm.nih.gov (121k)
 * 8) id.loc.gov (110k)
 * 9) geonames.usgs.gov (108k)

Testing
Observations on testing:
 * Same-ish results for 2 sites (of 12):
    * 1 site (Sports Reference): basically the same (no useful metadata in either case).
    * 1 site (IMDB): both tools are missing or screwing up some structured data.
 * Refill is better in 5 of the remaining 10:
    * 2 sites (AllMusic, Archive.org): Refill is significantly better (e.g., sees per-page structured data like author name that Citoid does not).
    * 2 sites (BBC, YouTube): Refill is slightly better because of missing publisher information in Citoid. (This could be easy to fix for Citoid.)
    * 1 site (Guardian): both attempt to get similar metadata, but Citoid's attempt has a bug that may be difficult to fix.
 * Citoid is better in 5 of the remaining 10:
    * 3 sites (Google Books, Amazon, NYT): Citoid is vastly better (e.g., uses news or book template; draws structured data like ISBNs, publication dates, etc.).
    * 1 site (news.bbc.co.uk): Citoid is better (sees structured data Refill doesn't) but also has a fixable bug.
    * 1 site (webcitation): Citoid gets mediocre data but Refill is blacklisted, so Citoid is at least slightly better?
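
The counts in the observations above can be cross-checked with a short script. This is only a sketch: the per-site outcomes are transcribed by hand from the list above, and the labels ("same", "Refill", "Citoid") and the names OUTCOMES and tally are made up for illustration, not output from either tool.

```python
from collections import Counter

# Outcome per tested site, transcribed from the observations above.
# "same" = roughly equal results; otherwise the name of the better tool.
# These labels are illustrative assumptions, not produced by Citoid or Refill.
OUTCOMES = {
    "sports-reference.com": "same",
    "www.imdb.com": "same",
    "allmusic.com": "Refill",
    "web.archive.org": "Refill",
    "bbc.co.uk": "Refill",
    "youtube.com": "Refill",
    "guardian.co.uk": "Refill",
    "books.google.com": "Citoid",
    "amazon.com": "Citoid",
    "nytimes.com": "Citoid",
    "news.bbc.co.uk": "Citoid",
    "webcitation.org": "Citoid",
}


def tally(outcomes):
    """Count how many sites fall under each outcome label."""
    return Counter(outcomes.values())


counts = tally(OUTCOMES)
for label in ("same", "Refill", "Citoid"):
    print(f"{label}: {counts[label]} of {len(OUTCOMES)}")
```

Keeping one entry per site makes it easy to add the next batch of news sites and re-run the tally.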