User talk:MarioGom/Linked media outlets

Sources?
May I ask how you created this list? It's a great list, I can see a lot of uses for it, if we can show that it's accurate. --GRuban (talk) 19:30, 2 April 2018 (UTC)
 * GRuban: It is extracted from a database dump (enwiki-20180320-externallinks.sql.gz). All domains from external links in the main namespace were extracted. Subdomains were removed (e.g. news.bbc.co.uk became bbc.co.uk). Then the occurrences of each domain were counted, and the domains were sorted from most to least frequent. That was the automated part. Then I went through the list manually, from the top down to a link count of 15k, and wrote this table. Some sites had multiple domains that could not be merged automatically (e.g. bbc.com, bbc.co.uk). I merged these manually and added up their counts.
 * There are some accuracy problems in the current version because some domains were not merged correctly (corner cases such as per-state domains in the USA and Canada, e.g. nj.us), but I will update it soon with fixes.
 * Since I checked the domains manually, it is also possible that I missed some media outlets. To ensure accuracy and make the list verifiable, every site could be added, not just media outlets, so that people can check that the classification is correct.
 * Note that I excluded all sports-only news sites. In retrospect, I think it would still be interesting to add them too. -MarioGom (talk) 20:56, 2 April 2018 (UTC)
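
For anyone who wants to reproduce or check the automated part, the steps described above (strip subdomains, merge known aliases, count, sort) could look roughly like the following. This is only a minimal sketch, not the script that was actually used: the names `MULTI_LABEL_SUFFIXES`, `MANUAL_MERGES`, `base_domain` and `count_domains` are hypothetical, the suffix set is a tiny hand-picked assumption rather than a full public suffix list, and the input is assumed to be a plain iterable of URL strings already pulled out of the dump.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: a deliberately tiny, hand-picked set of two-label suffixes.
# A real run would need a complete public suffix list to handle cases
# such as per-state domains correctly.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "nj.us"}

# Assumption: a hypothetical manual merge table for outlets that use
# several unrelated domains.
MANUAL_MERGES = {"bbc.com": "bbc.co.uk"}


def base_domain(url):
    """Return the domain of a URL with subdomains removed, or None."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2:
        return None
    # Keep three labels when the last two form a known multi-label suffix.
    keep = 3 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def count_domains(urls):
    """Count links per merged domain, sorted from most to least frequent."""
    counts = Counter()
    for url in urls:
        domain = base_domain(url)
        if domain is not None:
            counts[MANUAL_MERGES.get(domain, domain)] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        "http://news.bbc.co.uk/a",
        "https://www.bbc.com/b",
        "https://www.nytimes.com/c",
        "http://bbc.co.uk/d",
    ]
    for domain, n in count_domains(sample):
        print(domain, n)
```

Keeping the suffix handling and the merge table as separate, explicit data makes the manual decisions visible, so other editors can review and correct them instead of having to trust the final counts.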