Wikipedia:Link rot/URL change requests/Archives/2020/October

Racing Post (cont..)
Continuation of the above; the next phase covers www.racingpost.com URLs (vs. bloodstock.racingpost.com).

Conversion types found:
 * jockey_id = http://www.racingpost.com/horses/jockey_home.sd?jockey_id=14404 -> https://www.racingpost.com/profile/jockey/14404
 * horse_id = http://www.racingpost.com/horses/horse_home_popup.sd?horse_id=603642 -> https://www.racingpost.com/profile/horse/603642
 * owner_id = http://www.racingpost.com/horses/owner_home.sd?owner_id=11141 -> https://www.racingpost.com/profile/owner/11141
 * race_id = http://www.racingpost.com/horses/result_home.sd?race_id=100112&r_date=1990-06-22&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS -> ??????
 * trainer_id = http://www.racingpost.com/horses/trainer_home.sd?trainer_id=7005 -> https://www.racingpost.com/profile/trainer/7005
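
The four known conversions above could be expressed roughly as follows. This is only a sketch: the function name `convert_profile_url` and the lookup table are illustrative, not the bot's actual code, and race_id is deliberately left out because no formula is known at this point.

```python
from urllib.parse import urlparse, parse_qs

# Old page name -> (query parameter, new profile type).
# Only the four conversions listed above; race_id has no known formula here.
PROFILE_PAGES = {
    "jockey_home.sd": ("jockey_id", "jockey"),
    "horse_home_popup.sd": ("horse_id", "horse"),
    "owner_home.sd": ("owner_id", "owner"),
    "trainer_home.sd": ("trainer_id", "trainer"),
}


def convert_profile_url(old_url):
    """Return the new-style profile URL, or None if no formula applies."""
    parsed = urlparse(old_url)
    page = parsed.path.rsplit("/", 1)[-1]
    if page not in PROFILE_PAGES:
        return None
    param, kind = PROFILE_PAGES[page]
    values = parse_qs(parsed.query).get(param)
    # Require a purely numeric ID so malformed links are not converted.
    if not values or not values[0].isdigit():
        return None
    return f"https://www.racingpost.com/profile/{kind}/{values[0]}"
```

A URL built this way would still need to be checked against the live site before use, since the formula says nothing about whether the profile exists.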

Soft404 ("S404") examples to watch for:
 * -> https://www.racingpost.com/news/#newsArchiveTabs=last7DaysNews
 * -> https://www.racingpost.com/results/#results_top_tabs=re_&results_bottom_tabs=ANALYSIS
 * -> https://www.racingpost.com/?authme
 * 0-length pages
 * Others?
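
A soft-404 check along these lines might look like the sketch below. It assumes the final URL (after redirects) and optionally the response body are already in hand. `is_soft404` and `KNOWN_S404` are hypothetical names, and the list only covers the three landing pages and the 0-length case named above.

```python
from urllib.parse import urlsplit

# (path, query, fragment) of the known soft-404 landing pages listed above.
KNOWN_S404 = {
    ("/news/", "", "newsArchiveTabs=last7DaysNews"),
    ("/results/", "", "results_top_tabs=re_&results_bottom_tabs=ANALYSIS"),
    ("/", "authme", ""),
}


def is_soft404(final_url, body=None):
    """True if the page looks like a soft 404.

    body is the raw response content; pass None if it was not fetched.
    """
    if body is not None and len(body) == 0:
        return True  # 0-length page
    parts = urlsplit(final_url)
    return (parts.path, parts.query, parts.fragment) in KNOWN_S404
```

Any further soft-404 patterns found during a run would simply be added to the set.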

Approach:
 * Check all URLs (~40,000).
 * If a redirect exists and the redirect URL is not a 404 or a known S404, change the URL.
 * If there is no redirect, the status is 200 and the page is not an S404, leave as-is.
 * If there is no redirect and the page is a 404/S404, attempt to create a new URL using the formulas above and verify whether it works.
 * If the URL is dead, add an archive URL as a last resort.
 * Use publisher for any of the "_id" URLs or anything with /horse|jockey|owner|trainer|results/. Use work for everything else.
 * Convert existing publisher and work to uniform values.

-- Green  C  03:54, 27 September 2020 (UTC)
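
The approach in the comment above can be sketched as a small decision function. This is an illustration, not the bot's code: the names `Action`, `decide` and `cite_field` are made up, "live" is taken to mean status 200 and not an S404, and a redirect whose target is dead is assumed to fall through to the formula and archive steps (the list does not say). The publisher/work rule treats the slash-delimited pattern as a path-segment match, which is also an assumption.

```python
import re
from enum import Enum


class Action(Enum):
    USE_REDIRECT = "change URL to the redirect target"
    KEEP = "leave as-is"
    USE_FORMULA = "change URL to the formula-built URL"
    ADD_ARCHIVE = "add archive URL"


def decide(redirected, final_is_live, formula_is_live):
    """Pick an action from three observations about a link.

    redirected      -- the request was redirected to another URL
    final_is_live   -- the final URL returned 200 and is not a soft 404
    formula_is_live -- a formula-built URL exists and was verified live
    """
    if redirected and final_is_live:
        return Action.USE_REDIRECT
    if not redirected and final_is_live:
        return Action.KEEP
    if formula_is_live:
        return Action.USE_FORMULA
    return Action.ADD_ARCHIVE  # last resort


# "_id" URLs, or paths containing one of the listed segments, use publisher.
PUBLISHER_RE = re.compile(r"_id=|/(horse|jockey|owner|trainer|results)/")


def cite_field(url):
    """Return which citation parameter should carry the site name."""
    return "publisher" if PUBLISHER_RE.search(url) else "work"
```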


 * Thanks for this. There are also some URLs of the form  which are now located at  . In those cases, the horse name is required in the URL (with modifications - lowercasing, removal of special characters, conversion of embedded spaces to hyphens), and the horse referred to is not necessarily the one in the article title, so the work probably can't readily be done by bot. But a list of such URLs would be useful so they can be fixed by hand. Colonies Chris (talk) 20:38, 27 September 2020 (UTC)
 * Redirects exist: try https://www.racingpost.com/profile/horse/439081 .. the bot checks for redirects and uses the redirect URL which includes the horse name. -- Green  C  20:43, 27 September 2020 (UTC)
 * Yes but the redirect doesn't handle the tab parameter ; for that, the full url including the horse name is required and the parameter of  has to change to a subfolder named  . Does the bot handle that? Colonies Chris (talk) 22:33, 27 September 2020 (UTC)
 * is a dead link (soft404).  opens a redirect to   which works which is what the bot uses. --  Green  C  23:32, 27 September 2020 (UTC)
 * Ah I see, progeny-sales tab is premium content you have to be logged in/subscribed to access it. When not logged in returns a S404  --  Green  C  23:37, 27 September 2020 (UTC)

There were 42 (!) URLs of this type that were converted:


 * Ameerat | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=533879#damTabs=dam_stories |  https://www.racingpost.com/profile/horse/533879
 * Attraction (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=578804#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/578804/attraction
 * Balanchine (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=84195#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/84195
 * Brave Inca | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=559216#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/559216
 * Camelot (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=582086#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/582086
 * Continent (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=519979#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/519979
 * Dream Ahead | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=466062#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/466062
 * Embassy (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=34045#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/34045/pass-the-peace
 * Embassy (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=467373#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/467373
 * Fasliyev | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=507326#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/507326
 * First Trump | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=1088#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/1088
 * Flakey Dove | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=55460#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/55460
 * Fleeting Spirit | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=580556#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/580556
 * Footstepsinthesand | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=84227#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/84227/glatisant
 * Gay Gallanta | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=413483#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/413483
 * Hever Golf Rose | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=438001#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/438001/sweet-rosina
 * I'll Have Another | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=791223#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/791223
 * Kris Kin | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=105814#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/105814
 * Lyric Fantasy | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=412677#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/412677
 * Mister Baileys | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=439081#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/439081
 * Mozart (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=456290#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/456290
 * Mutafaweq | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=51712#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/51712
 * Natagora | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=679626#damTabs=dam_stories |  https://www.racingpost.com/profile/horse/679626
 * Oath (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=434827#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/434827
 * Opera House (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=407374#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/407374
 * Pebbles (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=428589#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/428589
 * Petrushka (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=75702#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/75702
 * Ramruma | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=430163#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/430163
 * Red Clubs | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=482383#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/482383/two-clubs
 * Reverence (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=84513#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/84513
 * Revoque | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=4#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/4
 * Rewilding (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=408859#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/408859
 * Rite of Passage (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=634240#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/634240/dahlias-krissy
 * Rooster Booster (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=19331#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/19331
 * Russian Rhythm | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=489286#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/489286
 * Sleepytime | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=449470#damTabs=dam_stories |  https://www.racingpost.com/profile/horse/449470
 * Stravinsky (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=51246#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/51246
 * Superstar Leo | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=15435#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/15435/council-rock
 * Thunder And Roses | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=454748#damTabs=horse_relatives |  https://www.racingpost.com/profile/horse/454748
 * Tobougg | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=419735#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/419735
 * Torgau (horse) | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=435854#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/435854
 * Vinnie Roe | http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=418684#damTabs=dam_progeny_sales |  https://www.racingpost.com/profile/horse/418684

-- Green  C  23:53, 27 September 2020 (UTC)


 * Thanks. I'll sort those out by hand and flag them as subscription required. Colonies Chris (talk) 15:25, 28 September 2020 (UTC)
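
For anyone fixing these by hand, the horse-name modifications described earlier in this thread (lowercasing, removal of special characters, spaces to hyphens) could be sketched as below. `horse_slug` and `horse_profile_url` are hypothetical helpers. The source names used in the usage note ("Pass The Peace", "Dahlia's Krissy") are inferred from the slugs in the list above, and how the site treats accented letters is not known, so this sketch simply drops them.

```python
import re


def horse_slug(name):
    """Lowercase, drop special characters, turn spaces into hyphens."""
    s = name.strip().lower()
    s = re.sub(r"[^a-z0-9\s-]", "", s)  # assumption: anything else is removed
    return re.sub(r"\s+", "-", s)


def horse_profile_url(horse_id, name):
    """Build the full profile URL including the horse-name segment."""
    return f"https://www.racingpost.com/profile/horse/{horse_id}/{horse_slug(name)}"
```

For example, `horse_profile_url(34045, "Pass The Peace")` gives the same URL as the first Embassy (horse) entry in the list, and `horse_slug("Dahlia's Krissy")` gives `dahlias-krissy`.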

I've worked out that for URLs of the form, the correct destination is. The id and date can be derived from the source URL, but the course (here, "catterick") has to come from context and the digits after  are a bit of a mystery - I suspect they may correspond to the position in an alphabetical list of courses. I derived the destination URL by entering the date and course in the advanced search facility at, but I can't see any way this could be automated, unfortunately. Colonies Chris (talk) 22:06, 28 September 2020 (UTC)
 * Ah interesting. It is also the majority of links, so worth pursuing (in the old "http://" set). By looking at the HTML of the archive url it contains . Then, by extracting the working links, we can determine there are 107 known combinations. For example:


 * 1016 riyadh
 * 104 yarmouth
 * 1079 kempton-aw
 * 107 york
 * 1083 chelmsford-aw
 * 10 catterick
 * 1138 dundalk-aw
 * 1190 pakenham
 * 11 cheltenham
 * 1212 ffos-las
 * 1231 meydan

The numbers and names appear to be consistent, so we now have a map. The final step is extracting the target name from the archive HTML title and finding it on the map. It probably won't always be an exact match, so a fuzzy match might be required. I'll work on it, hopefully this week. -- Green  C  03:06, 29 September 2020 (UTC)
 * That would be great. I've come across yet another format: ; this one seems to be fairly fixable - it should go to  . Colonies Chris (talk) 09:18, 29 September 2020 (UTC)
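
The map-plus-fuzzy-match idea could look something like the sketch below, using Python's standard difflib. It is not the bot's code. The map holds only the eleven sample entries listed above, the URL shape (/results/ID/course/date/race_id) is taken from the results URLs seen on the new site, and the cleanup in `normalise` (dropping a parenthesised suffix such as a country code) and the 0.8 cutoff are guesses.

```python
import difflib
import re

# Sample of the course map from the list above (name -> course ID).
COURSE_IDS = {
    "riyadh": 1016, "yarmouth": 104, "kempton-aw": 1079, "york": 107,
    "chelmsford-aw": 1083, "catterick": 10, "dundalk-aw": 1138,
    "pakenham": 1190, "cheltenham": 11, "ffos-las": 1212, "meydan": 1231,
}


def normalise(title_course):
    """Turn a course name from an archive title into map-key form."""
    s = re.sub(r"\(.*?\)", "", title_course).strip().lower()
    return re.sub(r"\s+", "-", s)


def match_course(title_course, cutoff=0.8):
    """Return the best-matching map key, or None if nothing is close."""
    key = normalise(title_course)
    if key in COURSE_IDS:
        return key
    hits = difflib.get_close_matches(key, list(COURSE_IDS), n=1, cutoff=cutoff)
    return hits[0] if hits else None


def results_url(title_course, date_iso, race_id):
    """Build a new-style results URL, or None if the course is unknown."""
    course = match_course(title_course)
    if course is None:
        return None
    return (f"https://www.racingpost.com/results/"
            f"{COURSE_IDS[course]}/{course}/{date_iso}/{race_id}")
```

Returning None for an unmatched course keeps the link on its archive URL rather than guessing at a live one.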


 * Hi I uploaded about 100 diffs using the new code with fuzzy matching and course mapping etc. Can you take a look (example). I plan on processing the remaining 2,500 or so articles tomorrow. Thank you! --  Green  C  04:32, 1 October 2020 (UTC)
 * I've checked a random selection, and found just one problem, in Daylami; it looks like the original URLs were invalid (e.g. ), so the bot has misinterpreted them. Otherwise, looking fine. I'm intrigued that the bot has rescued some citations from archives because they're not actually dead? (e.g. in Istabraq). Colonies Chris (talk) 13:32, 1 October 2020 (UTC)
 * Yes, Daylami was trouble; there were a dozen or two where the URL used dmy dates instead of ISO. I fixed them, reprocessed and posted the diff manually. Hopefully that is the only one, as the bot is not checking for them. In Istabraq, that is a standard bot function, because editors sometimes misunderstand the purpose of archive-url. It's for web archive URLs like web.archive.org or archive.today, but sometimes they put the original URL there, thinking it automatically turns into an archive. By taking the spot, it actually prevents bots from adding an archive URL (when the link dies). -- Green  C  13:40, 1 October 2020 (UTC)
 * Just one little quirk I've noticed; where the bot adds, it seems to consistently add it between the last and first parameters of the author name. Not fundamentally a problem, of course, but a little strange from the point of view of some later editor. Could it be placed elsewhere? Colonies Chris (talk) 16:28, 1 October 2020 (UTC)
 * Well, it's not intentional or consistent; like in this case, it first removed newspaper then added work. It should be programmed to add it following the url, but there are some conditions (like when url is the last argument) where it will instead add it following the first argument, and if the first argument is "first" that's probably what happened. I'd have to see an example to know for sure. --  Green  C  16:52, 1 October 2020 (UTC)
 * Here are a couple of examples: Critérium International (horse race), Triumph Hurdle. There are quite a few like those. Colonies Chris (talk) 17:38, 1 October 2020 (UTC)
 * I know this seems like an easy fix but it gets into the core functions of a library. I'd rather not go into the code and want to finish today assuming no serious problems arise. Also half are already processed and waiting to upload diffs and there is no way to determine where this happened without reprocessing and logging the entire set which would be a lot of lost work. It's a side effect of the first and last being used often in those positions and the way the software library works. For the second half I'll try telling it to follow the title instead of url maybe that will make a difference. -- Green  C  18:51, 1 October 2020 (UTC)
 * Was able to change second to last instead of second from first. --  Green  C  01:45, 2 October 2020 (UTC)
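
The dmy-versus-ISO problem noted for Daylami earlier in this thread could be guarded against with a check like the sketch below. The exact dmy layout used in those URLs is not shown here, so the two dmy formats tried are assumptions, and `to_iso_date` is a hypothetical helper rather than anything the bot is known to contain.

```python
from datetime import datetime

# ISO first, then two assumed day-month-year layouts.
DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y")


def to_iso_date(r_date):
    """Return r_date as YYYY-MM-DD, or None if it cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(r_date, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

A None result would flag the URL for manual review instead of letting a misread date through.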


 * Missing courses in the map
 * The map above is incomplete because that is all we had on Wikipedia to test against. During the run the bot found 374 URLs that have no map entry but for which it was able to determine the course name. They comprise 75 courses:

 * aqueduct 255
 * arlington-park 276
 * auteuil 205
 * baden-baden 207
 * ballingarry ?
 * bangor-on-dee 4
 * brighton 7
 * cagnes-sur-mer 216
 * camden ?
 * cartmel 9
 * caulfield 469
 * chester 13
 * churchill-downs 308
 * clonmel 177
 * cologne 226
 * compiegne 291
 * delaware-park 248
 * del-mar 444
 * delta-downs ?
 * doha 1196
 * doomben 467
 * downpatrick 179
 * down-royal 180
 * dusseldorf 240
 * ellis-park 638
 * evry ?
 * exeter 14
 * fair-grounds 742
 * fair-hill ?
 * fakenham 18
 * flemington 297
 * folkestone 19
 * frankfurt 231
 * hawthorne 604
 * hollywood-park ?
 * hoppegarten 440
 * huntingdon 26
 * kenilworth 508
 * kranji 794
 * la-zarzuela 449
 * le-lion-d'angers 313
 * lone-star-park 674
 * los-alamitos 1307
 * ludlow 34
 * lyon-parilly 541
 * market-rasen 35
 * monmouth-park 253
 * moonee-valley 299
 * musselburgh 16
 * nad-al-sheba 483
 * nancy 559
 * newton-abbot 39
 * parx 578
 * pimlico 221
 * pisa 284
 * plumpton 44
 * prairie-meadows 808
 * quakerstown ?
 * randwick 471
 * rosehill 311
 * saint-brieuc 713
 * santa-anita 257
 * saratoga 445
 * sedgefield 57
 * southwell 61
 * stratford 67
 * taby 271
 * tampa-bay-downs 724
 * taunton 73
 * towcester 83
 * turin ?
 * uttoxeter 84
 * wincanton 90
 * wissembourg ?
 * worcester 101

If an ID number was found for each the bot could convert those URLs from archived to live. Probably by searching for the course on the website. -- Green  C  01:45, 2 October 2020 (UTC)

-- Green  C  01:45, 2 October 2020 (UTC)
 * Results
 * Edits to 2,202 articles
 * 8,231 changes to metadata
 * Switch 5,664 URLs from old to new site
 * Add 1,434 archive URLs
 * Add 179
 * Convert 118 bare to square links
 * Removed 4 archive URLs
 * I've edited the list of racecourses above to add the reference numbers, where I could determine them (? indicates the ones I couldn't). Colonies Chris (talk) 10:56, 5 October 2020 (UTC)
 * Wonderful! Example. Much better. Looks like 338 URLs in 107 articles converted from archives to the new live form. There are still around 200 race_id URLs unconverted but they are probably among the "?" or because the track is not identified in the archive URL title field. There were a few mistakes this run for mysterious reasons, I reverted them but probably related to site timeouts and how convoluted their headers are. Open the hood or take down the walls and find the craftwork underneath and often it's strung together in strange ways, much like the Internet. --  Green  C  17:57, 5 October 2020 (UTC)
 * That's great. I've fixed a couple of those by hand, and I'll take a look at the race_id ones to see if I can find any way of tracking down the intended pages. The search-style ones can be located by using the horse name in the search facility at , but it's often necessary to pick from several results, so that will have to be a manual process. Fortunately there aren't many of those. I suspect the 'story=' ones are gone completely and will just have to rely on archived versions. Colonies Chris (talk) 20:01, 5 October 2020 (UTC)
 * I've managed to track down the remaining course ids - Wissembourg 750; Turin 295; Quakerstown 1256; Hollywood Park 259; Fair Hill 307; Evry 208; Delta Downs 923; Camden 230; Ballingarry 372. Colonies Chris (talk) 08:30, 6 October 2020 (UTC)
 * Thanks, and done. It only fixed 22 URLs with the new map additions, adding some for delta-downs, evry, hollywood-park, quakerstown, turin and wissembourg .. none for 'camden' for example. In Lonesome Glory is old url which the archive.org header identifies as 'Camden (USA)' and URL https://www.racingpost.com/results/207/camden/1994-11-13/60294 is 404 .. so I'm assuming any it couldn't match are legitimate 404s. -- Green  C  20:15, 6 October 2020 (UTC)
 * I've tried searching for some of the  ones, and found them all so far: e.g.   --> , but that conversion only works if the racecourse name is available to the bot, so I suppose it'll just have to be hand-fixing for those. Colonies Chris (talk) 09:46, 7 October 2020 (UTC)
 * The bot has the frankfurt mapping in the collapsed box above, but the source URL is missing the date  --  Green  C  13:51, 7 October 2020 (UTC)

TV by the Numbers
When TV by the Numbers became defunct this past January, all of its ratings URLs went dead. The main URL, https://tvbythenumbers.zap2it.com, now just redirects to https://tvlistings.zap2it.com/?aid=gapzap (just the TV Listings). For example, http://tvbythenumbers.zap2it.com/2016/09/22/wednesday-final-ratings-sept-21-2016 redirects to https://tvlistings.zap2it.com/?aid=gapzap as well. Is it possible for a bot to fix this problem? The dead URLs of TV by the Numbers affect a lot of American television series articles. — Young Forever (talk)   21:54, 25 October 2020 (UTC)
 * What would the fix be: a different URL at tvlistings.zap2it.com, or an archive URL? -- Green  C  23:16, 25 October 2020 (UTC)
 * Or you just said dead site so I guess archives. -- Green  C  23:17, 25 October 2020 (UTC)
 * Archive urls of the dead urls of TV by the Numbers. — Young Forever (talk)   23:37, 25 October 2020 (UTC)
 * Starting to process over 8,000 articles.. this is when I go watch TV :) -- Green  C  00:22, 26 October 2020 (UTC)
 * Lol. I want to point out that they also affect the List of episodes and/or season articles. — Young Forever (talk)   02:12, 26 October 2020 (UTC)
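
The redirect behaviour described at the top of this request suggests a simple test for which links need an archive. The sketch below is only an illustration under that assumption: the function names are made up, and it takes the final URL after redirects as an input rather than fetching anything itself.

```python
from urllib.parse import urlsplit


def is_tvbtn_url(url):
    """True for links on the defunct TV by the Numbers host."""
    return urlsplit(url).hostname == "tvbythenumbers.zap2it.com"


def redirected_to_listings(final_url):
    """True if the final URL is the generic TV Listings front page."""
    parts = urlsplit(final_url)
    return parts.hostname == "tvlistings.zap2it.com" and parts.path in ("", "/")


def needs_archive(original_url, final_url):
    """A TVbtN link that lands on the listings page is effectively dead."""
    return is_tvbtn_url(original_url) and redirected_to_listings(final_url)
```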

Results

Completed: If you see anything it missed let me know. Good find, 43k is a lot. -- Green  C  23:29, 26 October 2020 (UTC)
 * Checked 8,107 articles containing tvbythenumbers.zap2it.com
 * Edited 7,014 articles (the difference is due to links already being archived)
 * Added archives for 43,635 URLs (cites, bare and square links)
 * Unable to find archives for 211 links, added (list avail on request)
 * Converted 13 instances of to square links with archives.
 * Thank you very much! I will let you know if I see anything that your bot missed. — Young Forever (talk)   23:54, 26 October 2020 (UTC)