Wikipedia:Bots/Requests for approval/BaranBOT 2

BaranBOT 2
Operator:

Time filed: 14:01, Monday, May 27, 2024 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Python

Source code available:

Function overview: Fix the URLs for the ECI election database.

Links to relevant discussions (where appropriate):

Edit period(s): Every six months

Estimated number of pages affected: 5050

Exclusion compliant (Yes/No): No

Already has a bot flag (Yes/No): No

Function details: The Election Commission of India has moved all of its data (except for very recent elections) to a subdomain. As a result, URLs in more than 5000 pages are now invalid and return a 404 error. This bot will replace each affected URL with its new equivalent on the subdomain, which is a simple substitution of the domain part of the URL.

Discussion
Why every six months? Primefac (talk) 18:28, 27 May 2024 (UTC)


 * In India, elections are held in 5-6 states every year. As the elections approach or conclude, the ECI moves data from previous elections to this subdomain. This means that many URLs will become invalid after each year's elections. – DreamRimmer (talk) 22:19, 27 May 2024 (UTC)
 * Apologies if this is coming across as dense, just want to make sure I'm on the same page. Let's arbitrarily say that there's an election in July 2024, and the URL for those pages is on the main domain since it's a "recent election". At what point will that URL get archived to the old.eci prefix? If it is archived after the subsequent election, why not just update the URL with the new election information along with the data it represents? Primefac (talk) 15:00, 6 June 2024 (UTC)
 * The problem is that I don't know when ECI moves older election results to the old.eci URL. The recent elections, held in November 2023 in six states, were six months ago. So far, the ECI has moved three sets of election data to the old.eci domain. This suggests that they archive election data within six to ten months. For now, we can fix all these broken links, but we might need to do this again for future elections. If the BRFA folks think it's unnecessary to do this regularly (every six months), it's fine to handle it once. I'll try to submit a new BRFA in the future, and we can continue regularly if needed. – DreamRimmer (talk) 14:01, 7 June 2024 (UTC)


 * Previous discussion Link_rot/URL_change_requests. Geoblocking is preventing outside-India bots and DreamRimmer has India IP access. DreamRimmer, to caution, there are many non-obvious problems that can arise when operating on URLs. Probably the biggest is archive URLs you don't want to modify. This PCRE regex should capture only non-archive URLs (untested):
 * Also verify the new URL is working before switching: do a header check and don't assume, since websites always have error rates, some higher than others. Other issues might arise; most problems will show up during the first 100 or so edits. Common trouble points are url-status, and . Also links that are square and bare. It might be too difficult to get all these exactly right, but if you can change the main URL and square URLs and verify the new URL works, that will go a long way! -- GreenC 15:51, 8 June 2024 (UTC)
 * I would definitely be cautious to avoid any potential mistakes. – DreamRimmer (talk) 16:57, 14 June 2024 (UTC)
 * Let's see how things get on. Primefac (talk) 15:25, 27 June 2024 (UTC)
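
The two cautions raised above (leave archive URLs alone, and confirm the new URL responds before switching) can be sketched roughly as follows. This is only an illustration, not the bot's actual code and not the untested PCRE regex mentioned in the thread: the pattern `LIVE_ECI_URL`, the helper `head_ok`, and the function `rewrite_wikitext` are hypothetical names, and the rule "skip any eci.gov.in URL directly preceded by a slash" is an assumed heuristic for spotting copies embedded in archive URLs.

```python
import re
import urllib.request
from typing import Callable

# Hypothetical pattern: an eci.gov.in URL that is NOT directly preceded by "/".
# Copies embedded in archive URLs (".../web/2020/https://eci.gov.in/...") are
# preceded by "/", so the negative lookbehind leaves them untouched.
LIVE_ECI_URL = re.compile(r"(?<!/)https?://eci\.gov\.in(/[^\s|\]<>}]*)")


def head_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to `url` answers with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are all OSError subclasses.
        return False


def rewrite_wikitext(text: str, is_live: Callable[[str], bool] = head_ok) -> str:
    """Point non-archive eci.gov.in URLs at old.eci.gov.in if the new URL responds."""

    def swap(match: re.Match) -> str:
        candidate = "https://old.eci.gov.in" + match.group(1)
        # Keep the original URL when the candidate cannot be verified.
        return candidate if is_live(candidate) else match.group(0)

    return LIVE_ECI_URL.sub(swap, text)
```

Passing the liveness check in as a parameter keeps the substitution testable without network access; a real run from an India IP would use the default `head_ok`. Handling of url-status and other citation parameters is deliberately out of scope here.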


 * Edits. Everything worked as intended, and all the new URLs are working fine. I want to note that initially, the bot did not change the url-status for Mangolpuri Assembly constituency, Chandni Chowk Assembly constituency, and Bawana Assembly constituency, but this issue has since been resolved and now functions correctly. Pinging User:GreenC if they want to take a look. – DreamRimmer (talk) 13:41, 3 July 2024 (UTC)
 * I spot checked, don't see any problems. Can you confirm if it also modifies these types:
 * Title
 * https://eci.gov.in/files/file/4053-andhra-pradesh-2004/
 * i.e. square (1) and bare (2) links. -- GreenC 17:56, 5 July 2024 (UTC)

Note: these links are georestricted to India IPs and can't be archived, or at least not archived very well. I found an article in The Hindu that talks about it. The article mentions one of our most technically knowledgeable editors, User:Nemo_bis: "Nemo has studied 'geofencing' of Indian government websites in the past, and in 2020 created a proxy service for users located abroad to access Indian government websites". This might be our solution. I hope Nemo has a working proxy for the Election Commission website? -- GreenC 17:58, 5 July 2024 (UTC)


 * @GreenC, I am fixing all the links that start with https://eci.gov.in/files/file/, https://eci.gov.in/category, and https://eci.gov.in/ByeElection/. All these links are archived in a subdomain. The links for the 2023 elections of Chhattisgarh, Telangana, Rajasthan, Mizoram, and Madhya Pradesh are still working and have not been moved to the old subdomain, so I will not touch them.
 * The working links are formatted as follows (e.g.):
 * https://www.eci.gov.in/chhattisgarh-legislative-election-2023-statistical-report
 * https://www.eci.gov.in/mp-legislative-election-2023-statistical-report
 * https://www.eci.gov.in/mizoram-legislative-election-2023-statistical-report
 * The old election links are formatted as follows (e.g.):
 * https://eci.gov.in/files/file/9643-statistical-data-of-general-election-to-chhatisgarh-assembly-2018/ (now https://old.eci.gov.in/files/file/9643-statistical-data-of-general-election-to-chhatisgarh-assembly-2018/)
 * https://eci.gov.in/files/file/9685-madhya-pradesh-legislative-election-2018-statistical-report/ (now https://old.eci.gov.in/files/file/9685-madhya-pradesh-legislative-election-2018-statistical-report/)
 * https://eci.gov.in/files/file/9687-mizoram-legislative-election-2018-statistical-report/ (now https://old.eci.gov.in/files/file/9687-mizoram-legislative-election-2018-statistical-report/)
 * Other links that start with https://eci.gov.in/category and https://eci.gov.in/ByeElection/ have all been moved to the subdomain, so I will need to fix them. – DreamRimmer (talk) 14:03, 6 July 2024 (UTC)
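
The scope described above (only the three listed path prefixes on the bare eci.gov.in host are moved, while the www.eci.gov.in links for the 2023 elections are left alone) could be expressed along these lines. This is a minimal sketch under those stated assumptions, not the bot's actual source; `MOVED_PREFIXES` and `migrate_url` are hypothetical names, and forcing the `https` scheme on the result is also an assumption.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Path prefixes listed in the discussion as moved to the old.eci.gov.in subdomain.
MOVED_PREFIXES = ("/files/file/", "/category", "/ByeElection/")


def migrate_url(url: str) -> Optional[str]:
    """Return the old.eci.gov.in form of `url`, or None if it is out of scope."""
    parts = urlsplit(url)
    if parts.hostname != "eci.gov.in":
        # www.eci.gov.in (2023 reports) and already-migrated URLs are left alone.
        return None
    if not parts.path.startswith(MOVED_PREFIXES):
        return None
    # Assumption: the path, query and fragment carry over unchanged.
    return urlunsplit(("https", "old.eci.gov.in", parts.path, parts.query, parts.fragment))
```

For example, `migrate_url("https://eci.gov.in/files/file/9687-mizoram-legislative-election-2018-statistical-report/")` yields the corresponding old.eci.gov.in address, while the 2023 www.eci.gov.in links return `None` and are skipped.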