Wikipedia:Bots/Requests for approval/FABLEBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard. The result of the discussion was Withdrawn.

FABLEBot
Operator:

Time filed: 16:18, Tuesday, June 6, 2023 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Python

Source code available: Plan to open-source, but not yet ready for release

Function overview: For every broken external reference in any English Wikipedia article, the bot will check if the page previously available at that link still exists on the web at an alternate URL. If successful, the bot will patch the reference to point to the new URL.

Links to relevant discussions (where appropriate): https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)/Archive_191#Request_for_comments_on_research_study https://en.wikipedia.org/wiki/User:FABLEBot/New_URLs_for_permanently_dead_external_links

Edit period(s): Manually start a new run once every few months

Estimated number of pages affected: All articles linked from https://en.wikipedia.org/wiki/Category:Articles_with_permanently_dead_external_links

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): No

Function details: 1. The bot will iterate over every article linked from https://en.wikipedia.org/wiki/Category:Articles_with_permanently_dead_external_links and scrape all external links in those articles.

2. For each link tagged "permanent dead link", the bot will attempt to find the new URL of the same page that previously existed at the now broken link. More details regarding the techniques used are at https://webresearch.eecs.umich.edu/fable/

3. If the new URL for the linked page is found, the bot will replace the "permanent dead" link with the new URL. The new URL identified by the bot is expected to be wrong about 10% of the time (as per the statistics from https://en.wikipedia.org/wiki/User:FABLEBot/New_URLs_for_permanently_dead_external_links). So, as suggested in the discussion at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)/Archive_191#Request_for_comments_on_research_study, for every link that it replaces, the bot will leave a "verification needed" tag.
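The replace-and-tag step described above could look roughly like the following sketch (hypothetical: FABLEBot's source is not yet released, so the template matching and function shape here are assumptions, not the bot's actual code):

```python
import re

def replace_dead_link(wikitext, old_url, new_url):
    """Swap a permanently dead external link for a candidate new URL.

    Hypothetical sketch of step 3: the {{dead link}} tag following the
    old URL is dropped and {{Verification needed}} is appended, since
    the candidate URL is expected to be wrong roughly 10% of the time.
    """
    pattern = re.compile(
        re.escape(old_url) + r"(.*?)\{\{[Dd]ead link[^{}]*\}\}", re.DOTALL
    )

    def repl(match):
        return new_url + match.group(1) + "{{Verification needed}}"

    new_text, n = pattern.subn(repl, wikitext, count=1)
    return new_text, n == 1  # flag tells the caller whether an edit was made
```

In an actual run the bot would additionally fetch each article in the category through a framework such as Pywikibot and honor {{bots}}/{{nobots}} exclusion before saving, per the "exclusion compliant" declaration above.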

Discussion
Per the discussion with GreenC above, I would like to see a "dry run" with a log (which can be placed in a subpage depending on size) so that a more accurate assessment of the error rate can be determined for this task. Primefac (talk) 16:11, 6 August 2023 (UTC)
 * 10% is a very high error rate, and that category is pretty big—a lot of those tags won’t get looked at for a while, and citations with incorrect URLs are extremely damaging. I think this needs to be semi-automated. Really cool idea by the way. Snowmanonahoe (talk · contribs · typos) 02:06, 9 June 2023 (UTC)
 * Completely agree that having citations point to incorrect URLs is very bad, and we very much welcome any alternate proposals for how our bot should edit pages to include the new URLs it finds.
 * Having us manually vet every new URL that the bot finds is not going to scale. So, here's an alternative that we are considering: instead of *replacing* a permanently dead link with the new URL we find, what if we *augmented* the citation to include a link to the new URL? This is similar to how the InternetArchiveBot works, which adds a link to an archived copy for any particular link that it finds to be dead, while still leaving the original link in place. Thoughts, or proposals for alternatives?
 * The error rate of 10% is also the reason why we plan to focus specifically on permanently dead links, because for all of these links currently there is no way for a user to access the linked content: the original link does not work, and there is no archived copy. So, even though 10% of the new links will be wrong, fixing 90% of permanently dead citations is perhaps better than leaving 100% of them as broken? HarshaMadhyastha (talk) 17:16, 9 June 2023 (UTC)
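The augmentation alternative floated above could, under one hypothetical wikitext convention, be sketched like this (the "possible new location" label is invented for illustration; as noted later in the discussion, no such convention currently exists on Wikipedia):

```python
def augment_dead_link(wikitext, old_url, new_url):
    """Leave the dead link in place and append a clearly labelled guess.

    Mirrors the proposal above: rather than replacing the URL, add the
    candidate next to it so readers know it may be incorrect. The label
    text is a made-up convention used only for illustration.
    """
    start = wikitext.find("[" + old_url)
    if start == -1:
        return wikitext  # dead link not found; leave the page untouched
    end = wikitext.find("]", start)
    candidate = " [%s possible new location]" % new_url
    return wikitext[: end + 1] + candidate + wikitext[end + 1 :]
```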
 * The augmenting thing is not really the same, because archives are 100% deterministically proven to be the correct URL. We have both URLs not because of a chance the archive URL is wrong, but because it's helpful for readers to see what the original location of the website is, unless the website has been usurped (or in some cases, the site is still up, and an archive is just there).
 * Fixing 90% of permanently dead citations is great and all, but correct citations are the goal, not merely having citations. 10% of 178,345 is 17,835 citations that will point to the wrong webpage. If it points to the wrong live web page, which as far as I understand always happens, the result is a citation that doesn't really support what it's next to, confusing the reader and making the statement or possibly the whole article seem like nonsense, which is much worse than simply not having a citation, or having an inaccessible one.
 * My idea is to do something like IABot's OAuth functionality, where you can have it run on certain pages. Even better would be if you could incorporate this stuff into IABot's existing tool, but I understand if that isn't feasible. Snowmanonahoe (talk · contribs · typos) 17:35, 9 June 2023 (UTC)
 * "My idea is to do something like IABot's OAuth functionality, where you can have it run on certain pages."
 * I'm not very familiar with this aspect of IABot. Can you please elaborate? Perhaps what you are proposing is that we design our bot such that a human editor of any page can specifically request for our bot to run on that page and see if it finds any replacements for the permanently dead links on the page, and the editor can then manually verify our edits?
 * Also, to clarify my proposal regarding augmenting links, I am envisioning that the format we would use for the new link would make it explicit that it is a prediction for the new location of the linked page, so that the user is aware that this alternate link could potentially be incorrect. But, since there is no convention for such links currently on Wikipedia, I realize this would be a big change. HarshaMadhyastha (talk) 19:13, 9 June 2023 (UTC)
 * Snowmanonahoe (talk · contribs · typos) 19:17, 9 June 2023 (UTC)
 * Thank you for the pointer. We'll take a look. HarshaMadhyastha (talk) 19:44, 9 June 2023 (UTC)
 * Having reviewed how IABot's OAuth functionality works, we can certainly implement FABLEBot to work in a similar manner. But, before I recruit a student to work on this implementation, how can we get confirmation that a bot in this form is indeed what's desired and is likely to be approved for operation? Should I modify the bot's function details above? Or, should I submit a new request for approval? HarshaMadhyastha (talk) 21:57, 19 June 2023 (UTC)
 * @GreenC, you're my go-to person for archive-related stuff: does this seem like a reasonable alternate option to supplement the existing archive bots (who are the primary adders of the "permanently dead" tags)? Primefac (talk) 09:16, 14 June 2023 (UTC)
 * Hi Primefac, thanks for checking. I am familiar with the University of Michigan group. They have done some remarkable work in this area. We have corresponded in the past. I trust them to do good work. I agree a 10% error rate is too high for a fully automatic bot. However, it won't be "10% of 178,345 is 17,835 citations" because many of them won't have a new target, in fact most won't.
 * In addition to the ideas above, I might suggest doing a "dry run" on about 3,000 pages: instead of saving each page, log the results to see how many would have been modified and extrapolate what the absolute number of bad links would be. From that it might be possible to determine how reasonable it would be to manually check every new URL. Processes can be created that make manual checking easier, such as loading 50 pages into tabs, closing each tab that is right, using "bookmark all tabs" for the ones that are wrong, then saving the bookmarks to a file which can be used to create a DB of inaccurate links which the bot can access.
 * Another idea is to save the results to the talk page; IABot used to do this when it first started.
 * As for an OAuth application this page is a starting point. Green  C  13:30, 14 June 2023 (UTC)
 * @GreenC, thank you for the vote of confidence.
 * Sometime last year, we did an extensive analysis of a dataset of a few hundred thousand links which have no archived copies. We found that we were able to find the new URL for roughly 10% of these links.
 * So, if we consider that we found roughly 300K links tagged as permanently dead on enwiki last year, we'd expect that our bot would rewrite around 30,000 of these links.
 * When we first began thinking about this bot last year, we proposed starting off by posting the proposed URL replacements in talk pages. But, beyond doing this on a few hundred pages, the feedback was against us doing this. HarshaMadhyastha (talk) 19:50, 19 June 2023 (UTC)
 * It should be possible to design a tool or page that makes manual checking of links fast and easy. It doesn't edit Wikipedia; it only displays URLs and queries users about whether those URLs are good or not. Maybe it displays 5 URLs at a time, with a radio button next to each for Keep or Delete and a "Save" button at the bottom, at which point it loads 5 more URLs for checking. Once the data is saved, a separate bot can update Wikipedia in batches. My bot can do this: I recommend using my bot for this part because moving URLs is a lot more complex than it seems. There is a lot to it: archive URLs, templates like webarchive, bare URLs, named references which contain URLs, etc. I have developed the codebase for this from years of experience with this kind of work. All that would be required is three data points: the pagename, oldurl and newurl. User:HarshaMadhyastha, do you have anyone with, for example, Python, JS, or PHP skills who could build a tool like this on Toolforge? I can provide links on where to get started. -- Green  C  21:34, 20 June 2023 (UTC)
 * This sounds like a great plan, @GreenC. The data that you list -- page name, old URL, and new URL -- matches what we had on the page that we compiled last year for assessing FABLE's accuracy. So, we definitely are able to produce the identified URL replacements in this form.
 * Let me talk to my current students to see if any of them is up for implementing the tool that you describe. Even otherwise, I'm sure I'll be able to recruit a new student for this. So, if you can please provide pointers on the relevant documentation, that would be very helpful for the student to know how to get started. Thanks! HarshaMadhyastha (talk) 01:50, 21 June 2023 (UTC)
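The three data points agreed on above could be handed off in a simple tabular format. A minimal sketch (the field names and the "keep"/"delete" verdict values are assumptions for illustration, not a format either party has specified):

```python
import csv
import io

# Hypothetical hand-off format between the review tool and GreenC bot:
# the three agreed data points plus the reviewer's verdict.
FIELDS = ["pagename", "oldurl", "newurl", "verdict"]

def export_verified(records):
    """Return a CSV containing only the replacements a reviewer marked 'keep'."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for rec in records:
        if rec["verdict"] == "keep":
            writer.writerow(rec)
    return buf.getvalue()
```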
 * Hi, sorry for the late reply! The first step is to create a Toolforge account. If you are using Python, I found this tutorial to be a reliable and easy guide to creating a new tool. You won't need OAuth, so step 4 can be skipped. There are some other tutorials for other languages in the right column. Green  C  20:29, 3 July 2023 (UTC)
 * Thank you for the pointers, @GreenC! One of my students has started working on setting things up on Toolforge. I have asked him to follow up with you here if/when he has any questions. HarshaMadhyastha (talk) 22:15, 7 July 2023 (UTC)
 * Hi @GreenC! I'm the student assisting Harsha on this. One question I have is about how we'll end up storing the aliases we've found. On Toolforge, is there a MySQL database we can use? Anishnya (talk) 03:20, 13 July 2023 (UTC)
 * Hi Anishnya - When you sign up for a Toolforge account that should be included: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#User_databases -- Green  C  13:48, 13 July 2023 (UTC)
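One possible shape for the alias table discussed above, sketched below. Everything here is an assumption for illustration: the table and column names are invented, and on Toolforge this would live in ToolsDB (MariaDB), not SQLite; sqlite3 is used only so the sketch is self-contained and runnable.

```python
import sqlite3

# Hypothetical alias table: one row per (page, dead URL) pair, with the
# candidate replacement and the reviewer's verdict. On Toolforge this
# would be created in the user's ToolsDB database instead of SQLite.
SCHEMA = """
CREATE TABLE url_aliases (
    pagename TEXT NOT NULL,
    oldurl   TEXT NOT NULL,
    newurl   TEXT NOT NULL,
    verdict  TEXT DEFAULT 'unreviewed',
    PRIMARY KEY (pagename, oldurl)
)
"""

def open_alias_db(path=":memory:"):
    """Open (or create in memory) the alias database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn
```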
 * User:Primefac I don't think this is going to be a bot, rather a Tool that users interact with that saves data to a database. This data will then be used by GreenC bot to update Wikipedia, which my bot already has approval for. This BOTREQ as such might be redundant? We might need to create a separate project page somewhere to coordinate. -- Green  C  16:42, 6 August 2023 (UTC)
 * @GreenC's assessment is right. We are no longer developing a bot that will be editing pages. As per the discussion above, we are currently developing a dashboard which will enable users to provide feedback on our proposed URL replacements. We will log the ones deemed to be correct, and GreenC will then use his bot to make those edits. HarshaMadhyastha (talk) 21:51, 9 August 2023 (UTC)
 * May I suggest we create a project page for discussion and coordination? One idea is WP:Link rot/FABLE, or in user space like User:HarshaMadhyastha/FABLE. I like the former as it may eventually become a documentation page linked from WP:Link rot. --  Green  C  00:56, 10 August 2023 (UTC)
 * That sounds good to me. Thank you, @GreenC. Once we have completed a first version of our dashboard (which should be soon), I'll create the page at WP:Link rot/FABLE, post an update there, and tag you on it. HarshaMadhyastha (talk) 02:10, 16 August 2023 (UTC)
 * @HarshaMadhyastha, how's this going? — Qwerfjkl  talk  14:43, 11 October 2023 (UTC)
 * @GreenC @Qwerfjkl I apologize that this has taken so long. The student working on this graduated, and that delayed things.
 * In any case, we finally have an initial version of the page on Toolforge: https://fable.toolforge.org/. I have bootstrapped Link rot/FABLE with a link to that page. Looking forward to your feedback. HarshaMadhyastha (talk) 05:22, 17 October 2023 (UTC)
 * Hi Harsha, looking great. I left feedback at Wikipedia talk:Link rot/FABLE. the BRFA can be closed at this point thanks. --  Green  C  15:53, 17 October 2023 (UTC)
 * ✅, marked as "withdrawn" mainly for record keeping. Primefac (talk) 16:05, 17 October 2023 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard.