User:GreenC/iabot


 * One archive URL for every citation on every wiki is limiting. For example, some people prefer a different archive provider, and some entire wikis prefer a different provider (France prefers Wikiwix, Portugal wants another).


 * Static snapshot date. Because IABot accepts only one archive URL per source URL, it also accepts only one snapshot date. That snapshot date may suit some citations but not others. Snapshot dates should reflect the access-date of the individual citation because of content drift, so each citation would in theory require a different archive URL with its own snapshot date (a per-citation lookup is sketched after the quote below). As one user reported on Meta in July 2021:
 * Your bot gets the archive dates completely wrong, it's understandable that there could be a few months difference but it even leaves archived pages that are a few years off..
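A minimal sketch of per-citation snapshot selection, using the public archive.org Availability API: pass a timestamp built from the citation's access-date so the snapshot returned is the one closest to when the editor actually read the page. The example URL and access-date are invented; the endpoint and response shape are the documented Availability API.

```python
import requests

def closest_snapshot(url: str, access_date: str) -> str | None:
    """Return the archive URL closest to access_date (YYYY-MM-DD, as in |access-date=)."""
    timestamp = access_date.replace("-", "")          # the API expects YYYYMMDD
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None

# Hypothetical citation: dead link first read in June 2014
print(closest_snapshot("http://example.com/page", "2014-06-01"))
```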


 * Databases out of sync creating link rot. The IABot database is an intermediary, like Memento. A cache requires periodic flushing to stay in sync with the master (see the DNS caching system). Flushing is not possible at the scale of 100 million links; it would overtax the Availability API. As a result, the IABot database drifts increasingly out of sync, causing link rot in the archive URLs themselves. A revalidation sketch follows below.
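A rough sketch of what revalidating stored archive records involves: at least one HTTP request per record, which illustrates why refreshing a 100-million-row table against the archive is impractical and why only small sampled batches are realistic. The `records` structure and the use of `requests.head` are assumptions for illustration, not IABot internals.

```python
import random
import requests

def revalidate_batch(records: list[tuple[str, str]], sample: int = 100) -> list[tuple[str, str]]:
    """Return sampled (original_url, archive_url) pairs whose archive URL no longer resolves."""
    stale = []
    for original, archive in random.sample(records, min(sample, len(records))):
        try:
            r = requests.head(archive, allow_redirects=True, timeout=30)
            if r.status_code != 200:
                stale.append((original, archive))
        except requests.RequestException:
            stale.append((original, archive))
    return stale
```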


 * Full-auto misses problems. When generating archive URLs automatically via the Availability API, problems can arise such as soft-404s in the archives, layers of redirects that end up nowhere, etc. A redirect-chain check is sketched below.
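A minimal sketch of a redirect-chain check on a candidate archive URL: follow the redirects and flag chains that are unusually long or that end somewhere other than a real snapshot of the requested page (e.g. bounced to a bare site root). The hop threshold and root-path test are illustrative heuristics, not what IABot does.

```python
import requests
from urllib.parse import urlparse

def suspicious_redirects(archive_url: str, max_hops: int = 3) -> bool:
    """Flag archive URLs whose redirect chain is long or ends at a bare site root."""
    r = requests.get(archive_url, allow_redirects=True, timeout=30)
    hops = len(r.history)                  # one entry per redirect followed
    final_path = urlparse(r.url).path
    return hops > max_hops or final_path in ("", "/")
```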


 * Dead link checker misses soft-404s. The dead link checker treats soft-404s as live links and never rescues them. Sometimes it is a few percent of the links in a domain, sometimes the entire domain or a substantial fraction; each domain is unique. A soft-404 heuristic is sketched below.
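A minimal sketch of a soft-404 heuristic: the server answers 200 OK, but the body looks like an error page. The phrase list and length threshold are illustrative guesses, not the checker's actual rules; as noted above, each domain behaves differently.

```python
import requests

ERROR_PHRASES = ("page not found", "404", "no longer available", "does not exist")

def looks_like_soft_404(url: str) -> bool:
    """True when a page returns 200 but its content resembles an error page."""
    r = requests.get(url, timeout=30)
    if r.status_code != 200:
        return False                       # a hard error, not a soft-404
    body = r.text.lower()
    too_short = len(body) < 512            # stub error pages are often near-empty
    return too_short or any(p in body for p in ERROR_PHRASES)
```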


 * Unable to unwind archives. Sometimes a domain goes dead and archives are added; later the domain comes back to life, but there is no way to undo the archives.


 * Community support. The number of active users fixing links in the IABot interface is smaller than it needs to be. Users expect the bot to just work; the bot expects users to help more than they do.


 * Global live state of domains. The global live state of a domain can be toggled through internal automated mechanisms and by users. Experience has shown this leads to problems, with entire domains set to Whitelist or Subscription. Some of this was due to a Unix file-handle problem that has since been fixed, but problems still arise through other mechanisms. The feature is needed in some cases and overused in others. About 50% of all links are Whitelist or Subscription, i.e. skipped by the bot.


 * Logging is minimal and inaccessible. There should be logging of every action taken by IABot, in every article on every wiki, detailed enough that changes could be undone or recreated from the information in the logs. The logs should be publicly accessible in plain-text format, like a web server log, syslog, or any Unix program's log (see /var/log for examples), rolled daily and split into multiple files as needed. A sketch of such logging follows below.
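A minimal sketch of the kind of logging described above: one plain-text line per action, rolled daily, readable like any /var/log file. The field layout (wiki, page, action, old value, new value) is an assumption about what would be needed to undo or recreate an edit, not IABot's actual log format.

```python
import logging
from logging.handlers import TimedRotatingFileHandler

# Daily-rolled plain-text log, keeping 90 days of history
handler = TimedRotatingFileHandler("iabot-actions.log", when="midnight", backupCount=90)
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
log = logging.getLogger("iabot.actions")
log.setLevel(logging.INFO)
log.addHandler(handler)

# wiki | page | action | old value | new value -- enough detail to revert the edit later
log.info("enwiki | Example_article | add-archive | http://example.com/page | "
         "https://web.archive.org/web/20140601000000/http://example.com/page")
```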


 * Linkrot can be solved with URL moves. Frequently a website changes how its URLs are structured: the old URL stops working but the page is still live at a new URL. These moves can be complex and varied and require custom coding, but there is no way to send the moved URL to IABot or for IABot to save a URL through a move. A move-rule sketch follows below.
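A minimal sketch of a URL-move rule: a regex that rewrites the dead URL pattern into the site's new structure, followed by a liveness check on the result. The example patterns and site are invented; real moves need per-site custom rules of this kind.

```python
import re
import requests

MOVE_RULES = [
    # old structure -> new structure (hypothetical site that moved /news/ to /articles/)
    (re.compile(r"^http://example\.com/news/(\d+)\.html$"),
     r"https://example.com/articles/\1"),
]

def follow_move(url: str) -> str | None:
    """Return the moved URL if a rule matches and the new target is live, else None."""
    for pattern, replacement in MOVE_RULES:
        if pattern.match(url):
            new_url = pattern.sub(replacement, url)
            if requests.head(new_url, allow_redirects=True, timeout=30).status_code == 200:
                return new_url
    return None
```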


 * IABot cannot toggle url-status to usurped, which is critical for hijacked domains.


 * Unable to act on domains excluded from the Wayback Machine. For example, the bot adds a working archive URL for domain.com, and later domain.com is excluded from the Wayback Machine at the request of its owners; the now-broken archive links remain on Wikipedia.