User:HBC Archive Indexerbot/bones

This is a potential bot operated by User:HighInBC. HighInBCBot 00:39, 9 December 2006 (UTC)

Confirmed. I am the maintainer of this bot. HighInBC (Need help? Ask me) 00:40, 9 December 2006 (UTC)

Source code in progress.

archive indexing wiki-bot idea
Idea for archive indexing wiki-bot:

Will go through all of my archives, take all of the section headings, and index them on one page.

The idea is that somebody looking for a specific section from an old link (one pointing to a section no longer on my user page) will see a notice saying

If you are looking for a section that has already been archived, you can look for it in the index of archived sections

This index will have every section title linked to its location in its archive. If the same section title appears in more than one archive then it will be given a link for each archive.

Example

 * Atomic power - Archive 1
 * Irish - Archive 2
 * RfA thanks - Archive 1 (2 sections with this name in Archive 1)
 * RfA thanks - Archive 2
 * thanks - Archive 3
 * Welcome! - Archive 1
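
As a rough sketch, the heading extraction and duplicate counting could look something like this in Perl. It assumes the archive wikitext has already been fetched into memory; the page names are placeholders, and it ignores the anchor disambiguation MediaWiki applies when two sections share a name:

 use strict;
 use warnings;
 
 # Archive page name => its wikitext (assumed already fetched).
 my %archives = (
     'User talk:Example/Archive 1' => "== Atomic power ==\ntext\n== RfA thanks ==\ntext\n== RfA thanks ==\ntext\n",
     'User talk:Example/Archive 2' => "== Irish ==\ntext\n== RfA thanks ==\ntext\n",
 );
 
 # Count how many times each heading appears in each archive.
 my %index;    # heading => { archive page => count }
 for my $archive (sort keys %archives) {
     while ($archives{$archive} =~ /^==\s*([^=].*?)\s*==\s*$/mg) {
         $index{$1}{$archive}++;
     }
 }
 
 # Emit one index line per heading/archive pair, noting duplicates.
 for my $heading (sort { lc $a cmp lc $b } keys %index) {
     for my $archive (sort keys %{ $index{$heading} }) {
         my $count = $index{$heading}{$archive};
         my $note  = $count > 1 ? " ($count sections with this name)" : '';
         print "* [[$archive#$heading|$heading]] - [[$archive]]$note\n";
     }
 }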

The bot can also do the same for other users. Parameters would include a mask for the archives in printf format and the target location for the result of the indexing. The user can add themselves to a category and define their parameters in HTML comments like so:
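For instance (the exact parameter names are not settled, so treat these as placeholders), a user could put something like this on the target page:

 <!-- HBC Archive Indexerbot parameters (names are placeholders) -->
 <!-- mask=User talk:Example/Archive %d -->
 <!-- target=User talk:Example/Archive index -->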

Details
This will be written in my language of choice, Perl, using the Perl module MediaWiki. During the creation and editing process it will be run manually, one run at a time. It will read my archives and write to a page dedicated to testing. Once finished I will port it to a Linux box where it can run as a cron task once a day. Porting should be simple, as Perl is very portable and the module is too.
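The daily run would just be a crontab entry along these lines (the script path is hypothetical):

 # Run the archive indexer every day at 04:00 UTC (path is a placeholder).
 0 4 * * * perl /home/highinbc/bots/archive_indexer.pl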

For each user benefiting from this bot, it will read once for each archive page and write once to the index page. I consider this a negligible use of bandwidth for the service provided.

Method of reading
This is a draft plan of action and may change before implementation.


 * 1) First pass with a new set of archives
 * 2) Takes the mask, and uses Special:Export to grab the first 50 matches (sketched after this list)
 * 3) If all 50 are returned then get another 50
 * 4) If fewer than 50 are returned, remember the last one
 * 5) Add each archive to the bot's watchlist
 * 6) Caches the data
 * 7) Future passes
 * 8) Looks at the watchlist
 * 9) Using Special:Export it grabs any that have changed, as well as the next 5 after the last one
 * 10) If all 5 are returned, get 5 more
 * 11) If fewer than 5 are returned, remember the last one
 * 12) Newly found archives are added to the watchlist
 * 13) Caches the new data, overwriting the old cache if needed
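
A sketch of the first-pass batching using plain HTTP against Special:Export (this uses LWP::UserAgent rather than the MediaWiki module, the mask and batch size are as described above, and the XML handling is deliberately simplified):

 use strict;
 use warnings;
 use LWP::UserAgent;
 
 my $ua   = LWP::UserAgent->new(agent => 'HBC Archive Indexerbot (draft)');
 my $mask = 'User talk:Example/Archive %d';   # printf-style mask from the user's parameters
 
 my $batch = 50;
 my $next  = 1;
 my @archives;    # titles that actually exist, in order
 
 while (1) {
     # Build the next batch of candidate titles from the mask.
     my @titles = map { sprintf $mask, $_ } $next .. $next + $batch - 1;
 
     # Special:Export silently drops titles that do not exist, so asking
     # for pages past the last archive is harmless.  curonly asks for the
     # current revision only.
     my $resp = $ua->post(
         'http://en.wikipedia.org/wiki/Special:Export',
         { pages => join("\n", @titles), curonly => 1 },
     );
     die 'Export failed: ' . $resp->status_line unless $resp->is_success;
 
     # See which of the requested pages actually came back (a real run
     # would parse the XML properly and keep the page text for indexing).
     my @found = grep { index($resp->content, "<title>$_</title>") >= 0 } @titles;
     push @archives, @found;
 
     last if @found < $batch;    # a short batch means we hit the last archive
     $next += $batch;
 }
 
 print scalar(@archives), " archive pages found\n";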

This system will cache as much as possible, and will limit reading to single connections in cases where there are fewer than 50 archives. In future passes it will read all the relevant pages in one pass, and will not read pages that have not changed.

While reading 50 at a time may seem like a lot, my reasoning is that the SQL queries the MediaWiki software performs locally are cheap, while the cost of making repeated HTTP requests is relatively high. This large grab will only happen once per set of archives, when the bot forms its starting cache. Also, when requesting a page that is not there, the export command simply ignores that request and returns the pages that do exist.
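
One way the "cache as much as possible" part could work, using the core Storable module; the cache layout and the "touched" timestamp field are assumptions, not settled design:

 use strict;
 use warnings;
 use Storable qw(retrieve store);
 
 my $cache_file = 'archive_cache.stor';    # local cache location (placeholder)
 
 # Cache shape (assumed): { 'Archive title' => { touched => '...', headings => [...] } }
 my $cache = -e $cache_file ? retrieve($cache_file) : {};
 
 # A page only needs re-reading if its last-edit timestamp differs from
 # the one recorded when it was last cached.
 sub needs_refresh {
     my ($title, $touched) = @_;
     return 1 unless exists $cache->{$title};
     return $cache->{$title}{touched} ne $touched;
 }
 
 # After a successful pass, write the cache back to disk, overwriting the old one.
 sub save_cache {
     store($cache, $cache_file);
 }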

I am not sure of the best way to add large numbers of pages to your watchlist; advice would be appreciated. HighInBC (Need help? Ask me) 16:44, 9 December 2006 (UTC)