User:GreenC/BotWikiAwk

BotWikiAwk is a framework and libraries for creating and running bots on Wikipedia.

Features

 * Bot management tools compatible with bots written in any language
 * Libraries for bots written in awk
 * Non-SQL. Data files in plain-text
 * Manage batches of articles of any size, 50 for WP:BRFA or 50k+ for production runs
 * Runs using GNU parallel making full use of multi-core CPUs
 * ... or runs on the Toolforge grid across 40+ distributed computers
 * Dry-run mode: diffs can be checked before uploading
 * Inline colorized diffs on the command-line
 * Re-run individual pages via a cached copy of the page (download wikisource once, run the bot many times)
 * Installs in a single directory, easily removed
 * Includes complete example bots and skeleton bots
 * Includes a general awk library developed over years of writing bots
 * Includes a command-line interface to the MediaWiki API
 * In development and private use since 2016. Public since June 2018

Overview
BotWikiAwk contains two elements:
 * A library of routines for writing bots in awk
 * An integrated set of tools for running and managing bots written in any language

Why awk? Awk is a small, elegant language whose interpreter is a single binary file. It is a POSIX tool installed on most Unix computers. The language syntax is simple and forgiving. It is usually associated with one-line scripts, but since about 2012 the GNU version has become more powerful. Awk is not a general-purpose language; it is primarily a text-processing language, which is exactly what bots do. Areas that awk cannot support (e.g. networking) are handled by external programs.
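As a minimal sketch of how awk hands work to an external program, the snippet below reads a command's output through a pipe with getline. The function name run_external is hypothetical and is not part of BotWikiAwk; echo stands in for a real network tool such as wget so the example runs offline.

```shell
# Hypothetical helper, not part of BotWikiAwk: run an external command
# from awk and number each line it prints.
run_external() {
    awk -v cmd="$1" 'BEGIN {
        # Read every line the external command writes to its stdout.
        while ((cmd | getline line) > 0) {
            n++
            print n ": " line
        }
        close(cmd)   # close the pipe so the same command could be run again
    }'
}

# Demonstration: echo plays the role of the external program.
run_external 'echo alpha; echo beta'
```

The same pattern would apply to any command-line program whose output awk should process as text.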

BotWikiAwk is batch oriented. After creating a master list of articles, it carves out batches, each assigned a unique name called a project ID. Each utility takes as input the project ID and the action to take for that project. A project can be any size, up to the full master list, i.e. a single project.

Requirements

 * A Wikipedia account with bot flag permissions
 * GNU awk (version 4.1+)
 * GNU wget (version 1.13+)
 * GNU parallel (sudo apt-get install parallel) - not required on Toolforge
 * openssl for login authentication (if writing to pages)
 * wdiff (sudo apt-get install wdiff) - small utility for inline diffs
 * GNU tac (sudo apt-get install tac) - small utility reverse cat
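A quick way to see which of the tools listed above are missing is to look each one up on the PATH. This is only a sketch: check_tools is a hypothetical helper, the binary name gawk is assumed for GNU awk, and no version numbers are checked.

```shell
# Hypothetical helper, not part of BotWikiAwk: print the names of any
# requested tools that cannot be found on the PATH (empty line if none).
check_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    printf '%s\n' "${missing# }"
}

# Tools named in the requirements list; "gawk" is assumed to be the
# installed name of GNU awk.
check_tools gawk wget parallel openssl wdiff tac
```

An empty line of output means every tool was found; otherwise the missing names are printed separated by spaces.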

Setup
If installing on Toolforge see special instructions.


 * Download (zip) or Clone the project:
 * Create an AWKPATH environment in .bash_profile eg.
 * If on Toolforge see special instructions


 * Add BotWikiAwk to the PATH eg.


 * Log out and back in so environment vars are set.
 * cd to ~/BotWikiAwk and run
 * Edit
 * Change #1) StopButton URL
 * Change #2) UserPage URL


 * Read the SETUP file for additional instructions
 * For Wikipedia edit authorization: add your OAuth key/secrets to bin/wikiget.awk -- see EDITSETUP

New bot
To create a new bot:

The path should point to a new directory that has not been created yet, with "botname" being the name of your bot (no spaces recommended). The path can be anywhere, but if it differs from the default directory, also update section #3 following the "mybot" example.

I find locating the bot outside the ~/BotWikiAwk directories makes it easier to upgrade BotWikiAwk later. One can simply delete everything and re-clone it (saving only the original botwiki.awk file).

It will prompt for the type of bot skeleton. If the bot will be doing operations on CS1|2 templates, choose #2.

Writing bot
See ~/BotWikiAwk/example-bots

Running bot
In summary, the process works by running four utilities:


 * wikiget downloads a list of page titles the bot will operate on, e.g. 10k page titles from a category
 * project creates a new project (or batch) to process, e.g. the first 50 pages
 * runbot executes the bot in dry-run mode on a given project
 * bug shows diffs for individual pages, to see what changes the bot made
 * bug also re-runs the bot for individual pages
 * when satisfied the bot is running well, run runbot again in live mode to upload changes. Repeat with larger project sizes until done.

The utility programs (wikiget, project, runbot and bug) have many options, listed with -h.

Example bot
The easiest way to demonstrate BotWikiAwk is by running a real bot.

0. Create the bot using an existing example, accdate, a bot for removing access-date in CS1|2 templates.


 * Make the bot:
 * Copy in the pre-written example bot:
 * cd to the bot directory


 * All utilities work only from within the bot's home directory, with the exception of wikiget, which can run anywhere.

A. Make a master list of pages to process, called an "auth" file. Here the list comes from a category, using the "-c" option.




 * The file ends in .auth (required) and is located in the bot's meta subdirectory.
 * In this case '20181102' is today's date but it can be any identifying string of numbers or letters.
 * The "accdate" portion of the filename can also be anything, though it's helpful to use the bot name.


 * Manually edit meta/accdate20181102.auth to remove unwanted pages, e.g. those in the "Template:" or "Wikipedia:" namespaces.
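For a long list, the same cleanup could be scripted rather than done by hand. The sketch below assumes the auth file holds one page title per line; filter_auth is a hypothetical helper, not a BotWikiAwk utility.

```shell
# Hypothetical helper, not part of BotWikiAwk: print every title except
# those in the Template: and Wikipedia: namespaces.
# Assumes one page title per line.
filter_auth() {
    awk '!/^(Template|Wikipedia):/' "$1"
}

# Demonstration on a throwaway file so a real auth file is left untouched.
tmp=$(mktemp)
printf '%s\n' 'Albert Einstein' 'Template:Cite web' 'Wikipedia:Sandbox' 'Mars' > "$tmp"
filter_auth "$tmp"
rm -f "$tmp"
```

To apply it for real, one would write the output to a new file, review it, and then move it over meta/accdate20181102.auth.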

B. Create (-c) a batch (called a 'project') of 50 articles to process




 * The project ID (-p) is composed of the name created in Step A (accdate20181102), followed by a ".", followed by a pair of numbers (00001-00050) meaning lines 1 through 50 of the file meta/accdate20181102.auth, i.e. the first 50 articles to process.


 * The project ID is referenced by every utility to identify which project is being worked on.
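The naming convention can be illustrated with a short sketch that splits a project ID into its name and line range and prints the matching lines of the auth file. This only mirrors the convention described above; project_titles is a hypothetical helper and is not how the project utility itself is implemented.

```shell
# Hypothetical helper, not part of BotWikiAwk: list the titles that a
# project ID such as accdate20181102.00001-00050 refers to.
project_titles() {
    id=$1                  # project ID: <name>.<first>-<last>
    metadir=$2             # directory holding <name>.auth
    name=${id%.*}          # text before the last "."
    range=${id##*.}        # text after the last ".", e.g. 00001-00050
    first=${range%-*}
    last=${range#*-}
    # "+ 0" forces a numeric comparison, so leading zeros are harmless.
    awk -v first="$first" -v last="$last" \
        'NR >= first + 0 && NR <= last + 0' "$metadir/$name.auth"
}

# Demonstration with a four-line throwaway auth file.
demo=$(mktemp -d)
printf '%s\n' 'Page A' 'Page B' 'Page C' 'Page D' > "$demo/demo.auth"
project_titles demo.00002-00003 "$demo"
rm -rf "$demo"
```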

C. Run the bot in dry-run mode



D. Look at resulting local diffs


 * Find which pages the bot modified as recorded in the "discovered" file in the meta directory




 * For each, visually check the diff with bug -dc




 * The bot can be re-run for individual pages




 * Further information is available with -v, which shows the location of the data directory



E. Push changes to Wikipedia


 * If the project was previously run in dry-run mode, first delete it and recreate it




 * Then run in live mode (CAUTION: don't do this for the demonstration)




 * If the project has never been created before, just create it and run



F. Repeat


 * Repeat steps B->F, increasing the size of the batch and using "bug -dc" to spot-check diffs until confidence is high. Once confidence is high, only the last part of step E is required. As can be seen, each project run is a 2-step process: create the project, defining its size, then run the bot on the project.
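When planning a series of growing batches, the consecutive project IDs can be worked out ahead of time. The sketch below assumes each batch starts on the line after the previous one ended and that line numbers are zero-padded to five digits, as in the example ID above; next_project_ids is a hypothetical helper, not a BotWikiAwk utility.

```shell
# Hypothetical helper, not part of BotWikiAwk: given a name and a list of
# batch sizes, print consecutive project IDs covering those sizes.
# Assumes five-digit zero padding, as in accdate20181102.00001-00050.
next_project_ids() {
    name=$1
    shift
    awk -v name="$name" 'BEGIN {
        start = 1
        for (i = 1; i < ARGC; i++) {
            size = ARGV[i] + 0
            printf "%s.%05d-%05d\n", name, start, start + size - 1
            start += size
        }
    }' "$@"
}

# Demonstration: a 50-page trial followed by two larger batches.
next_project_ids accdate20181102 50 500 5000
```

Each printed ID would then be passed to the project and runbot utilities in turn.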