User:GreenC/software/search wikipedia

Method to accurately search Wikipedia
Find all articles which contain the string "sportsillustrated.cnn.com" AND a template AND ... Complicated Wikipedia searches like this become trivial once you download the Wikipedia database (dumps.wikimedia.org) and search it with whatever tool you prefer. Here are two plug-and-play solutions.

Awk
Awk is probably the simplest language available, though it is slower because it lacks a real XML parser. Nevertheless, no additional software is required (awk is a POSIX tool).


 * To run: awk -f search-wp.awk > out
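The kind of "string AND template" search described above can be sketched as follows. This is a minimal illustration, not the search-wp.awk script itself: the function name search_dump, its three arguments, and the assumption that the dump is an uncompressed MediaWiki XML export with <title> and <text> elements starting on their own lines are all choices made for this example.

```shell
# search_dump PAT1 PAT2 DUMPFILE   (hypothetical helper, not search-wp.awk)
# Prints the title of every page whose text matches both awk regexes.
search_dump() {
    awk -v pat1="$1" -v pat2="$2" '
        # A <title> line starts a new page: store the title, reset all flags.
        /<title>/ {
            title = $0
            sub(/^.*<title>/, "", title)
            sub(/<\/title>.*$/, "", title)
            intext = 0; hit1 = 0; hit2 = 0; printed = 0
        }
        # Wikitext begins on the line holding the opening <text ...> tag.
        /<text[ >]/ { intext = 1 }
        # Inside the wikitext, record which patterns have been seen so far
        # and print the title once, as soon as both have matched.
        intext && !printed {
            if ($0 ~ pat1) hit1 = 1
            if ($0 ~ pat2) hit2 = 1
            if (hit1 && hit2) { print title; printed = 1 }
        }
        # The closing tag ends the wikitext (it may share a line with the opening tag).
        /<\/text>/ { intext = 0 }
    ' "$3"
}
```

For example, assuming the decompressed dump is saved as enwiki-latest-pages-articles.xml: search_dump 'sportsillustrated[.]cnn[.]com' '[{][{]cite web' enwiki-latest-pages-articles.xml > out. Titles are printed as they appear in the XML, so entities such as &amp;amp; are not decoded. Because the patterns are passed with -v, awk processes backslash escapes in them, so bracket expressions like [.] are the safer way to match a literal character.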



Note: when redirecting large output, send it to a different disk (a ramdisk or another physical volume); otherwise writing the output can slow down reading of the XML file.

Nim
For a faster solution, here is a Nim example. Nim compiles to optimized C code, which gcc then compiles to an executable binary. In a test, the same search took 3m31s in Awk and 0m43s in Nim. The code below is pretty much copy, paste, compile and run: just add your search pattern, either a Perl-compatible regex or plain text. Example regex strings:
 * mySearchRe = re"djvu[.]txt"
 * mySearchRe = re"http[:][^ ]*[^ ]"
 * (the regex string is wrapped by re"" )
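Before starting a run over the full dump, it can help to try a pattern on a sample line. The sketch below uses grep -E, which is a different regex engine (POSIX extended) from the one used by Nim's re"" strings; the two example patterns above only use syntax common to both, but more advanced Perl-compatible features may not carry over. The helper name try_pattern and the sample strings are made up for this illustration.

```shell
# try_pattern PATTERN TEXT: print each part of TEXT that matches PATTERN.
# Hypothetical helper; grep -o is supported by GNU and BSD grep but is not in POSIX.
try_pattern() {
    printf '%s\n' "$2" | grep -oE "$1"
}

try_pattern 'djvu[.]txt' 'archive.org/stream/foo/foo_djvu.txt'
# prints: djvu.txt
try_pattern 'http[:][^ ]*[^ ]' 'see http://example.org/page for details'
# prints: http://example.org/page
```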

Then download the Nim compiler (the choosenim method is easiest) and compile the source.



Note: when redirecting large output, send it to a different disk (a ramdisk or another physical volume); otherwise writing the output can slow down reading of the XML file.