User:Visviva/Bash

I'm fairly new to Bash, but if these scripts are of any use to you please feel free to use & adapt them.

If you think you can improve anything on this page, please share your ideas either here or on the Talk page.

Uncat.sh
''I find that this only processes about 300,000 lines per hour on my desktop machine. It would therefore take about 400 hours to process the entire text of Wikipedia.''


#!/bin/bash


# This is a bash script for extracting pages from an EN Wikipedia XML dump.
# This script takes one argument, the name of the file it will process.
# If you know a way to make this script faster, please share.


# Make a special pipe for the file

exec 3< "$1"

in=0
cat=0


# Start

while read <&3 line; do

if [ "$in" -eq "1" ] then case $line in           *[[Category:* | *[[category:* | *REDIRECT* | *redirect* | *disambig* | *dis}}* | *CC}}* | *Disambig* |  *Redirect* )                in=0;;        esac   fi    title=""    title=" $(echo $line | grep ' ')"    if [ "$title" != " " ]    then       oldtitle=$PAGE_TITLE       title=$(echo $line | grep ' ' | sed -e s'@ \(.*\) @\1@ ')       export PAGE_TITLE=$title       if [ "$in" -eq "1" ]       then               echo "*[[$oldtitle]]"       fi       in=1       case $title in           *deletion* | *Deletion* )               in=0;;       esac    fi done
 * 1) Scan for categories
 * 1) Scan for title -- also tells us if the last page is over
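
To run it against a dump file, something like this should work (the dump and output filenames here are just placeholders; use whatever your local copies are called):

chmod +x Uncat.sh
./Uncat.sh enwiki-pages-articles.xml > uncategorized.txt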
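
One possible way to speed this up would be to do the whole scan in a single awk process, since the loop above spawns a grep and a sed for every line of the dump. The sketch below is untested against a real dump, so treat it as a starting point rather than a drop-in replacement; the patterns are copied from the loop above, and it assumes each <title> in the dump sits on its own line, with category links and redirect/disambiguation markers on their own lines.

awk '
    /<title>/ {
        # A new page begins: report the previous one if it never matched a category.
        if (in_page && !skip) print "*[[" title "]]"
        title = $0
        sub(/.*<title>/, "", title)
        sub(/<\/title>.*/, "", title)
        in_page = 1
        skip = (title ~ /[Dd]eletion/)
        next
    }
    in_page && /\[\[[Cc]ategory:|REDIRECT|[Rr]edirect|[Dd]isambig|dis}}|CC}}/ {
        skip = 1
    }
    END {
        # The last page has no following <title> line, so report it here.
        if (in_page && !skip) print "*[[" title "]]"
    }
' "$1"

Unlike the loop above, the END block also reports the very last page in the dump.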