User:Msm/extract.pl

info
extract.pl is a tool for extracting a single namespace from a Wikipedia SQL dump.

I modified the script from wp:Database_download. It is now generic and easier to use.

usage
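    usage ./extract.pl namespace_nr [-p prefix] < infile > outfile
    extracts wikipedia namespace from database dump
        namespace_nr - namespace number
        prefix - database tables prefix, for table names with prefix

The script reads the dump from standard input and writes the extracted statements to standard output.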
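The script itself checks its arguments by position, so -p has to come directly after the namespace number. As an alternative, the same interface could be parsed with the core Getopt::Long module. This is only a sketch, not what extract.pl does: the helper name parse_args and its error messages are made up for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Let options and plain arguments appear in any order.
Getopt::Long::Configure('permute');

# Hypothetical helper: returns (namespace number, table name) or dies.
sub parse_args {
    my @args   = @_;
    my $prefix = '';
    GetOptionsFromArray(\@args, 'p=s' => \$prefix)
        or die "bad options, see -h\n";
    my $namespace = shift @args;
    die "first parameter must be namespace number\n"
        unless defined $namespace && $namespace =~ /^\d+$/;
    die "bad prefix\n" unless $prefix =~ /^\w*$/;
    return ($namespace, $prefix . 'cur');
}

my ($ns, $table) = parse_args('12', '-p', 'mw_');
print "namespace $ns, table $table\n";
```

With the arguments from the second example on this page, the sketch selects namespace 12 and the table name mw_cur.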

examples
extracting namespace 4

    bzip2 -dc _cur_table.sql.bz2 | ./extract.pl 4 > help_4.sql

extracting namespace 12, for use with a configuration with the prefix mw_ (on 6.2.2005 for use with MediaWiki betas)

    bzip2 -dc _cur_table.sql.bz2 | ./extract.pl 12 -p mw_ > help_4.sql

version history

 * v0.1 - initial version
 * v0.2 - fixed a typo that caused only namespace 4 to be extracted (thanks 216.123.160.18)

extract.pl
    #!/usr/bin/perl
    #
    # v0.2
    #
    # modified script from http://en.wikipedia.org/wiki/Wikipedia:Database_download#Importing_sections_of_a_dump
    # http://en.wikipedia.org/wiki/User:Msm/extract.pl
    #
    # history:
    # v0.2 - fixed a typo that caused only namespace 4 to be extracted
    #

    $table = 'cur';

    # -h prints the usage text
    if ($ARGV[0] eq '-h') {
        print "usage $0 namespace_nr [-p prefix] < infile > outfile\n";
        print "extracts wikipedia namespace from database dump\n";
        print "\tnamespace_nr - namespace number\n";
        print "\tprefix - database tables prefix, for table names with prefix\n";
        exit;
    }

    # the first parameter must consist of digits only
    if (not $ARGV[0] =~ /^\d+$/) {
        print "first parameter must be namespace number, see $0 -h\n";
        exit;
    }
    $namespace = $ARGV[0];

    # optional table name prefix
    if ($ARGV[1] eq '-p') {
        $prefix = $ARGV[2];
        if (not $prefix =~ /^\w+$/) {
            print "bad prefix, see $0 -h\n";
            exit;
        }
        $table = $prefix . $table;
    }

    # read the dump from standard input; the arguments stay in @ARGV,
    # so a bare <> would try to open them as files
    while (<STDIN>) {
        s/^INSERT INTO cur VALUES //gi;
        s/\n// if (($j++ % 2) == 0);
        # put every record on its own line
        s/(\'\d+\',\'\d+\'\)),(\(\d+,\d+,)/$1\;\n$2/gs;
        foreach (split /\n/) {
            # keep only records of the requested namespace
            next unless (/^\(\d+,$namespace,\'/);
            # drop cur_id and turn the record into a complete INSERT statement
            s/^\(\d+,\d+,/INSERT INTO $table \(cur_namespace,cur_title,cur_text,cur_comment,cur_user,
                cur_user_text,cur_timestamp,cur_restrictions,cur_counter,cur_is_redirect,cur_minor_edit,
                cur_is_new,cur_random,cur_touched,inverse_timestamp\) VALUES \($namespace,/;
            # join the column list, which is wrapped above for readability
            s/\n\s+//g;
            s/$/\n/;
            print;
        }
    }
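The splitting and filtering steps of the script can be tried in isolation on a small sample. The sketch below is an illustration under assumptions. The helper name extract_rows is made up, and the sample rows are shortened to five fields, while real cur rows have more columns. It only relies on the same pattern as the script: a row ends with two quoted numbers and the next row starts with two bare numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: split one dump line into rows and keep those
# whose second field equals $namespace.
sub extract_rows {
    my ($line, $namespace) = @_;
    $line =~ s/^INSERT INTO cur VALUES //i;   # drop the statement head
    $line =~ s/;\s*$//;                       # drop the trailing semicolon
    # A row ends with two quoted numbers and the next one starts with
    # two bare numbers, so the comma between them is the split point.
    $line =~ s/('\d+','\d+'\)),(\(\d+,\d+,)/$1\n$2/g;
    return grep { /^\(\d+,\Q$namespace\E,'/ } split /\n/, $line;
}

# Shortened sample rows: (id, namespace, 'title', 'touched', 'inverse_timestamp')
my $sample = q{INSERT INTO cur VALUES (1,4,'About','20050206000000','79949793999999'),}
           . q{(2,12,'Contents','20050206000000','79949793999999'),}
           . q{(3,4,'Copyrights','20050206000000','79949793999999');};

print "$_\n" for extract_rows($sample, 4);
```

With this sample only the two namespace 4 rows are printed. The comma after the namespace in the filter pattern keeps namespace 1 from matching rows of namespace 12.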