User:MZMcBride/Dumps

A few quick notes about dealing with database dumps.

The dumps are available in a compressed format (bz2, generally). There's almost never any reason to decompress these dumps and doing so is an enormous waste of space. Keep them compressed.

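If you need to read the files, use bzcat and head. For example, piping bzcat's output for a dump through head shows the first few lines without writing a decompressed copy to disk.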

If you need to iterate through the files in a programming language (e.g. Python), never try to read a whole file into memory. Instead, the easiest solution is generally to stream the file from bzcat into your script and read it from sys.stdin. For example, pipe bzcat's output for the dump into a script named scan-dump.py.

The contents of scan-dump.py could look something like this:
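The original listing is not reproduced here, so what follows is only a minimal sketch of the streaming approach. It assumes a MediaWiki XML dump in which each page title sits on its own <title> line. The names TITLE_RE and iter_titles, and the choice to print titles, are illustrative assumptions rather than anything the dumps require.

```python
#!/usr/bin/env python3
"""scan-dump.py: read a decompressed dump from stdin, one line at a time."""
import re
import sys

# Assumption: each <title>...</title> element fits on a single line,
# which is how MediaWiki XML dumps are normally laid out.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def iter_titles(lines):
    """Yield page titles from an iterable of text lines.

    Only the current line is held in memory. XML entities such as
    &amp; are left escaped in the yielded titles.
    """
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            yield match.group(1)


if __name__ == "__main__":
    # Iterating over sys.stdin is lazy, so memory use stays flat
    # no matter how large the dump is.
    for title in iter_titles(sys.stdin):
        print(title)
```

Run at the receiving end of a bzcat pipe, this prints one title per line. Because the loop only ever sees one line at a time, the same pattern works for any per-line check; swap the regular expression and the print call for whatever you are scanning for.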