User:Cmdrjameson/utf8tohtml.py

A small script that blindly reads each line as UTF-8 (or failing that, Windows Codepage 1252) and converts any non-ASCII characters to HTML entities. The input defaults to stdin. The user may specify one or more filenames as command line arguments.

Requires Python 2.3 or better.


 * 1) !/usr/bin/env python

from htmlentitydefs import codepoint2name import fileinput, re

reNonASCII = re.compile(u'[\u0080-\uffff]', re.UNICODE)

def replaceNonASCII(match): '''Replace a unicode character with a named XHTML entity if possible, and a decimal entity otherwise. Note, we do not concern ourselves with escaping 'unsafe' characters such as &, we assume the input text is already properly escaped.''' c = ord(match.group) try: return '&%s;' % codepoint2name[c] except KeyError: return '&#%d;' % c

if __name__ == '__main__': for l in fileinput.input: try: l = l.decode('utf-8') except UnicodeDecodeError: l = l.decode('windows-1252') print reNonASCII.sub(replaceNonASCII, l),