User:Aarchiba/SVG sanitizer

Apparently Wikipedia does not host SVG files for fear that they will contain trojans. It is certainly true that SVG can contain JavaScript, and if an SVG viewer runs that JavaScript in a trusted environment, it could indeed be a security hole. If that is the problem, the simplest solution is to rip the scripts right out. Here's a script to remove all script tags and their contents. As far as I can tell from the SVG standard, other tags are not executed and so can remain.

This program reads its standard input, parses it as XML, removes any script tags and anything beneath them in the DOM tree, as well as any event attributes, and then writes an equivalent XML file to its standard output. It does not validate against the DTD; malformed XML simply causes the program to throw an exception and exit without producing any output. The XML is written in whatever character encoding is specified by the XML itself; this could easily be changed to force UTF-8. The program returns a nonzero exit status if any scripts were detected.
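The steps described above can be sketched roughly as follows. This is not the script itself, only a minimal illustration written for Python 3 with the standard xml.dom.minidom module. The names sanitize and main are hypothetical, and two things are assumptions of the sketch: an "event attribute" is taken to be any attribute of an SVG element whose name starts with "on", and removing such an attribute counts as detecting a script for the purposes of the exit status.

```python
import sys
from xml.dom import minidom

SVG_NS = "http://www.w3.org/2000/svg"


def sanitize(xml_bytes):
    """Strip SVG scripts from a document; return (clean_bytes, found_script)."""
    # parseString raises xml.parsers.expat.ExpatError on malformed XML,
    # so nothing is written for a broken file.
    doc = minidom.parseString(xml_bytes)
    found = False

    # Remove each SVG <script> element together with everything beneath it.
    for script in doc.getElementsByTagNameNS(SVG_NS, "script"):
        script.parentNode.removeChild(script)
        found = True

    # Assumption: an event attribute is any attribute of an SVG element
    # whose name starts with "on" (onload, onclick, ...).
    for elem in doc.getElementsByTagName("*"):
        if elem.namespaceURI != SVG_NS:
            continue
        for name in list(elem.attributes.keys()):
            if name.lower().startswith("on"):
                elem.removeAttribute(name)
                found = True

    # Keep the encoding declared in the XML itself; fall back to UTF-8.
    return doc.toxml(encoding=doc.encoding or "utf-8"), found


def main():
    """Filter standard input to standard output; return the exit status."""
    clean, found = sanitize(sys.stdin.buffer.read())
    sys.stdout.buffer.write(clean)
    return 1 if found else 0
```

To use the sketch as a filter, end the file with sys.exit(main()) and run it with the SVG on standard input. Forcing UTF-8 output would only mean passing encoding="utf-8" to toxml unconditionally.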

It handles tags from other namespaces by verifying that they are from one of a short list of namespaces; currently the only namespace from which tags are reliably removed or modified is the original SVG namespace.
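A namespace check of this kind might look something like the sketch below (Python 3, xml.dom.minidom). The contents of ALLOWED_NAMESPACES and the function name unknown_namespaces are hypothetical: the actual list is not given on this page, and neither is what the script does with a tag that fails the check, so the sketch only reports the offending namespaces.

```python
from xml.dom import minidom

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical allow-list; the namespaces the real script accepts
# are not stated on this page.
ALLOWED_NAMESPACES = {
    SVG_NS,
    "http://www.w3.org/1999/xlink",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://purl.org/dc/elements/1.1/",
}


def unknown_namespaces(xml_bytes):
    """Return the set of element namespaces that are not on the allow-list."""
    doc = minidom.parseString(xml_bytes)
    unknown = set()
    # "*" matches every element in the document, including the root.
    for elem in doc.getElementsByTagName("*"):
        if elem.namespaceURI not in ALLOWED_NAMESPACES:
            unknown.add(elem.namespaceURI)
    return unknown
```

An empty result means every tag comes from a listed namespace. An element with no namespace at all shows up as None in the returned set.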

This script successfully processes essentially all the non-broken files in the openclipart 0.11 release.

All this software requires is a working installation of Python 2.3.