Validating XML documents with PUBLIC identifiers and catalogs

And indenting them, and changing their encoding...

powered by libxml2

To check the validity of XML files, I’ve used the stdinparse utility that comes with Xerces C for years, but no more. While creating some DITA files, I wanted to validate them using the document’s PUBLIC identifier and not its SYSTEM identifier. (I didn’t use PUBLIC identifiers much in the days between SGML and DITA. They’re useful for DITA work because the DITA Open Toolkit automates the assembly of multiple pieces, and sharing pieces in multiple places is easier with PUBLIC declarations, especially if you’re assembling a system that will run on a machine other than your own.)

I did some searches, and it turned out that I’d put the perfect utility on my Windows machine’s hard disk years ago: xmllint, which is part of libxml2. It looks like it’s also included in some Linux distributions, or only an apt-get away. It’s written in C, so it’s fast, and binaries are easy to find for Windows and Linux.

Once you set the SGML_CATALOG_FILES environment variable to point to your catalog, the -catalogs switch tells it to use the catalog. For example:

set SGML_CATALOG_FILES=c:/usr/local/DITA-OT1.4.1/catalog-dita.xml
xmllint -noout -valid -catalogs myditafile.xml

The -noout switch tells xmllint to not output the document itself, -valid tells it to validate the document, and -catalogs tells it to use the catalog defined in SGML_CATALOG_FILES.
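If you'd rather run the same check from a script than type it at a Windows prompt, something like the following Python sketch might do. The function names are my own inventions for illustration, not part of xmllint or libxml2, and it assumes xmllint is on your PATH. It relies on xmllint returning a nonzero exit code and writing its messages to stderr when validation fails.

```python
import os
import shutil
import subprocess


def build_validation_command(xml_path):
    """Return the same xmllint argument list used in the example above."""
    return ["xmllint", "-noout", "-valid", "-catalogs", xml_path]


def catalog_environment(catalog_path, base_env=None):
    """Return a copy of the environment with SGML_CATALOG_FILES set.

    base_env is a hypothetical convenience parameter; by default the
    current process environment is copied.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SGML_CATALOG_FILES"] = catalog_path
    return env


def validate(xml_path, catalog_path):
    """Run xmllint and return (is_valid, error_text)."""
    if shutil.which("xmllint") is None:
        raise FileNotFoundError("xmllint not found on PATH")
    result = subprocess.run(
        build_validation_command(xml_path),
        env=catalog_environment(catalog_path),
        capture_output=True,
        text=True,
    )
    # A zero exit code means the document parsed and validated cleanly.
    return result.returncode == 0, result.stderr
```

Calling validate("myditafile.xml", "c:/usr/local/DITA-OT1.4.1/catalog-dita.xml") should then mirror the two commands above, without changing the environment of the shell you launched it from.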

xmllint has a lot of other nice switches. If you omit the -noout switch, there are some handy transformations you can easily perform on the document. You can indent it with -format, and -encode lets you specify a new encoding for the output, as Dave Holden pointed out when I described some simple XSLT stylesheets I once used to convert the encoding of XML documents. The -noblanks switch drops ignorable white space, -relaxng validates the document against a RELAX NG schema, -schema validates it against a W3C XML Schema, and there are dozens more switches.
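As a rough sketch of how the indenting and re-encoding switches combine, here is a small Python wrapper. Again, the function names and parameters are hypothetical, and it assumes xmllint is on your PATH. It captures xmllint's standard output as bytes, so the new encoding survives untouched, and writes those bytes to a file.

```python
import subprocess


def build_transform_command(xml_path, indent=True, encoding=None):
    """Build an xmllint command that prints a transformed document to stdout."""
    args = ["xmllint"]
    if indent:
        args.append("-format")  # re-indent the output
    if encoding:
        args += ["-encode", encoding]  # e.g. "UTF-8" or "ISO-8859-1"
    args.append(xml_path)
    return args


def transform(xml_path, out_path, indent=True, encoding=None):
    """Run xmllint and save its output; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_transform_command(xml_path, indent, encoding),
        capture_output=True,
        check=True,
    )
    # Write raw bytes so Python does not re-encode the converted document.
    with open(out_path, "wb") as out_file:
        out_file.write(result.stdout)
```

For example, transform("in.xml", "out.xml", encoding="ISO-8859-1") would run xmllint -format -encode ISO-8859-1 in.xml and save the result to out.xml.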

I can’t believe this was sitting on my hard disk for so long without my noticing how useful it can be.

1 Comment

By Caustic Dave on January 27, 2008 11:18 AM

Oh yeah. xmllint is one of my favorite utilities. It has saved me from doom many times.

I wish there was something like it for JavaScript.