I Wish: Efficient XML Encodings for XMPP
I’ve been thinking about this for a while: I’d very much like to see a JEP for efficient XML encoding, in the spirit of Stream Compression. Compressing XML will save you some bandwidth (e.g. 91%), but it doesn’t cut the total processing cycles; on the contrary, it adds a few. A binary encoding already saves some space by itself, but the more important factor is that it is much easier to parse. Considering current transfer speeds, I believe that cutting the parsing and serialization times by 65…95% may have a greater impact on the total response time than only squeezing the bandwidth. In addition, the binary encoding can be compressed as well, giving a double effect.
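To make the bandwidth half of that argument concrete, here is a minimal sketch that measures how much zlib saves on a repetitive stream of stanzas. The stanza, the addresses and the helper name `compression_savings` are made up for illustration, and the figure it prints depends entirely on the sample data, so don’t read it as a benchmark.

```python
import zlib

# Hypothetical sample stanza; addresses and body are invented for this sketch.
STANZA = (
    b"<message from='alice@example.com/desk' to='bob@example.com' type='chat'>"
    b"<body>Hello there</body></message>"
)


def compression_savings(payload: bytes) -> float:
    """Return the fraction of bytes saved by zlib-compressing the payload."""
    compressed = zlib.compress(payload)
    return 1.0 - len(compressed) / len(payload)


# A stream repeats the same tag and attribute names over and over,
# which is exactly what a general-purpose compressor is good at.
stream = STANZA * 200
print(f"saved {compression_savings(stream):.0%} of {len(stream)} bytes")
```

Note what the sketch does not change: the receiver still has to inflate the data and then parse every byte of the original text, so the parsing cost stays where it was.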
(A side note to keep this post concrete: as an example of the proposals, there’s an open source implementation of the Fast Infoset specification available, from the ill-named project FI. The spec is supported even in my favorite JEE server, GlassFish.)
These two optimizations, compression and encoding, work on different layers: compression is transparent to the XML parser (i.e. parsing works independently of the compression), but the encoding hits the parser itself. A ‘hit’ in the sense that it requires a somewhat more profound upgrade of the software systems, but I consider the XML parsers pretty much one of the primary aches of those anyway. (Although Woodstox has really helped there; many thanks to Tatu Saloranta for creating such a gem.) Of course, the binary format can be re-encoded to the human-readable format and thus be fed to the same old parser, but then there’s not much benefit in using the binary format, except that it can provide a smooth transition path.
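The layering is easy to show in a few lines. This is only a sketch under simplifying assumptions: it handles one complete stanza at a time, whereas a real XMPP stream is long-lived and parsed incrementally, and the name `receive_compressed` is hypothetical.

```python
import xml.etree.ElementTree as ET
import zlib

# Hypothetical sample stanza, invented for this sketch.
STANZA = b"<message to='bob@example.com' type='chat'><body>Hi</body></message>"


def receive_compressed(wire_bytes: bytes) -> ET.Element:
    """Inflate the wire bytes, then hand them to the unchanged XML parser."""
    # Layer 1: undo the compression; the result is ordinary XML text.
    xml_text = zlib.decompress(wire_bytes)
    # Layer 2: the parser sees exactly what it would see without compression.
    return ET.fromstring(xml_text)


plain = ET.fromstring(STANZA)
via_zlib = receive_compressed(zlib.compress(STANZA))

# Both paths yield the same tree, so the parser never notices the compression.
assert ET.tostring(plain) == ET.tostring(via_zlib)
```

A binary encoding cannot be slotted in like this without giving up its benefit: either the second layer is replaced by a decoder for the new format, or the bytes are first turned back into text, which is the transition path mentioned above.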
Furthermore, I believe (while lacking evidence) that the binary encoding could be less error-prone. In my own experience, error handling is a major issue with XML streams. The state model of XML parsers is pretty complicated (or what do you think, Tatu?), which makes it pretty hard to handle and recover from errors. I would like to keep the states on the wire simple; with documents it’s less critical. This error-proneness, however, remains mere speculation until further experience is gained.
Some people like to keep all XML in the human-readable format, but I’m willing to consider other encodings for some uses, the wire protocol definitely being one of them. I like having some documents and database tables readable, but there are so many uses for XML: the XML data model has value of its own, beyond the mere data encoding.
Having a human-readable wire protocol is like running your production servers in debug mode, slowing them down remarkably. The gains just don’t cover the costs. And, fortunately, there exist tools like Wireshark (formerly Ethereal) that can represent various wire payloads very nicely. It’s a good tool and uses the de facto way of tapping into the wire anyway. So the human-readable representation is available at no cost (as in effort) if you use the standard tooling of the industry. (Of course, Wireshark may not be able to decode various XML encodings right at this moment, but I believe it wouldn’t take too long.)
People so keen on the human-readable format are a bit of the ‘black and white’ kind, don’t you think? No sensitivity to context, seeing no colors in the world. Having a choice should always be a good thing! Isn’t that part of the whole open source ideology? Bad solutions just fade away.
Hmm, now I only wonder how to get Peter to read this post…
(Note also the follow-up.)