Utterances of a Zimboe

Programming the Internet.

I Wish: Efficient XML Encodings for XMPP


I’ve been thinking about this for a while: I’d very much like to see a JEP for efficient XML encoding, in the spirit of Stream Compression. Compressing XML saves you some bandwidth (e.g. 91%), but it doesn’t cut the total processing cycles; on the contrary, it adds a few. A binary encoding already saves some space by itself, but the more important factor is that it is much easier to parse. Considering current transfer speeds, I believe that cutting parsing and serialization times by 65…95% may have a greater impact on the total response time than squeezing the bandwidth alone. In addition, the binary encoding can be compressed as well, so the two gains stack.
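To make the "compression adds cycles" point concrete, here is a minimal Python sketch using only the standard library. The stanza, the `<stream>` wrapper and the function names are made up for illustration (a real XMPP stream uses namespaces and negotiates compression in-band); the only thing it is meant to show is that the receiver still runs the full XML parser after an extra decompression step.

```python
import zlib
import xml.etree.ElementTree as ET

# Hypothetical, simplified chat stanza (no namespaces, not a real capture).
STANZA = (
    "<message from='romeo@example.net/orchard' to='juliet@example.com' type='chat'>"
    "<body>Wherefore art thou?</body></message>"
)

def build_stream(n):
    """Concatenate n stanzas inside a made-up <stream> wrapper element."""
    return ("<stream>" + STANZA * n + "</stream>").encode("utf-8")

def receive_compressed(payload):
    """Receiver side: decompress first, then parse exactly as before."""
    xml_bytes = zlib.decompress(payload)  # extra step introduced by compression
    return ET.fromstring(xml_bytes)       # the parsing work itself is unchanged

raw = build_stream(200)
compressed = zlib.compress(raw)
root = receive_compressed(compressed)

# Bytes on the wire shrink, but the parser still sees every character of `raw`.
print(len(raw), len(compressed), len(root))
```

Repetitive traffic like this compresses very well, which flatters the numbers; the ratio for a real stream depends on its content.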

(A side note to keep this post concrete: as an example of the proposals, there’s an open source implementation of the Fast Infoset specification, by the ill-named project FI. The spec is even supported in my favorite JEE server, GlassFish.)

These two optimizations, compression and encoding, work on different layers: compression is transparent to the XML parser (i.e. parsing works independently of the compression), but the encoding potentially hits the parser itself. A ‘hit’ in the sense that it requires a somewhat more profound upgrade of the software systems, but I consider XML parsers one of the primary aches of those systems anyway. (Although Woodstox has really helped there; many thanks to Tatu Saloranta for creating such a gem.) Of course, the binary format can be re-encoded to the human-readable format and fed to the same old parser, but then there’s not much benefit in using the binary format, except that it provides a smooth transition path.
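The difference between the two layers can be sketched with a toy length-prefixed encoding of the XML data model. To be loud about it: this is not Fast Infoset and not any proposed XMPP format; the byte layout and the `encode`/`decode` names are invented for this example, and it ignores tail text and anything longer than its one- and two-byte counters allow. What it illustrates is that a binary decoder reads lengths and copies bytes, with no tag scanning or entity unescaping, and that decoding back to textual XML gives the transition path to the same old parser.

```python
import struct
import xml.etree.ElementTree as ET

def _put_str(out, s):
    """Append a UTF-8 string prefixed with its 2-byte big-endian length."""
    data = s.encode("utf-8")
    out.extend(struct.pack(">H", len(data)))
    out.extend(data)

def _get_str(buf, pos):
    """Read a length-prefixed string; return (string, next position)."""
    (n,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    return buf[pos:pos + n].decode("utf-8"), pos + n

def _encode_into(elem, out):
    _put_str(out, elem.tag)
    out.append(len(elem.attrib))          # attribute count (toy limit: 255)
    for key, value in elem.attrib.items():
        _put_str(out, key)
        _put_str(out, value)
    _put_str(out, elem.text or "")
    out.append(len(elem))                 # child count (toy limit: 255)
    for child in elem:
        _encode_into(child, out)

def encode(elem):
    """Serialize an element tree to the toy binary layout."""
    out = bytearray()
    _encode_into(elem, out)
    return bytes(out)

def _decode_from(buf, pos):
    tag, pos = _get_str(buf, pos)
    elem = ET.Element(tag)
    n_attrs = buf[pos]
    pos += 1
    for _ in range(n_attrs):
        key, pos = _get_str(buf, pos)
        value, pos = _get_str(buf, pos)
        elem.set(key, value)
    text, pos = _get_str(buf, pos)
    elem.text = text or None
    n_children = buf[pos]
    pos += 1
    for _ in range(n_children):
        child, pos = _decode_from(buf, pos)
        elem.append(child)
    return elem, pos

def decode(buf):
    """Rebuild the element tree without any character-level XML parsing."""
    elem, _ = _decode_from(buf, 0)
    return elem

original = ET.fromstring(
    "<message to='juliet@example.com' type='chat'><body>Hi &amp; bye</body></message>"
)
binary = encode(original)
decoded = decode(binary)

# Transition path: re-encode to textual XML for a parser that only speaks text.
text_again = ET.tostring(decoded)
```

This toy makes no claim about size; real formats get their savings from things like indexing repeated names, which is left out here to keep the sketch short.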

Furthermore, I believe (while lacking evidence) that a binary encoding could be less error-prone. In my own experience, error handling is a major issue with XML streams. The state model of XML parsers is pretty complicated (or what do you think, Tatu?), which makes it hard to handle and recover from errors. I would like to keep the states simple on the wire; with documents it’s less critical. This error-proneness, however, remains mere speculation until further experience is gained.

Some people like to keep all XML in the human-readable format, but I’m willing to consider other encodings for some uses, a wire protocol definitely being one of them. I like some documents and database tables to be readable, but there are so many uses for XML: it’s good to note the value of the XML data model itself, in addition to the mere data encoding.

Having a human-readable wire protocol is like running your production servers in debug mode, slowing them down remarkably. The gains just don’t cover the costs. And, fortunately, there exist tools like Wireshark (formerly Ethereal) that can represent various wire payloads very nicely. It’s a good tool and uses the de facto way of tapping into the wire anyway. So the human-readable representation is available at no cost (as in effort) if you use the standard tooling of the industry. (Of course, Wireshark may not be able to decode the various XML encodings right at this moment, but I believe it wouldn’t take too long.)

The people so keen on the human-readable format are a bit of a ‘black and white’ kind of person, don’t you think? No sensitivity to context, seeing no colors in the world. Having a choice should always be a good thing! Isn’t that part of the whole open source ideology? Bad solutions just fade away.

Hmm, now I only wonder how to get Peter to read this post…

(Note also the follow-up.)




Written by Janne Savukoski

September 13, 2006 at 8:20 am

Posted in Technology

5 Responses


  1. I wouldn’t wait for psa to do it (that could be a very long wait). The good news is anyone can write a JEP and submit it.

    The biggest problem is settling on one specific XML compression format though, not the JEP. This will require some research first (esp. into which libraries can handle streaming XML).


    September 13, 2006 at 10:13 pm

  2. I think peter did read it :)


    September 14, 2006 at 9:56 am

  3. This has been discussed several times in the past, but no one has put forward any real evidence that we have a problem with normal XML. People in our community working on high performance servers stated it was not an issue.

    Also consider that a message sent as uncompressed XML, in the currently most common chat use cases, will usually fit in one Ethernet frame.


    September 14, 2006 at 10:03 am

  4. Yup, if it ain’t broken..

    But it got me thinking.. as most of the messages should fit into a single frame, won’t that make the compression less important? But, on the other hand, with low traffic pretty much nothing matters, including the encoding. On the high-performance part (s2s), won’t the frames get pretty full? Including more than one message, that is. In which case the compression should provide some value.

    The parsing, on the other hand, is separate from the transport, i.e. faster parsing decreases RTT even if each message is transported in a frame of its own. While the primary application area of XMPP is human communication (for now, at least) and a few milliseconds don’t matter, this would help significantly (hundreds of percent) if the throughput bottleneck were the XML processing stages.

    How about large MUCs? Won’t they generate a lot of traffic? At least IRC servers are experiencing somewhat heavy traffic.

    Janne Savukoski

    September 14, 2006 at 2:14 pm

  5. I, for one, see a big need for Fast Infoset XMPP.

    XMPP has quickly transformed into a multipurpose XML routing infrastructure, and talking about human-to-human IM is beside the point. XMPP is used in mobile devices for RMI and location-based services, and here, if the parsing / generation / size of an XMPP message can be reduced to 50%, then we can get twice as many location updates from mobile clients.

    Talking about high-performance servers is also somewhat of a non-issue, because there you can just add more hardware if latency is not a big issue.

    Tero Keski-Valkama

    April 26, 2010 at 11:14 am
