Thursday, March 31, 2005

XML Compression

Today i had a presentation on XML compression and it turned out to be good. We prepared well and in the end were able to answer most of the questions which students had , the presentation had the following things:

Motivation for XML Compression
Techniques for achieving XML compression
XMill
XMill Architecture


Structured nature of XML makes it understandable to humans.
Downside: XML is verbose
Each non-empty element tag must end with a matching closing tag -- data
Ordering of tags is often repeated in a document (e.g. multiple records)

XML documents are text-based: well-known compression schemes such as Huffman and LZ can be easily applied
Can gain a significant savings from compression, due to highly structured nature of XML
XML is being used more frequently in real-time applications (e.g. web service-based e-commerce applications); increasing interest in finding ways to reduce overall size of XML documents



Usually some degree of repetition in an XML document (multiple occurrences of tags, attribute or data values)
Compression schemes like Huffman and LZ can use this repetition to achieve some degree of compression


Many existing (and efficient) implementations of these algorithms are readily available (e.g. gzip)
Downside is that these techniques aren’t fully capable of exploiting the structure of XML to achieve greater compression


ACDABA
Since these are 6 characters, this text is 6 bytes or 48 bits long
tree is build that replaces the symbols by shorter bit sequences. In this particular case, the algorithm would use the following substitution table: A=0, B=10, C=110, D=111
01101110100 (ACDABA = 11 bits)


Lempel-Ziv 77 algorithm
Dictionary is a portion of encoded sequence
The encoder examines the input sequence through a sliding window
The window consists of two parts:
a search buffer that contains a portion of the recently encoded sequence, and
a look-ahead buffer that contains the next portion of the sequence to be encoded.


Relies heavily on zlib, the compression library used in gzip
Also defines a few data type specific compressors; user-defined compressors can be added using SCAPI (Semantic Compressor API)
During compression, each XML tag is examined to see which compression technique(s) should be applied


View XML as a tree
Separate the tree structure and what is stored in leaves
Save the tree structure so that it can be restored
The compressed file may or may not remember the tree structure


XMill applies 3 principles during compression:
Separate structure (element tags and attribute names) from data
Group related data items in a single container; compress each container separately
Apply appropriate semantic compressors to each container

Start tags and attribute names are dictionary-encoded (as T1, T2, etc.)
End tags replaced with ‘/’ token
Data values replaced with their container number


SAX Parser
sends tokens to the path processor.

Path Processor
determines how to map data values to containers.

Semantic Compressors
compresses the input and copies it to the container – in the memory window.

No comments: