Canonical XML for Tidy, Regular Structure

Canonical XML has absolutely nothing to do with the Ubuntu Linux distribution. It is a somewhat forgotten canonical form of XML, first drafted in 1999 and published as a W3C Recommendation in 2001, and it has uses in a wide variety of XML applications, including those that make web service calls over SOAP or REST. Any piece of XML can be transformed into a piece of Canonical XML. It is essentially a normalized serialization of XML in which non-essential syntax is removed and attributes are ordered, while the logical meaning of the XML does not change.

When XML is converted to its Canonical XML form, the essential data and values are retained and arranged into a regular structure, while non-essential syntax is removed. The XML declaration and document type declaration are dropped, and whitespace inside start and end tags, such as extra spaces between attributes, is normalized. Whitespace in the text content between tags is retained, because it may be significant. Attributes are sorted into lexicographic order and their values are always delimited by double quotes. In addition, CDATA sections are replaced with their character content, entity references are replaced with their values, default attribute values are written out explicitly, and empty, self-closing elements are replaced with the corresponding pair of opening and closing tags.

Whatever changes the transformation makes, the meaning of the data expressed by the XML does not change; it has merely been normalized, like a list that has been sorted. This is where Canonical XML gains its value. Because the meaning is preserved but the XML now has a regular structure, pieces of XML can be compared to each other reliably, in the same way that lists can be compared after being sorted. In XML-based applications, comparing two pieces of XML often means invoking a parser and extracting individual values to compare one by one. Once both documents are in canonical form, no further parsing is needed: the application only needs to compare the strings.
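To make the normalization concrete, here is a minimal sketch in Python. It assumes Python 3.8 or later, whose standard library provides `xml.etree.ElementTree.canonicalize`; note that this function implements the later C14N 2.0 revision rather than the original 2001 Recommendation, although the rules exercised here (attribute ordering, quoting, CDATA and empty elements) behave the same way. The `order` documents are made-up sample data.

```python
from xml.etree.ElementTree import canonicalize

# Two hypothetical documents that carry the same data but are written differently:
# different attribute order and quoting, a self-closing tag, and a CDATA section.
doc_a = '<order id="7" currency="USD"><item sku="A1"/><note><![CDATA[a < b]]></note></order>'
doc_b = "<order currency='USD'   id='7'><item sku='A1'></item><note>a &lt; b</note></order>"

canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)

# Both inputs serialize to the same canonical string:
# <order currency="USD" id="7"><item sku="A1"></item><note>a &lt; b</note></order>
print(canon_a)
print(canon_a == canon_b)
```

The raw strings `doc_a` and `doc_b` differ, yet their canonical forms are identical, so a plain string comparison is enough to tell that they express the same data.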

Canonical XML is an old idea that current applications have revitalized. It is a great, low-overhead alternative to fully parsing an XML document when you only need to compare it with another document in order to identify it and decide what to do with it. Programmers typically code XML comparison as they would the comparison of any data structure, by comparing the values of the member variables one by one. Canonical XML gives your XML document a structure that loses no meaning, and the normalized structure means that a straight string comparison is enough to establish equality.
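The comparison itself can be wrapped in a small helper. The following is a sketch under stated assumptions, not a definitive implementation: `same_xml` and its `ignore_indentation` parameter are hypothetical names, and the code relies on `xml.etree.ElementTree.canonicalize` from Python 3.8+ (a C14N 2.0 implementation). Canonical XML keeps whitespace in text content, so pretty-printed and compact versions of a document do not match by default; the `strip_text` option trims that whitespace, which is only appropriate if you assume indentation carries no meaning in your data.

```python
from xml.etree.ElementTree import canonicalize


def same_xml(left: str, right: str, ignore_indentation: bool = False) -> bool:
    """Compare two XML strings by comparing their canonical forms.

    Hypothetical helper: when ignore_indentation is True, leading and
    trailing whitespace in text content is stripped before comparing.
    """
    canon_left = canonicalize(left, strip_text=ignore_indentation)
    canon_right = canonicalize(right, strip_text=ignore_indentation)
    return canon_left == canon_right


# Made-up sample documents.
compact = '<user id="42" role="admin"><name>Ada</name></user>'
pretty = """<user role="admin" id="42">
    <name>Ada</name>
</user>"""
changed = '<user id="42" role="guest"><name>Ada</name></user>'

print(same_xml(compact, pretty))                           # indentation counts as content
print(same_xml(compact, pretty, ignore_indentation=True))  # indentation ignored
print(same_xml(compact, changed, ignore_indentation=True)) # a real data difference
```

If the same documents are compared repeatedly, the canonical strings can be computed once and stored, so that later checks are string comparisons with no parser involved.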

By Taylor Gillespie

Taylor is a Staff Writer for DevWebPro