Comparing XML with XMLUnit and Character Encoding

By | September 18, 2013

XMLUnit is an excellent library for comparing XML when you want to go beyond mere string comparison and, for instance, disregard whitespace or other aspects of the XML fragments that are compared.

In my current area of interest, systems integration, the fact that two XML fragments convey the same information is not always enough. In Sweden we have three characters, å, ä and ö, that are often garbled if, for instance, a string is created from a byte array using an inappropriate character encoding. The problem is even worse in other languages, such as Chinese. Thus not only the structure and the contents but also the character encoding need to be correct.
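To make the garbling concrete, here is a minimal sketch using only the JDK. The class and method names are made up for this illustration. It encodes å, ä and ö as UTF-8 and then decodes the bytes as ISO-8859-1, which is the kind of mismatch described above.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    /** Encodes the text as UTF-8, then decodes the bytes as ISO-8859-1 (the mistake). */
    static String decodeWithWrongCharset(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same charset, which round-trips cleanly. */
    static String decodeWithRightCharset(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three Swedish letters, written as escapes so that the encoding
        // of this source file does not matter.
        String swedish = "\u00e5\u00e4\u00f6";

        // Each letter occupies two bytes in UTF-8, and ISO-8859-1 turns every
        // byte into one character, so the garbled string has six characters.
        System.out.println(decodeWithWrongCharset(swedish).length());

        // Decoding with the matching charset returns the original text.
        System.out.println(decodeWithRightCharset(swedish).equals(swedish));
    }
}
```

The garbled result is still a perfectly valid string, which is why this kind of error tends to slip through unless a test checks for it explicitly.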

The other day I was examining the result of a unit test in which I used XMLUnit to compare an actual XML fragment with an expected ditto. Printing both of the XML fragments to the console, I noticed that both claimed to use UTF-16.

When I changed the encoding stated in the expected XML fragment, XMLUnit continued to insist that the XML fragments were still identical and still reported the encoding to be UTF-16.

Trying to Persuade XMLUnit to Report Differences in Character Encoding

I wanted XMLUnit to report differences in character encoding as a difference, along with any additional differences. To make a long story short, this turned out to be more difficult than I thought it would be. The reason lies in the Diff class in XMLUnit which, prior to comparing two XML fragments, may strip comments, whitespace etc. from the XML fragments that are to be compared. In this process, it creates in-memory copies of the DOM documents (org.w3c.dom.Document) which do not contain encoding information. I say "may" because it depends on how you configure XMLUnit.
The processing of the two XML fragments takes place in one of the constructors of the Diff class:

This means that the original DOM documents containing the two XML fragments to compare are not available to the Diff instance once the constructor has been exited. You could save the character encodings used by the two documents, but there are other obstacles, such as creating instances of Difference.
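The loss of encoding information in a copied document can be reproduced with the JDK's DOM API alone. The sketch below is not XMLUnit's code, and the class and method names are invented for the illustration. It parses a document that declares ISO-8859-1, builds a fresh in-memory document, imports the root element into it, and then asks both documents for their declared encoding with getXmlEncoding.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class CopyLosesEncodingDemo {

    /** Creates a DOM parser; checked exceptions are wrapped to keep the sketch short. */
    static DocumentBuilder newBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create a DOM parser", e);
        }
    }

    /** Parses raw bytes so that the parser reads the XML declaration itself. */
    static Document parse(byte[] xml) {
        try {
            return newBuilder().parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Builds a new, empty document and imports the original's root element into it. */
    static Document copyOf(Document original) {
        Document copy = newBuilder().newDocument();
        copy.appendChild(copy.importNode(original.getDocumentElement(), true));
        return copy;
    }

    /** A small sample document (made up for this sketch) that declares ISO-8859-1. */
    static byte[] sampleXml() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><city>Malm\u00f6</city>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        Document original = parse(sampleXml());
        Document copy = copyOf(original);

        // The parsed document remembers the encoding from its XML declaration.
        System.out.println("original: " + original.getXmlEncoding());
        // The copy was never parsed from bytes, so it has no declaration to report.
        System.out.println("copy:     " + copy.getXmlEncoding());
        // The content itself survives the copy.
        System.out.println("content:  " + copy.getDocumentElement().getTextContent());
    }
}
```

Whatever mechanism a library uses to build such a copy, the encoding is a property of the parsed document rather than of its nodes, so it has to be captured before the copy is made.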

Experimenting

I wanted to experiment with XMLUnit and my attempts at having differences in character encoding reported from my unit tests, so I set up a small example Maven project in Eclipse. Here I will present the final version of the project.

The Maven Pom File

Since this is coding for fun, I chose to use TestNG instead of JUnit. I do like my old fellow JUnit but we meet quite frequently at my job and I thought I’d call in on an old acquaintance.

The dependencies in this pom-file are not in the test-scope, since the test is the main code of this project.

Example Files

I needed a few example XML files, which I placed in src/test/resources:

Test Class

Finally, I implemented a TestNG unit test in which I conducted my experiments:

The method setUpBeforeTests is executed before each test method is invoked. In this method, XMLUnit is configured. This particular configuration will, as mentioned before, cause XMLUnit to create in-memory copies of the DOM documents containing the XML fragments to be compared. In these copies, the character encoding information will be null.

In the first test method, comparedIdenticalDifferentEncoding, two XML fragments that are identical except for the character encoding are compared: first the structure and contents of the XML fragments are compared using XMLUnit, then the character encodings of the two documents are compared. The most appropriate approach I have found is thus to write code in the test methods that explicitly retrieves and compares the character encodings of the XML fragments that are to be compared.
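An explicit encoding check of this kind could look roughly like the sketch below. It uses only the JDK, leaves out XMLUnit and TestNG, and the names EncodingCheck and sameDeclaredEncoding are hypothetical rather than taken from the test class. It assumes that "character encoding" here means the encoding stated in the XML declaration, as returned by Document.getXmlEncoding.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class EncodingCheck {

    /** Encodes the XML text with the given charset and parses the resulting bytes. */
    static Document parse(String xml, Charset charset) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(charset)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /**
     * Returns true only if both documents declare an encoding and the two
     * declarations match. Charset names are case-insensitive. A missing
     * declaration counts as a mismatch so that lost encoding information
     * is not silently accepted.
     */
    static boolean sameDeclaredEncoding(Document expected, Document actual) {
        String expectedEncoding = expected.getXmlEncoding();
        String actualEncoding = actual.getXmlEncoding();
        return expectedEncoding != null && expectedEncoding.equalsIgnoreCase(actualEncoding);
    }

    public static void main(String[] args) {
        String body = "<city>Malm\u00f6</city>";
        Document utf8 = parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body,
                StandardCharsets.UTF_8);
        Document latin1 = parse("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body,
                StandardCharsets.ISO_8859_1);

        // The parsed content is the same in both documents...
        System.out.println(utf8.getDocumentElement().getTextContent()
                .equals(latin1.getDocumentElement().getTextContent()));
        // ...but the declared encodings differ, which only this check detects.
        System.out.println(sameDeclaredEncoding(utf8, latin1));
    }
}
```

In a test method, the boolean result would typically be passed to an assertion right after the structural comparison has succeeded.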

The second test method, compareDifferentSameEncoding, shows what a comparison of two XML fragments that really do differ looks like.

The compareXmlDocumentsIdentical method is a helper method that, using XMLUnit, compares the two supplied DOM documents and lists any differences.

Finally, the readXmlFile method reads an XML fragment in a way that makes the character encoding of the XML fragment available in the resulting DOM document.
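One way to read a file so that the declared encoding ends up in the DOM document is to hand the parser the raw bytes of the file and let it interpret the XML declaration itself. The sketch below shows that idea with the JDK only. It is a guess at the approach, not the actual readXmlFile from the test class, and the writeSample helper exists purely to give the example something to read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ReadXmlFileSketch {

    /** Parses the file from a byte stream, so the parser sees the XML declaration. */
    static Document readXmlFile(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("Could not read " + file, e);
        }
    }

    /** Writes a small ISO-8859-1 encoded sample file (hypothetical test data). */
    static Path writeSample() {
        try {
            Path file = Files.createTempFile("sample", ".xml");
            file.toFile().deleteOnExit();
            String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><city>Malm\u00f6</city>";
            return Files.write(file, xml.getBytes(StandardCharsets.ISO_8859_1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document document = readXmlFile(writeSample());
        // The encoding from the XML declaration is available on the document.
        System.out.println(document.getXmlEncoding());
        // The parser used that encoding to decode the content correctly.
        System.out.println(document.getDocumentElement().getTextContent());
    }
}
```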

Running the Test

When the test is run, the following is output to the console:

Note that:

  • The compareDifferentSameEncoding method reads two XML documents with the same character encoding, namely UTF-8.
  • In the compareDifferentSameEncoding method, the compared XML fragments are neither identical nor similar.
  • Four differences between the compared XML fragments are reported from the compareDifferentSameEncoding method.
    Note that the information for each difference is quite detailed, including XPath expressions specifying the XML nodes compared.
  • The comparedIdenticalDifferentEncoding method reads two XML documents with different character encodings: UTF-8 and ISO-8859-1.
  • The XML fragments compared in the comparedIdenticalDifferentEncoding method are reported to be both similar and identical.
  • No differences are listed from the comparedIdenticalDifferentEncoding method.

Conclusion

While XMLUnit is a capable library, it was not able to accommodate my desire to make the comparison of character encodings part of its comparison process.
This is not a big drawback, since you can incorporate the comparison of character encodings in helper methods such as compareXmlDocumentsIdentical.
