Comparing XML with XMLUnit and Character Encoding

By | September 18, 2013

XMLUnit is an excellent library for comparing XML when you want to go beyond mere string comparison and, for instance, disregard whitespace or other aspects of the XML fragments that are compared.

In my current area of interest, systems integration, the fact that two XML fragments convey the same information is not always enough. In Sweden we have three characters, å, ä and ö, that are often garbled if, for instance, a string is created from a byte array using an inappropriate character encoding. This is even worse in other languages, such as Chinese. Thus not only the structure and the contents but also the character encoding need to be correct.
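
The garbling described above can be reproduced with the JDK alone. The following is a minimal sketch, not code from the example project; the class and method names are made up for illustration. It encodes å, ä and ö as UTF-8 and then decodes the bytes with ISO-8859-1, which turns every character into two wrong ones.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: shows how å, ä and ö are garbled by a wrong charset. */
class GarbledCharactersDemo {

    /** Encodes the text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1. */
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same charset, which round-trips cleanly. */
    static String decodeUtf8BytesAsUtf8(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes for å, ä and ö keep this source file encoding-neutral.
        String swedish = "\u00e5\u00e4\u00f6";

        String garbled = decodeUtf8BytesAsLatin1(swedish);
        System.out.println("Garbled length: " + garbled.length()); // 6 instead of 3
        System.out.println("Round trip intact: "
            + decodeUtf8BytesAsUtf8(swedish).equals(swedish));
    }
}
```

Each Swedish character occupies two bytes in UTF-8, and ISO-8859-1 maps every byte to one character, so the three characters come out as six.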

The other day I was examining the result of a unit test in which I used XMLUnit to compare an actual XML fragment with an expected ditto. Printing both of the XML fragments to the console, I noticed that both claimed to use UTF-16.

<?xml version="1.0" encoding="UTF-16"?>

When I changed the encoding stated in the expected XML fragment, XMLUnit continued to insist that the XML fragments were still identical and still reported the encoding to be UTF-16.

Trying to Persuade XMLUnit to Report Differences in Character Encoding

I wanted XMLUnit to report differences in character encoding as a difference, along with any additional differences. To make a long story short, this turned out to be more difficult than I thought it would be. The reason lies in the Diff class in XMLUnit which, prior to comparing two XML fragments, may strip comments, whitespace etc. from the XML fragments that are to be compared. In this process, it creates in-memory copies of the DOM documents (org.w3c.dom.Document) which do not contain encoding information. I say may, because it depends on how you configure XMLUnit.

The processing of the two XML fragments takes place in one of the constructors of the Diff class:

/**
 * Construct a Diff that compares the XML in two Documents using a specific
 * DifferenceEngine and ElementQualifier
 */
public Diff(Document controlDoc, Document testDoc,
            DifferenceEngine comparator, 
            ElementQualifier elementQualifier) {
    this.controlDoc = getManipulatedDocument(controlDoc);
    this.testDoc = getManipulatedDocument(testDoc);
    this.elementQualifierDelegate = elementQualifier;
    this.differenceEngine = comparator;
    this.messages = new StringBuffer();
}

This means that the original DOM documents containing the two XML fragments to compare are not available to the Diff instance once the constructor has been exited. You could save the character encodings used by the two documents, but there are other obstacles, such as creating instances of Difference.

Experimenting

I wanted to experiment with XMLUnit and my attempts at having differences in character encoding reported from my unit tests, so I set up a small example Maven project in Eclipse. Here I will present the final version of the project.

The Maven Pom File

Since this is coding for fun, I chose to use TestNG instead of JUnit. I do like my old fellow JUnit but we meet quite frequently at my job and I thought I’d call in on an old acquaintance.

<project
    xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.ivan.xml</groupId>
	<artifactId>xmlunitexamples</artifactId>
	<version>1.0.0-SNAPSHOT</version>

	<dependencies>
		<dependency>
			<groupId>xmlunit</groupId>
			<artifactId>xmlunit</artifactId>
			<version>1.4</version>
		</dependency>
		<dependency>
			<groupId>org.testng</groupId>
			<artifactId>testng</artifactId>
			<version>6.8.5</version>
		</dependency>
	</dependencies>
</project>

The dependencies in this pom-file are not in the test-scope, since the test is the main code of this project.

Example Files

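I needed a few example XML files, which I placed in src/test/resources. In the order listed, they are command-sheet-iso88591.xml, command-sheet-utf8.xml and command-sheet2-utf8.xml: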

<?xml version="1.0" encoding="ISO-8859-1"?>
<tns:CommandSheet
    xmlns:tns="http://www.example.com/SpaceWarGame"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <commands>
        <move>
            <point x="3.14159E0" y="3.14159E0"/>
        </move>
        <jump>
            <destination x="3.14159E0" y="3.14159E0"/>
        </jump>
        <attack>
            <targetLocation x="3.14159E0" y="3.14159E0"/>
        </attack>
        <transmit>
            <text>String</text>
        </transmit>
    </commands>
</tns:CommandSheet>

<?xml version="1.0" encoding="UTF-8"?>
<tns:CommandSheet
    xmlns:tns="http://www.example.com/SpaceWarGame"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <commands>
        <move>
            <point x="3.14159E0" y="3.14159E0"/>
        </move>
        <jump>
            <destination x="3.14159E0" y="3.14159E0"/>
        </jump>
        <attack>
            <targetLocation x="3.14159E0" y="3.14159E0"/>
        </attack>
        <transmit>
            <text>String</text>
        </transmit>
    </commands>
</tns:CommandSheet>

<?xml version="1.0" encoding="UTF-8"?>
<tns:CommandSheet
    xmlns:tns="http://www.example.com/SpaceWarGame"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <commands>
        <move>
            <point x="3.14159E0" y="3.12159E0"/>
        </move>
        <jump>
            <destination y="3.14159E0"/>
        </jump>
        <attack>
            <targetLocation y="3.14159E0" x="3.14159E0"/>
        </attack>
        <transmit>
            <text>Strong</text>
        </transmit>
    </commands>
</tns:CommandSheet>

Test Class

Finally, I implemented a TestNG unit test in which I conducted my experiments:

package com.ivan.xmlunit;

import java.io.InputStream;
import org.testng.Assert;
import org.custommonkey.xmlunit.DetailedDiff;
import org.custommonkey.xmlunit.XMLUnit;
import org.testng.Reporter;
import org.testng.annotations.BeforeMethod;
import org.testng.annotations.Test;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/**
 * Experiments with XMLUnit and character encoding in XML documents.
 * 
 * @author Ivan Krizsan
 */
public class TestXMLUnit {
    /* Constant(s): */
    /** Classpath path to UTF-8 encoded XML fragment. */
    private static final String XML_FRAGMENT_UTF8 = "/command-sheet-utf8.xml";
    /** Classpath path to ISO-8859-1 encoded XML fragment. */
    private static final String XML_FRAGMENT_ISO88591 =
        "/command-sheet-iso88591.xml";
    /**
     * Classpath path to UTF-8 encoded XML fragment which differs from the
     * first UTF-8 encoded XML fragment.
     */
    private static final String XML_FRAGMENT_UTF8_2 =
        "/command-sheet2-utf8.xml";

    /**
     * Prepares for tests by configuring XMLUnit.
     */
    @BeforeMethod
    public void setUpBeforeTests() {
        /* 
         * This configuration will cause XMLUnit to normalize the compared
         * XML fragments and create in-memory copies of the DOM documents.
         * The copies will be stripped of character encoding information.
         */
        XMLUnit.setIgnoreAttributeOrder(true);
        XMLUnit.setIgnoreComments(true);
        XMLUnit.setIgnoreWhitespace(true);
    }

    /**
     * Compares two identical XML fragments that have different character
     * encoding.
     * 
     * @throws Exception If error occurs. Indicates test failure.
     */
    @Test
    public void comparedIdenticalDifferentEncoding() throws Exception {
        Reporter.log("\n***** comparedIdenticalDifferentEncoding:", true);

        /* Read XML fragments to compare. */
        Document theUtf8Document = readXmlFile(XML_FRAGMENT_UTF8);
        Document theIsoDocument = readXmlFile(XML_FRAGMENT_ISO88591);

        boolean theIdenticalFlag =
            compareXmlDocumentsIdentical(theUtf8Document, theIsoDocument);

        Assert.assertTrue(theIdenticalFlag);

        /*
         * In order to be able to determine whether the documents'
         * character encodings are the same, an explicit check needs to be
         * performed in the test.
         * 
         * This cannot be accomplished with XMLUnit, since XMLUnit will
         * normalize the XML fragments to be compared when told to ignore
         * whitespace etc. In this process, in-memory DOM documents will be
         * created which have no character encoding specified.
         * XML document normalization in XMLUnit can be turned off, but then
         * we lose the ability to ignore whitespace, attribute order etc.
         * See the method getNormalizedDocument in the Diff class in XMLUnit.
         */
        String theUtf8DocumentEncoding = theUtf8Document.getXmlEncoding();
        String theIsoDocumentEncoding = theIsoDocument.getXmlEncoding();

        Assert.assertFalse(theUtf8DocumentEncoding
            .equalsIgnoreCase(theIsoDocumentEncoding));
    }

    /**
     * Compares two XML fragments that are different but have the same
     * character encoding.
     * 
     * @throws Exception If error occurs. Indicates test failure.
     */
    @Test
    public void compareDifferentSameEncoding() throws Exception {
        Reporter.log("\n***** compareDifferentSameEncoding:", true);

        /* Read XML fragments to compare. */
        Document theUtf8Document = readXmlFile(XML_FRAGMENT_UTF8);
        Document theUtf8Document2 = readXmlFile(XML_FRAGMENT_UTF8_2);

        boolean theIdenticalFlag =
            compareXmlDocumentsIdentical(theUtf8Document, theUtf8Document2);

        Assert.assertFalse(theIdenticalFlag);

        /*
         * In order to be able to determine whether the documents'
         * character encodings are the same, an explicit check needs to be
         * performed in the test.
         */
        String theUtf8DocumentEncoding = theUtf8Document.getXmlEncoding();
        String theUtf8Document2Encoding = theUtf8Document2.getXmlEncoding();

        Assert.assertTrue(theUtf8DocumentEncoding
            .equalsIgnoreCase(theUtf8Document2Encoding));
    }

    /**
     * Compares supplied XML documents returning flag indicating whether
     * the documents are identical or not.
     * Will not take the character encoding of the documents into consideration.
     * 
     * @param inExpected Expected XML document.
     * @param inActual Actual XML document.
     * @return True if supplied XML documents are identical, false otherwise.
     */
    private boolean compareXmlDocumentsIdentical(final Document inExpected,
        final Document inActual) {
        /*
         * Create a DetailedDiff in order to obtain a list of differences
         * between the compared XML fragments.
         */
        DetailedDiff theDetailedDiff =
            new DetailedDiff(XMLUnit.compareXML(inExpected, inActual));

        boolean theIdenticalFlag = theDetailedDiff.identical();
        boolean theSimilarFlag = theDetailedDiff.similar();

        Reporter.log("XML fragments are identical: " + theIdenticalFlag, true);
        Reporter.log("XML fragments are similar: " + theSimilarFlag, true);
        Reporter.log("Differences: ", true);
        /* The difference list is not properly typed in XMLUnit. */
        for (Object theDifference : theDetailedDiff.getAllDifferences()) {
            Reporter.log("   " + theDifference, true);
        }

        return theIdenticalFlag;
    }

    /**
     * Reads XML file at the supplied path on the classpath.
     * 
     * @param inXmlPath Path to XML file on the classpath.
     * @return DOM document containing XML file.
     * @throws Exception If error occurs reading XML file.
     */
    private Document readXmlFile(final String inXmlPath) throws Exception {
        /* Load the resource through the test class rather than ClassLoader.class. */
        InputStream theXmlInputStream =
            getClass().getResourceAsStream(inXmlPath);

        InputSource theSource = new InputSource(theXmlInputStream);
        Document theXmlDocument = XMLUnit.buildControlDocument(theSource);

        Reporter.log(
            "Read document with encoding: " + theXmlDocument.getXmlEncoding(),
            true);

        return theXmlDocument;
    }
}

The method setUpBeforeTests is executed before each test method is invoked. In this method, XMLUnit is configured. This particular configuration will, as mentioned before, cause XMLUnit to create in-memory copies of the DOM documents containing the XML fragments to be compared. In these copies, the character encoding information will be null.
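
The loss of encoding information in an in-memory copy is general DOM behaviour and can be shown without XMLUnit. The sketch below is not XMLUnit's actual copying code; the class and its helper methods are hypothetical and use only the JAXP classes in the JDK. It parses a document with an XML declaration and then copies its root element into a freshly created document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/** Illustrative only: an in-memory DOM copy carries no XML encoding. */
class InMemoryCopyDemo {

    /** Parses XML from bytes so that the parser itself reads the XML declaration. */
    static Document parse(String xml) {
        try {
            byte[] xmlBytes = xml.getBytes(StandardCharsets.UTF_8);
            return newBuilder().parse(new ByteArrayInputStream(xmlBytes));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Copies the root element into a new document created in memory. */
    static Document copyInMemory(Document original) {
        Document copy = newBuilder().newDocument();
        copy.appendChild(copy.importNode(original.getDocumentElement(), true));
        return copy;
    }

    private static DocumentBuilder newBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }

    public static void main(String[] args) {
        Document parsed =
            parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?><text>String</text>");
        Document copy = copyInMemory(parsed);

        System.out.println("Parsed document encoding: " + parsed.getXmlEncoding());
        System.out.println("Copied document encoding: " + copy.getXmlEncoding());
    }
}
```

The parsed document reports the declared encoding, while the copy reports null, since a document created in memory has no XML declaration to take the encoding from.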

In the first test method, comparedIdenticalDifferentEncoding, two XML fragments that are identical except for the character encoding are compared; first the structure and contents of the XML fragments are compared using XMLUnit, then the character encodings of the documents are compared. Thus the most appropriate method I have found is to write code in the test methods that explicitly retrieves and compares the character encodings of the XML fragments that are to be compared.

The second test method, compareDifferentSameEncoding, shows what the comparison of two XML fragments that really are different looks like.

The compareXmlDocumentsIdentical method is a helper method that, using XMLUnit, compares the two supplied DOM documents and lists any differences.

Finally, the readXmlFile method reads an XML fragment in a way that will make the character encoding of the XML fragment available in the resulting DOM document.
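
To illustrate what reading from a byte stream gives you, here is a small sketch that uses plain JAXP instead of XMLUnit.buildControlDocument; the class name and the parseBytes helper are assumptions made for this example. When the parser receives raw bytes, it reads the XML declaration itself, decodes the content accordingly and records the declared encoding in the document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustrative only: the declared encoding survives parsing from bytes. */
class DeclaredEncodingDemo {

    /** Parses raw bytes, leaving it to the parser to pick the character encoding. */
    static Document parseBytes(byte[] xmlBytes) {
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // å, ä and ö written as Unicode escapes, stored as ISO-8859-1 bytes.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
            + "<text>\u00e5\u00e4\u00f6</text>";
        Document declared = parseBytes(xml.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("Declared encoding: " + declared.getXmlEncoding());
        System.out.println("Characters intact: " + "\u00e5\u00e4\u00f6"
            .equals(declared.getDocumentElement().getTextContent()));

        // Without an XML declaration there is no encoding to report.
        Document undeclared =
            parseBytes("<text>String</text>".getBytes(StandardCharsets.UTF_8));
        System.out.println("Undeclared encoding: " + undeclared.getXmlEncoding());
    }
}
```

Note that getXmlEncoding returns null for a document without an XML declaration, so any code comparing encodings should be prepared for null values.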

Running the Test

When the test is run, the following is output to the console:

***** compareDifferentSameEncoding:
Read document with encoding: UTF-8
Read document with encoding: UTF-8
XML fragments are identical: false
XML fragments are similar: false
Differences: 
   Expected attribute value '3.14159E0' but was '3.12159E0' - comparing  at /CommandSheet[1]/commands[1]/move[1]/point[1]/@y to  at /CommandSheet[1]/commands[1]/move[1]/point[1]/@y
   Expected number of element attributes '2' but was '1' - comparing  at /CommandSheet[1]/commands[1]/jump[1]/destination[1] to  at /CommandSheet[1]/commands[1]/jump[1]/destination[1]
   Expected attribute name 'x' but was 'null' - comparing  at /CommandSheet[1]/commands[1]/jump[1]/destination[1] to  at /CommandSheet[1]/commands[1]/jump[1]/destination[1]
   Expected text value 'String' but was 'Strong' - comparing String at /CommandSheet[1]/commands[1]/transmit[1]/text[1]/text()[1] to Strong at /CommandSheet[1]/commands[1]/transmit[1]/text[1]/text()[1]

***** comparedIdenticalDifferentEncoding:
Read document with encoding: UTF-8
Read document with encoding: ISO-8859-1
XML fragments are identical: true
XML fragments are similar: true
Differences: 
PASSED: compareDifferentSameEncoding
PASSED: comparedIdenticalDifferentEncoding

===============================================
    Default test
    Tests run: 2, Failures: 0, Skips: 0
===============================================

===============================================
Default suite
Total tests run: 2, Failures: 0, Skips: 0
===============================================

Note that:

  • The compareDifferentSameEncoding method reads two XML documents with the same character encoding, namely UTF-8.
  • In the compareDifferentSameEncoding method, the compared XML fragments are neither identical nor similar.
  • Four differences between the compared XML fragments are reported from the compareDifferentSameEncoding method.
    Note that the information for each difference is quite detailed, including XPath expressions specifying the XML nodes compared.
  • The comparedIdenticalDifferentEncoding method reads two XML documents with different character encodings: UTF-8 and ISO-8859-1.
  • The XML fragments compared in the comparedIdenticalDifferentEncoding method are reported to be both similar and identical.
  • No differences are listed from the comparedIdenticalDifferentEncoding method.

Conclusion

While XMLUnit is a capable library, it was not able to accommodate my desire to make comparison of character encodings part of its comparison process.
This is not a big drawback, since you can incorporate comparison of character encodings in helper methods such as compareXmlDocumentsIdentical.
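
As a rough sketch of what such a helper could look like, the class below compares the declared encodings of two DOM documents in a null-safe way. It is a hypothetical addition, not part of XMLUnit or of the test class above, and it assumes that comparing the encoding names case-insensitively is good enough for your tests.

```java
import org.w3c.dom.Document;

/** Hypothetical helper, not part of XMLUnit: compares declared XML encodings. */
class EncodingComparison {

    /** Returns true if both documents declare the same encoding, or if neither declares one. */
    static boolean sameDeclaredEncoding(Document expected, Document actual) {
        return sameEncodingName(expected.getXmlEncoding(), actual.getXmlEncoding());
    }

    /** Null-safe, case-insensitive comparison of two encoding names. */
    static boolean sameEncodingName(String first, String second) {
        if (first == null || second == null) {
            // getXmlEncoding() returns null when no encoding is declared or known.
            return first == null && second == null;
        }
        return first.equalsIgnoreCase(second);
    }

    public static void main(String[] args) {
        System.out.println(sameEncodingName("UTF-8", "utf-8"));      // true
        System.out.println(sameEncodingName("UTF-8", "ISO-8859-1")); // false
        System.out.println(sameEncodingName(null, "UTF-8"));         // false
    }
}
```

The check has to be made on the documents as they were read, before XMLUnit has had a chance to create its in-memory copies.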
