Wednesday, November 29, 2006

XmlDiff

I had to compare two XML files today. This is trickier than you might at first think. Luckily Microsoft has a neat tool, XmlDiff (in the Microsoft.XmlDiffPatch namespace), that makes it quite easy. It can compare two files, two XmlReaders, or even fragments, and it can spit out an XML diffgram so that you can examine and re-apply any changes. Just use it like this...

[Test]
public void XmlDiffTest()
{
    string source = "<root><child1>some text</child1><child2>more text</child2></root>";
    // note some whitespace, child nodes in different order, comments
    string target = "<root> <!-- I'm a comment --> <child2>more text</child2> " + 
        "<child1>some text</child1>  </root>"; 

    XmlReader expected = XmlReader.Create(new StringReader(source));
    XmlReader actual = XmlReader.Create(new StringReader(target));
    StringBuilder differenceStringBuilder = new StringBuilder();
    XmlWriter differenceWriter = XmlWriter.Create(new StringWriter(differenceStringBuilder));

    XmlDiff diff = new XmlDiff(XmlDiffOptions.IgnoreChildOrder |
        XmlDiffOptions.IgnoreComments |
        XmlDiffOptions.IgnoreWhitespace);

    // Compare returns true when the documents match (given the options above)
    bool identical = diff.Compare(expected, actual, differenceWriter);
    Assert.IsTrue(identical, string.Format(
        "expected response and actual response differ:\r\n{0}", differenceStringBuilder));
}

Friday, November 24, 2006

Playing with the XmlSerializer

Have you ever worked on one of those projects where everything is a huge XML document and the code is littered with string literals containing XPath queries? It's a nasty hole to dig yourself into. It's easy to end up with a very brittle, fragile application where the structure of your data is baked into hundreds of string-literal XPath queries that aren't checked until run time and are a nightmare to change if your data structure changes. You lose all the benefits of OO, with no refactoring or encapsulation, and condemn yourself to a life of stepping through the debugger and examining the watch window, trying to work out which of your hundreds of concatenated XPath queries isn't quite right.

There's a better way, and that is to use xsd.exe to generate classes that match your XML schema and then deserialize your XML into the generated object graph. You can then work with .NET types rather than an amorphous XML document, with all the compile-time type checking, IntelliSense and other benefits that brings. Xsd.exe comes with Visual Studio; you can find it and learn how to use it by opening the Visual Studio command prompt and typing 'xsd /?'. Serialization and deserialization are handled by System.Xml.Serialization.XmlSerializer.
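For example, assuming your schema lives in a file called MySchema.xsd (a placeholder name, as are the namespace and output directory), generating C# classes from the Visual Studio command prompt looks something like this:

```shell
# generate C# classes from an XML schema with xsd.exe
# MySchema.xsd, the namespace and the output directory are placeholder names
xsd MySchema.xsd /classes /language:CS /namespace:MyCompany.Messages /outputdir:GeneratedCode
```

The generated .cs file contains partial classes decorated with the serialization attributes the XmlSerializer needs, so you can feed an instance straight to XmlSerializer.Deserialize.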

The project I'm currently working on requires our piece to communicate with a very complex web service whose WSDL describes more than 380 different types. We initially decided not to deserialize the XML because the serialization process was taking about 7 seconds, an unacceptably long time. Now of course we've dug ourselves into exactly the situation I've described above so I decided to look into the XmlSerializer in a little more depth.

The first thing I did was try to simplify the schema. Although the WSDL's XSD describes those 380 types, the message we actually send only uses a subset, so I trimmed back the object model to just the types we actually need. It's easy to do: I just commented out the properties that we don't need, and since many of these properties are complex types themselves, this often has the benefit of trimming whole branches off the object graph. Doing this I managed to get the serialization process down to just under 5 seconds. But 5 seconds is still too long.

The next thing was to do some investigation into where the time was being taken up. I wrote a little test that timed the creation of the serializer and the serialization and deserialization processes:

[Test]
public void SerializationTest()
{
    // initialize
    XmlSerializer serializer = null;
    Time("Serializer create", delegate()
    {
        serializer = new XmlSerializer(typeof(MyComplexType));
    });
    
    string inputXmlPath = GetPath(_inputFileName);
    MyComplexType myComplexType = null;

    // deserialize
    Time("Deserialize", delegate()
    {
        using(FileStream stream = new FileStream(inputXmlPath, FileMode.Open, FileAccess.Read))
        {
            myComplexType = (MyComplexType)serializer.Deserialize(stream);
        }
    });

    string outputXmlPath = GetPath(_outputFileName);

    // serialize
    Time("Serialize", delegate()
    {
        using(FileStream stream = new FileStream(outputXmlPath, FileMode.Create, FileAccess.Write))
        {
            serializer.Serialize(stream, myComplexType);
        }
    });
}

private delegate void Function();
private void Time(string description, Function function)
{
    // note: DateTime.Now only has ~15ms resolution; System.Diagnostics.Stopwatch would be more precise
    DateTime start = DateTime.Now;
    function();
    DateTime finish = DateTime.Now;
    Console.WriteLine("{0} elapsed: {1}", description, finish - start);
}

The results were as follows:

Serializer create elapsed: 00:00:04.6040060
Deserialize elapsed: 00:00:00.6242720
Serialize elapsed: 00:00:00.1872816

So you can see that the majority of the time is taken by the construction of the serializer itself. What's it doing? I read the docs and did a bit of Googling and found this excellent series of blog posts by Scott Hanselman all about the XmlSerializer.

It turns out that the XmlSerializer emits an assembly containing a custom serializer for your type when you call its constructor. If you configure your tests with the following config section:

<configuration>
  <system.diagnostics>
    <switches>
      <add name="XmlSerialization.Compilation" value="1"/>
    </switches>
  </system.diagnostics>
</configuration>

then step through the code above, stop after the XmlSerializer constructor is called, and you can find the generated .cs file in your user temp directory (on my machine that's at C:\Documents and Settings\<username>\Local Settings\Temp). You can even load it into Visual Studio, set a breakpoint and debug into it. At first I thought, OK, so I'll just create one serializer and cache it for the lifetime of the application, but after reading Scott's posts I discovered that the XmlSerializer has caching built in. Here's a little test to demonstrate:

[Test]
public void SerializerCachingTest()
{
    XmlSerializer serializer = null;

    for(int i = 0; i < 5; i++)
    {
        Time(string.Format("Creating Serializer {0}", i), delegate()
        {
            serializer = new XmlSerializer(typeof(MyComplexType));
        });
    }
}

Which spits out:

Creating Serializer 0 elapsed: 00:00:05.2907052
Creating Serializer 1 elapsed: 00:00:00
Creating Serializer 2 elapsed: 00:00:00
Creating Serializer 3 elapsed: 00:00:00
Creating Serializer 4 elapsed: 00:00:00

Cached indeed!

The next thing that concerned us was possible contention from multiple threads all trying to use the same cached XmlSerializer concurrently. I wrote a test to kick off ten deserialization requests on ten threads, time them all and time the total elapsed time of the test, here it is:

[Test]
public void ConcurrencyTest()
{
    XmlSerializer serializer = new XmlSerializer(typeof(MyComplexType));
    RunDeserializerHandler deserializerDelegate = new RunDeserializerHandler(RunDeserializer);

    Time("Total", delegate()
    {
        List<IAsyncResult> asyncResults = new List<IAsyncResult>();
        for(int i = 0; i < 10; i++)
        {
            asyncResults.Add(deserializerDelegate.BeginInvoke(serializer, i, null, null));
        }
        foreach(IAsyncResult asyncResult in asyncResults)
        {
            deserializerDelegate.EndInvoke(asyncResult);
        }
    });
}

delegate void RunDeserializerHandler(XmlSerializer serializer, int id);
private void RunDeserializer(XmlSerializer serializer, int id)
{
    string path = GetPath(_inputFileName);

    Time(string.Format("Deserialize {0}", id), delegate()
    {
        using(FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            MyComplexType myObject = (MyComplexType)serializer.Deserialize(stream);
        }
    });
}

Which gave the following result:

Deserialize 1 elapsed: 00:00:00.9333840
Deserialize 0 elapsed: 00:00:00.9333840
Deserialize 3 elapsed: 00:00:00.3422408
Deserialize 2 elapsed: 00:00:00.8556020
Deserialize 5 elapsed: 00:00:00
Deserialize 6 elapsed: 00:00:00
Deserialize 4 elapsed: 00:00:00
Deserialize 7 elapsed: 00:00:00
Deserialize 8 elapsed: 00:00:00.0155564
Deserialize 9 elapsed: 00:00:00.0155564
Total elapsed: 00:00:00.9644968

Now this is very interesting. Not only is there no significant contention (the total time of the test is only slightly longer than the longest-running individual deserialization), but the deserializations also get much faster after the first few calls, presumably as the generated serialization code gets JIT-compiled and warmed up; some of the zero timings are also simply below DateTime.Now's resolution.

So after this investigation, it seems that we can use the XmlSerializer in a natural fashion, just constructing it where needed and serializing / deserializing as required. The first call to the constructor takes a performance hit, but subsequent uses should be pretty fast. It also looks like the XmlSerializer won't become a bottleneck as our application scales. All in all, pretty impressive.

Thursday, November 02, 2006

Using MemoryStream and BinaryFormatter for reusable GetHashCode and DeepCopy functions

Here's a couple of techniques I learnt a while back to add two important capabilities to your objects: computing a hash code and executing a deep copy. I can't find the original source for the hash code example, but the deep copy comes from Rockford Lhotka's CSLA. Both examples are my implementation of the basic idea. Both techniques utilise a MemoryStream and a BinaryFormatter to get the object to serialize itself to a byte array. To compute the hash code I use SHA1CryptoServiceProvider to create a 20-byte hash of the serialized object and then XOR an integer value out of it.

public override int GetHashCode()
{
    byte[] thisSerialized;
    using(System.IO.MemoryStream stream = new System.IO.MemoryStream())
    {
        new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter().Serialize(stream, this);
        thisSerialized = stream.ToArray();
    }
    byte[] hash = new System.Security.Cryptography.SHA1CryptoServiceProvider().ComputeHash(thisSerialized);
    uint hashResult = 0;
    for(int i = 0; i < hash.Length; i++)
    {
        hashResult ^= (uint)(hash[i] << i % 4);
    }
    return (int)hashResult;
}

The most common use for a hash code is to make hash tables efficient, and it can also be used to implement Equals(). Note that since two different objects can produce the same 32-bit hash, there's a roughly one in 2^32 (4,294,967,296) chance that this will report a false equality (thanks to Richard for pointing that out to me):

public override bool Equals(object obj)
{
    if(!(obj is MyClass)) return false;
    return this.GetHashCode() == obj.GetHashCode();
}

To do a deep copy I simply get the object to serialize itself and then deserialize the result as a new instance. Be careful: this technique will serialize everything in the object's graph, so make sure you're aware of what is referenced by it and that all the objects in the graph are marked [Serializable]. Here's a generic example that you can reuse in any object that needs deep copy:

public T DeepCopy<T>()
{
    T snapshot;
    using(System.IO.MemoryStream stream = new System.IO.MemoryStream())
    {
        System.Runtime.Serialization.Formatters.Binary.BinaryFormatter formatter = 
            new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
        formatter.Serialize(stream, this);
        stream.Position = 0;
        snapshot = (T)formatter.Deserialize(stream);
    }
    return snapshot;
}
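To use it, paste the method into a [Serializable] class; everything the object references gets copied too. Here's a minimal sketch (Person and its fields are hypothetical names, not from the code above):

```csharp
using System;

// hypothetical class demonstrating the DeepCopy<T> method above
[Serializable]
public class Person
{
    public string Name;
    public Person Friend;

    // the reusable DeepCopy method from the post
    public T DeepCopy<T>()
    {
        T snapshot;
        using(System.IO.MemoryStream stream = new System.IO.MemoryStream())
        {
            System.Runtime.Serialization.Formatters.Binary.BinaryFormatter formatter =
                new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
            formatter.Serialize(stream, this);
            stream.Position = 0;
            snapshot = (T)formatter.Deserialize(stream);
        }
        return snapshot;
    }
}

public class Program
{
    public static void Main()
    {
        Person original = new Person();
        original.Name = "Alice";
        original.Friend = new Person();
        original.Friend.Name = "Bob";

        Person copy = original.DeepCopy<Person>();
        copy.Friend.Name = "Carol"; // modifying the copy...

        Console.WriteLine(original.Friend.Name); // ...leaves the original untouched: prints "Bob"
    }
}
```

Because the whole graph round-trips through the BinaryFormatter, the copy shares no references with the original, which is exactly what you want for snapshot/undo scenarios.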

Wednesday, November 01, 2006

Nested files with 'DependentUpon' in Visual Studio

Here's a really neat trick discovered by my colleague Preet that you can use to show a relationship between files in Visual Studio, just like Microsoft does with partial 'designer' classes. You've probably noticed when you add a new Windows Form in VS 2005 that two files are created: MyForm.cs and MyForm.designer.cs. The designer file contains all the code that's generated by the form designer, and the other file is where you write your user code. Preet showed me that you can do the same thing by modifying the .csproj file (aka the MSBuild file) and adding a DependentUpon child element to any item that you want to appear below another. Here's a snippet of a .csproj that has three files, all partial classes of 'Foo'. In the solution explorer, Foo.1.cs appears nested below Foo.cs and Foo.1.1.cs appears below Foo.1.cs.

  
  <ItemGroup>
    <Compile Include="Foo.1.1.cs">
      <DependentUpon>Foo.1.cs</DependentUpon>
    </Compile>
    <Compile Include="Foo.1.cs">
      <DependentUpon>Foo.cs</DependentUpon>
    </Compile>
    <Compile Include="Foo.cs" />
    <Compile Include="Program.cs" />
    <Compile Include="Properties\AssemblyInfo.cs" />
  </ItemGroup>

The only way of doing this at the moment is by editing the .csproj file directly, but it's pretty cool, especially if you're writing code generators and GAT tools like I have been recently, where you might want generated content to be visually related to user-written code.