Introduction
Often the most difficult problems to resolve are those where we cannot reproduce internally the behavior or error that the customer experiences. Usually this is because we simply don’t have enough information and context (or specific sample data) to create a truly analogous reproduction.
For example, we recently had a customer case where a .NET XML error was seen in a dtrace log of our SharePoint Archiving Task. The error prevented certain items or libraries from archiving, but it did not give enough information to pinpoint the problem item:
(EvSharePointArchiveTask) <8868> EV-H {SPListWalker.ListFilesInFolder} Error listing files in the folder [folder] System.InvalidOperationException: There is an error in XML document (1, 63651). ---> System.Xml.XmlException: '|', hexadecimal value 0x0B, is an invalid character. Line 1, position 63651.| at System.Xml.XmlTextReaderImpl.Throw(Exception e)| at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)| at System.Xml.XmlTextReaderImpl.Throw(Int32 pos, String res, String[] args)| at System.Xml.XmlTextReaderImpl.ThrowInvalidChar(Int32 pos, Char invChar)| at System.Xml.XmlTextReaderImpl.ParseNumericCharRefInline(Int32 startPos, Boolean expand, BufferBuilder internalSubsetBuilder, Int32& charCount, EntityType& entityType)| at System.Xml.XmlTextReaderImpl.ParseNumericCharRef(Boolean expand, BufferBuilder internalSubsetBuilder, EntityType& entityType)| at System.Xml.XmlTextReaderImpl.HandleEntityReference(Boolean isInAttributeValue, EntityExpandType expandType, Int32& charRefEndPos)| at System.Xml.XmlTextReaderImpl.ParseAttributeValueSlow(Int32 curPos, Char quoteChar, NodeData attr)| …
Errors like this that include stack traces indicate .NET Common Language Runtime (CLR) exceptions.
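The message does tell us what kind of data to look for, even if not which item holds it. 0x0B is a vertical tab, which falls outside the character range XML 1.0 allows, and the ParseNumericCharRef frames suggest it arrived as a numeric character reference inside an attribute value. A minimal Python sketch of that check follows; the sample string and function names are made up for illustration, and this is not the code the archiving task runs:

```python
import re

def is_xml10_char(cp: int) -> bool:
    """True if code point cp is allowed by the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Matches numeric character references such as &#11; or &#x0B;
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def find_invalid_char_refs(xml_text: str):
    """Return (offset, reference, code point) for each illegal reference."""
    hits = []
    for m in _CHAR_REF.finditer(xml_text):
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if not is_xml10_char(cp):
            hits.append((m.start(), m.group(0), cp))
    return hits

# Hypothetical attribute value containing a vertical tab reference.
sample = '<item title="Q3&#x0B;report" />'
print(find_invalid_char_refs(sample))
```

A check like this only helps once you have the XML in hand, which is exactly what the log does not give us.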
Even though the process was not crashing (the exception was handled), we can still use a tool called DebugDiag to create memory dumps on specific types of first chance .NET exceptions.
This approach is only useful if the type of exception being investigated is fairly unique for the target process. If it’s one that is frequently expected and handled in the code, then many dumps would have to be produced and analyzed to find the correct instance of the problem (not good).
Using DebugDiag
For this demo I used DebugDiag 1.2 because it was handy, but there is a newer version, 2.0.
When the tool opens it will present a wizard that will help you create a new rule. Select Crash type. The next screen will let you choose a target type:
Fig 1: DebugDiag 1.2 target type selection. Choose A specific process
Fig 2: Select process (should have the .exe extension, unlike this)
Click Exceptions under Advanced Settings:
Fig 3: Advanced Configuration
On the next screen click the Add Exception button, which then shows:
Fig 4: Configure Exception with a specific .NET Exception Type
Click on the second item in the list, CLR (.NET) 1.0 – 3.5 Exception, which fills in the hex code E0434F4D. This code is used for all .NET exceptions for the given versions of the framework.
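As an aside, the hex code is easier to remember once you notice that its low three bytes are ASCII text. The short Python sketch below (the function name is my own, not part of DebugDiag) unpacks them:

```python
def decode_clr_exception_code(code: int) -> str:
    """Return the three ASCII characters packed into the low 24 bits."""
    return (code & 0xFFFFFF).to_bytes(3, "big").decode("ascii")

# 0xE0 prefix followed by the bytes 0x43 0x4F 0x4D
print(decode_clr_exception_code(0xE0434F4D))
```

For E0434F4D this gives "COM". To my knowledge the CLR 4.0 entry in the same list uses E0434352, which decodes to "CCR" the same way, so pick the entry that matches the framework version the target process runs on.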
The key here is filtering on the specific .NET exception type, which lets us trigger only on System.Xml.XmlException in this case. Then make sure the correct action type and limit are selected: we want a full userdump with a limit of 5.
I chose that exception type because it seemed to be the most specific error in the stack trace:
System.Xml.XmlException: '|', hexadecimal value 0x0B, is an invalid character. Line 1, position 63651
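When a log line wraps several exceptions, .NET prints the outer exception first and each inner one after a "--->" marker, so the last type name in the message is usually the most specific one to filter on. Here is a rough Python sketch of pulling the chain out of a dtrace line. The regular expression is my own approximation of a .NET type name, not something DebugDiag or dtrace provides:

```python
import re

# Dotted type names ending in "Exception" and followed by a colon,
# e.g. "System.Xml.XmlException:". Stack frames such as
# "Throw(Exception e)" do not match because they lack the dotted prefix and colon.
_EXC_TYPE = re.compile(r"\b((?:[A-Za-z_]\w*\.)+\w*Exception):")

def exception_chain(log_line: str):
    """Return exception type names in the order they appear (outermost first)."""
    return _EXC_TYPE.findall(log_line)

line = (
    "Error listing files in the folder [folder] "
    "System.InvalidOperationException: There is an error in XML document (1, 63651). "
    "---> System.Xml.XmlException: '|', hexadecimal value 0x0B, is an invalid character."
)

chain = exception_chain(line)
print(chain[-1])  # innermost exception type, the candidate for the filter
```

The innermost type is not always the best choice. If it is one the process throws and handles routinely, a less deeply nested type may make a better filter.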
After confirming the selection and finishing the wizard, you can enable the rule and DebugDiag will start monitoring the process for the exception type specified. It will tell you how many dumps it has collected and where they are located on disk.
After collecting the data, it’s important to disable the rule and also make sure Performance logging is turned off under Tools -> Options and Settings -> Performance Log.
Conclusion
DebugDiag can be very useful for getting memory dumps of certain types of .NET exceptions. The memory dumps can greatly speed up root cause analysis of the problem, allowing us to fully reproduce the behavior and validate potential workarounds and fixes.