
Troubleshooting vRealize Automation and MS DTC

An often-overlooked component of the vRA IaaS infrastructure is the Microsoft Distributed Transaction Coordinator (MS DTC). This post deals with common problems and troubleshooting techniques specifically targeting vRA. In my line of work, I often see that different organizations have different security and configuration standards. vRA has some strict DTC requirements, which ensure that all components will work, but many of you may require stricter security and non-standard configurations. Once the configuration process strays from the well-trodden trail, scary long-legged problems arise, strange things happen, and the more you try to resolve them the messier it gets. Suddenly, the whole situation resembles a dark and scary world from a sci-fi TV series. Welcome to the Upside Down.

Symptoms

How do we know that there’s something wrong with MS DTC? I know that something’s not right when I go to the infrastructure tab of the Automation Console and see ugly messages like these:

[Screenshot: MS DTC errors]

These messages come from the Manager Service, which is complaining that it has trouble executing queries. If you log onto the Manager server and open the All.log file, you will see a message much like this one:

“System.ApplicationException: Error executing query usp_SelectAgent ---> System.ApplicationException: Error executing query usp_SelectAgentCapabilities ---> System.Transactions.TransactionManagerCommunicationException: Communication with the underlying transaction manager has failed. ---> System.Runtime.InteropServices.COMException: The MSDTC transaction manager was unable to pull the transaction from the source transaction manager due to communication problems. Possible causes are: a firewall is present and it doesn’t have an exception for the MSDTC process, the two machines cannot find each other by their NetBIOS names, or the support for network transactions is not enabled for one of the two transaction managers. (Exception from HRESULT: 0x8004D02B)”

The real error message is a little longer, but I’ll spare you a few lines of .NET stack trace mumbo-jumbo for the sake of brevity. Of course, the average systems administrator would just restart the Manager Service in the hope that the service merely went berserk. Trust me, I do it too. Sometimes I even repeat it once or twice while mumbling voodoo spells and blowing magical dust, but more often than not it does not help. Once you’ve made sure the problem is not intermittent, take a closer look at the message, which clearly states that the MS DTC manager was unable to pull a transaction. This means we’ve got an “uh-oh” situation with the DTC on either the Manager server or the SQL server. Or both (scary).
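If you would rather not eyeball the log, here is a minimal Python sketch that looks for the HRESULT quoted above and pulls out the stored procedure names from the exception chain. It takes the log contents as a string and assumes each exception chain sits on a single line, which may not hold for your All.log; the function name is my own invention for illustration.

```python
import re

# HRESULT taken from the error message quoted above.
MSDTC_HRESULT = "0x8004D02B"

# Matches the "Error executing query <name>" fragments in the exception chain.
QUERY_PATTERN = re.compile(r"Error executing query (\w+)")


def find_dtc_failures(log_text):
    """Return (line_number, [query names]) for each line containing the MSDTC HRESULT.

    Assumption: the whole exception chain is on one line of the log text.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if MSDTC_HRESULT.lower() in line.lower():
            hits.append((number, QUERY_PATTERN.findall(line)))
    return hits
```

Feed it the text of the log, for example `find_dtc_failures(open(path_to_log).read())`, where `path_to_log` is wherever All.log lives on your Manager server. An empty list means this particular HRESULT did not show up; it says nothing about other DTC errors.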

 

Troubleshooting

  • The service should be running, and its settings should match the ones in this picture:
    [Screenshot: MS DTC service options]

I should mention something important here: the service should always, always run under the Network Service account. Some organizations love to mess with service accounts in order to strengthen security, but Network Service is an account that has already been stripped of all redundant roles and privileges and has just the minimum needed to run services properly. If you don’t trust me, ask Microsoft.

  • NetBIOS – check your DNS and WINS resolution, both forward and reverse. If either direction fails, the two machines cannot find each other by name, which is one of the possible causes listed in the error message.
  • Firewall. In the age of Twitter nobody reads past the 140th character, but if you take a closer look at the error message you’ll see a suggestion. It says you should revisit your firewall settings. For strange or legacy reasons the DTC relies on RPC. Two predefined Windows Firewall rules need to be enabled: Distributed Transaction Coordinator (TCP-In) for inbound and Distributed Transaction Coordinator (TCP-Out) for outbound communications. If you don’t have them predefined, just click on New Rule and type the path to the program you want to allow – %SystemRoot%\system32\msdtc.exe.
    [Screenshot: MS DTC firewall options]

    Don’t forget to open port 135 alongside all the ports between 1024 and 65535, because once a session is successfully established on 135, the communication continues over a random port. You can narrow this range by restricting the RPC dynamic port range on the servers. On a side note, there are two other inbound rules regarding DTC that you can configure, but both of them allow management of the DTC service through RPC and are not directly related to the workings of the Manager Service.
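The name resolution and firewall checks above can be roughed out in a few lines of Python. This is only a sketch under some assumptions: it uses whatever resolver the operating system provides (so it does not test WINS specifically), it only probes a single TCP port such as 135 rather than the dynamic range, and both function names are made up for illustration.

```python
import socket


def check_name_resolution(hostname):
    """Forward-resolve hostname, then reverse-resolve the address that came back."""
    result = {"hostname": hostname, "address": None, "reverse_name": None}
    try:
        result["address"] = socket.gethostbyname(hostname)
        result["reverse_name"] = socket.gethostbyaddr(result["address"])[0]
    except OSError:
        # Resolution failures (socket.gaierror, socket.herror) are OSError subclasses;
        # whatever could not be resolved simply stays None.
        pass
    return result


def can_connect(host, port=135, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run both from the Manager server against the SQL server's name, and then the other way around. A `None` in the result, or a reverse name that does not match the host you started with, points at DNS; a `False` from `can_connect` points at a firewall or a stopped service. A `True` on port 135 only proves the RPC endpoint mapper is reachable, not that the random follow-up port is open.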

  • DTC Trace log. Very often, reading this log will help you find a solution to your problems. In order to read the DTC log you will need a special tool called tracefmt.exe, which you can find in the Windows Driver Kit. Just make sure you get the 64-bit version. Copy this tool to C:\Windows\System32\MsDtc\Trace on the Manager server, but do it with the Command Prompt. Don’t try to mess with the folder’s permissions by going there with Windows Explorer. Now, restart the DTC service so you have a clean start and issue the following command:
    C:\Windows\System32\MsDtc\Trace\msdtcvtr.bat -MODE 1
  • DTCPing and DTCTester – all the info is on the Internet, so I won’t discuss their use here, but they can be very helpful.
  • DTC Settings – This KB perfectly describes the needed configuration on both your Manager server and SQL Server.
  • Well, almost perfectly – it does not mention Authentication. Here’s what you should do with regard to DTC authentication:
    • If your SQL Server is not part of a Failover Cluster, then you should select Mutual Authentication Required on both servers.
    • If your SQL Server is part of a Failover Cluster and you have a clustered DTC you should select Incoming Caller Authentication Required on the clustered DTC and the Manager Server. This is the generally recommended configuration for SQL Clusters.
    • If your SQL Server is part of a Failover Cluster and you do not have a clustered DTC you should select Mutual Authentication Required on each SQL node’s DTC and the Manager Server.
    • I mentioned Failover Cluster, and I should make something clear about the clustered DTC: when the Manager Server connects to a clustered SQL instance, it first connects to the SQL node’s local DTC, which redirects all subsequent traffic to the clustered DTC. This means that when you configure firewalls, DNS and networking, you should configure RPC and SQL access to each SQL Server node, the SQL Server Failover VIP and the clustered DTC VIP. This is easily overlooked, as people often allow traffic to the SQL Server instance but do not configure it for the DTC instance.

      Messy, huh? Just look at the picture below if my sentences sound like a verbal spaghetti bowl:
      [Diagram: clustered MS DTC]
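For those who think better in code than in bullet points, the authentication rules and the list of endpoints above boil down to something like the following Python sketch. It only restates the rules from this post; the function names and the host names in the usage note are hypothetical.

```python
def recommended_dtc_authentication(sql_clustered, clustered_dtc=False):
    """Return (authentication setting, where to apply it) for a given SQL topology."""
    if not sql_clustered:
        # Standalone SQL Server: same setting on both servers.
        return ("Mutual Authentication Required", ["SQL Server", "Manager Server"])
    if clustered_dtc:
        # Failover Cluster with a clustered DTC.
        return (
            "Incoming Caller Authentication Required",
            ["clustered DTC", "Manager Server"],
        )
    # Failover Cluster without a clustered DTC.
    return ("Mutual Authentication Required", ["each SQL node's DTC", "Manager Server"])


def rpc_and_sql_targets(sql_nodes, sql_vip, clustered_dtc_vip):
    """List every endpoint that needs RPC and SQL access when the DTC is clustered."""
    return list(sql_nodes) + [sql_vip, clustered_dtc_vip]
```

For a two-node cluster you might call `rpc_and_sql_targets(["sqlnode1", "sqlnode2"], "sqlvip", "dtcvip")` (all four names are placeholders) and walk the resulting list when you review firewall rules and DNS records, so the clustered DTC VIP does not get forgotten.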

To conclude, this is not a definitive guide to DTC troubleshooting and configuration. But, I think it can serve as a good start for people trying to get their head around how vRA and SQL Server communicate. It can be messy and dark but in the end it can also be very satisfying to know you have defeated yet another scary creature coming out of the IT parallel world.
