On a recent project, my colleagues and I were asked to write an application with quite a few moving parts, one of which was a fairly standard Windows service. In our WiX installer configuration, we specified that the NetworkService account should be used to start the service.
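The relevant WiX fragment looked roughly like the sketch below (the component ID, file names, and GUID here are placeholders, not our real project values):

```xml
<Component Id="ServiceComponent" Guid="PUT-GUID-HERE">
  <File Id="OurServiceExe" Source="OurService.exe" KeyPath="yes" />
  <!-- Install the service to run under the built-in NetworkService account -->
  <ServiceInstall Id="OurServiceInstall"
                  Name="OurService"
                  DisplayName="OurService"
                  Type="ownProcess"
                  Start="auto"
                  Account="NT AUTHORITY\NetworkService"
                  ErrorControl="normal" />
  <!-- Start on install; stop and remove on uninstall -->
  <ServiceControl Id="OurServiceControl"
                  Name="OurService"
                  Start="install"
                  Stop="both"
                  Remove="uninstall" />
</Component>
```

The ServiceControl element's Start="install" is what makes the installer attempt the service start, which is where the error described below appears.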
We tested this configuration on multiple systems, both physical and virtual, at our local site, and everything worked just as we expected. Awesome, ship it!
The trouble started once we delivered the build to our client's QA department. On their systems, on their test network, we saw the following error message when the installer reached the point where it tried to start the service:
Error 1920. Service OurService (OurService) failed to start. Verify that you have sufficient privileges to start system services.
What the heck? And then the real head scratching began.
Normally, when a Windows service fails to start with a message like this, it is due to a genuine permissions problem: either the service startup account lacks the necessary rights, or the user running the installer has insufficient privileges. So we first checked that the service install folders had the correct permissions, that the user running the installer had Admin privileges, and so on. Everything seemed to be in order.
Then we tried changing the service startup user to the account of the QA person doing the testing, and it worked! Aha, we thought, we're on the trail now. So we created a new user and gave that user the exact permissions the QA person's account had. And the service failed to start.
After much subsequent testing, we established the following list of facts:
1. The service would start only with the QA account.
2. The service would start if the LAN cable was unplugged (!?).
3. The service would start if the test system was on a different network.
4. The QA system was using a proxy; disabling or enabling it had no impact on the install (hmm!).
So now things were getting truly perplexing. To make things even more interesting, the QA user, in a last-ditch effort to troubleshoot the issue, ran software to monitor the active ports and reported that our service was trying to communicate with an external web address! Since there is no code in our service that talks to anything outside the network, and since our service wasn't even running yet, this made no sense at all.
This was the best clue we had, though, and kudos to that QA person for giving it a go. Looking at the addresses being contacted, we noticed two: one for Akamai, the other for Microsoft. One is a caching service and the other is… well, Microsoft.
And what was the request to the Microsoft site? GET /pki/crl/products/CSPCA.crl.
A little googling later, we realized what was going on.
http://softwareblog.morlok.net/tag/crl/ (a nice explanation of how the CRL check works)
Our service uses a lot of .NET 3.5 assemblies, and Microsoft has been digitally signing its assemblies for quite some time. As part of service startup, the .NET runtime loads these signed assemblies and checks the Certificate Revocation List (CRL) to make sure each signing certificate is still valid. With the LAN unplugged, .NET is smart enough to know that no network interface is available, and the service starts. But with a proxy in the way, the network interface is up yet unable to reach the CRL site, and the CRL request times out after 15 seconds.
But a Windows service needs to be responsive during startup, and 15 seconds is too long for it to just sit there waiting, so it failed with the 1920 error before the check could complete.
The fix? Thankfully very simple! Adding these 3 lines to the service's .config file disables the CRL check (in .NET 2.0 and later only):
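The three lines are the runtime block that turns off publisher-evidence generation, which is the documented switch for skipping the Authenticode CRL lookup at startup. Sketched here inside a minimal app.config; in the actual service, the surrounding configuration element already exists in the .exe.config file:

```xml
<configuration>
  <runtime>
    <!-- Skip Authenticode publisher-evidence generation, and with it the CRL lookup, at startup -->
    <generatePublisherEvidence enabled="false"/>
  </runtime>
</configuration>
```

This only disables the publisher-evidence check used for Code Access Security; it does not affect any certificate validation your service performs explicitly at runtime.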
Such a simple fix! What took so long to find it? There were several contributing factors:
1. Unfamiliarity with this issue beforehand (and if you have run into it once, you won't forget it!)
2. Our unfamiliarity with the customer testing environment
3. The customer's unfamiliarity with their own environment
4. Overly complex testing systems
5. Distance between the customer and ourselves
These all played a role in the time it took to find the real source of the problem rather than chasing ghosts.
Hopefully this will help someone else out there avoid this pitfall.