This bit a coworker and me just a couple of days ago. My coworker was attempting to deploy patches to a Windows Server 2008 R2 virtual server using VMware vCenter Protect, and the deployment failed repeatedly. We went through the troubleshooting steps we were already familiar with and had seen success with in the past:
- See if traffic will pass both from the vCenter Protect console to the troublesome VM and from the troublesome VM back to the vCenter Protect console (using ping; traffic did pass correctly).
- Reboot the troublesome VM. The problem remained.
- Reboot the VM running the vCenter Protect console. He did this twice due to an unrelated issue; the problem remained.
After a short while, my coworker checked the Services console (services.msc) and realized that the Server service was not running. Every attempt to start the service failed and presented this dialog:
Error 67: The network name cannot be found.
From there we proceeded to several other troubleshooting measures:
- Verify that the VM's IP address and DNS record matched, and that there were no additional IP addresses registered to the name in DNS. DNS records were what they should have been.
- Verify that the VM's DNS server IP addresses were correct. There was a mistake here; I corrected this but it didn't solve the problem.
- Try re-enabling NetBIOS over TCP/IP on a temporary basis. This didn't solve the problem; of course, we didn't expect it to!
- Change the IP address and ensure the new address was properly reflected in DNS. Again, no dice, although I again didn't expect this to have an effect. I set the IP address back to its original value.
- Google!
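The DNS check in the first bullet above can be sketched in a few lines of Python. This is only an illustration, not what we actually ran: the function name `dns_matches`, the host name, and the address are all hypothetical, and `socket.gethostbyname_ex` goes through the operating system's resolver, so it may consult sources other than the DNS server (the hosts file, for example).

```python
import socket


def dns_matches(hostname, expected_ip, resolve=socket.gethostbyname_ex):
    """Return (ok, addresses).

    ok is True only when the name resolves to exactly one address and
    that address is the expected one, i.e. no stray extra records.
    The resolve parameter exists so the lookup can be swapped out.
    """
    _name, _aliases, addresses = resolve(hostname)
    return addresses == [expected_ip], addresses


def fake_resolver(hostname):
    # Stand-in for a real lookup; returns the same tuple shape as
    # socket.gethostbyname_ex: (name, alias list, address list).
    return (hostname, [], ["10.0.0.5", "10.0.0.99"])


# Hypothetical names and addresses, for illustration only.
ok, addresses = dns_matches("testvm", "10.0.0.5", resolve=fake_resolver)
```

With the default resolver, `dns_matches("testvm", "10.0.0.5")` would perform a real lookup from whichever machine runs it.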
Of course, when everything you can reasonably think of fails, Google is the next resort. My coworker searched for a while and found a number of other suggestions, none of which helped. At this point I was letting him do everything himself and only tossing ideas his way here and there, as I was working on another project that I'll probably post about later. One of the more notable things he tried was to compare the Server service's registry settings (located at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\LanmanServer) on the troublesome VM with those on an identically configured VM that was working properly, and adjust accordingly. It seemed promising, as there were inconsistencies, but resolving them didn't help.
Eventually his Google well ran dry. He finally tried to replace svchost.exe and srvsvc.dll on the troublesome VM with the copies from the identical, working VM. Again, no dice. At this point, I decided it was time for me to take a look. I redid a few of the troubleshooting steps above just to confirm neither of us was completely nuts yet, then I tried to start the service to generate a log message so I could consult the Event Viewer (eventvwr.msc). I found this message:
Event ID 7023 from Service Control Manager
I decided it couldn't hurt to double-check the Google results, so I copied the exact text from that log message and ran it through Google. Interestingly enough, this Microsoft Knowledge Base article was the first result. (My coworker hadn't copied the entire message when he searched, so this article didn't appear in the first two or three pages of his results.) It seemed odd, but I fired up cmd.exe and ran the path command. Sure enough, at the tail end of the path was an entry that was a UNC path. I proceeded to check the system-wide environment variables through Control Panel and found that the system PATH variable did indeed include the UNC path, similar to what I've shown here:
UNC path in System PATH variable
I removed the UNC path and another invalid path that referenced a network drive letter, then rebooted the server. Magically, everything worked again.
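For anyone who wants to check for this ahead of time, here is a minimal Python sketch that flags UNC entries in a PATH-style string. It is not what I used (I just ran path in cmd.exe and read the output); the function name `find_unc_entries` and the sample server and share names are made up for illustration, and it only catches entries that start with two backslashes, so it would not flag the mapped network drive letter I also removed.

```python
def find_unc_entries(path_value, sep=";"):
    """Return the entries of a PATH-style string that look like UNC paths."""
    unc_entries = []
    for raw in path_value.split(sep):
        entry = raw.strip().strip('"')  # tolerate padding and quoted entries
        if entry.startswith("\\\\"):    # UNC paths begin with two backslashes
            unc_entries.append(entry)
    return unc_entries


# Hypothetical value; the server and share names are invented.
sample_path = r"C:\Windows\system32;C:\Windows;\\fileserver\apps\bin"
unc_found = find_unc_entries(sample_path)
print(unc_found)
```

On the server itself, passing `os.environ.get("PATH", "")` (after `import os`) checks the PATH the current process sees. That value combines the system and user variables, so a hit still means looking in Control Panel to see which of the two holds the entry.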
Now that everything worked again, it was time to figure out why it happened and how to prevent it from happening again. This server hosts a testing instance of an application and is thus very lightly used. According to our event log archive, the coworker I've mentioned here is the only person besides myself who had logged on to this server either interactively or via Remote Desktop in the entire calendar year, and his only logon events were from the day this issue occurred. Therefore, it had to have been something I did. This had been working just three weeks ago when he did the initial patch scan with vCenter Protect, so it had to be fairly recent activity.
I consulted my to-do list and found a potential culprit: I had installed a service pack for the application on the server just a week and a half ago. I then checked our notes related to the application and found that the two entries I removed from PATH were added as part of migrating the app from another server to this one. It appears that the person who did this migration (over a year ago!) forgot to remove these entries after the migration was complete. I still can't figure out why it took over a year for this to become a problem, but I'm confident this was the root cause.
In closing, I'd like to point out that Microsoft's article linked above states the following: "A system path that contains a UNC path may cause severe system problems and severe software problems. Therefore, a system path that contains a UNC path is unsupported." Given that this is unsupported and known to cause problems, shouldn't the control panel applet at least give a warning when a UNC path is entered into the system PATH variable?