Overview
I recently ran into a problem at a client site with a new Windows Server 2008 domain controller: network services, including DNS, would become unavailable after roughly 30 days of uptime, causing many other services to fail. Rebooting the server fixed the problem. This recurred over a period of months, towards the end of every month. Interestingly, the event log contained no errors relating to the DNS Server service itself, even though failing name resolution appeared to be a major part of the problem.
Issue
Quoted from Microsoft:
When software that uses Transport Driver Interface (TDI) drivers, such as some antivirus software, is installed on a Windows Server 2008 system or on Windows Vista Service Pack 1 (SP1) system, the handle count of the system process keeps increasing. This problem occurs if the Windows Server 2008 system or the Windows Vista Service Pack 1 (SP1) system is running on a computer that has multiple processors. If this issue occurs for some time, the computer begins to run out of system resources. Therefore, any new Ancillary Function Driver for WinSock (AFD) connection to this computer fails.
This particular client was running a dual-core server with Trend Micro OfficeScan 8.0, which uses a TDI driver, so the environment matched the conditions described above. Microsoft explains the cause as follows:
This problem occurs because of a race condition in which the Tdx.sys driver does not send a disconnect input/output request packet (IRP) indication to the afd.sys driver. When this occurs, the reference count on the AFD socket is not decremented. Eventually, the AFD connection is orphaned. The process that owns the orphaned AFD connection is also orphaned.
After the issue occurs for some time, all available ports are consumed. Therefore, many orphaned processes appear. When resources become exhausted, the problem occurs that the “Symptoms” section describes.
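Since the first visible sign is a steadily climbing handle count on the System process, one way to catch this before the server falls over is to sample that count periodically and flag sustained growth. The following is a minimal Python sketch, not something from the Microsoft article. The function names, the PowerShell call used for sampling, and the 1000-handle threshold are all my own assumptions, and a strict "never drops" check is only a rough heuristic.

```python
import subprocess
from typing import Sequence

SYSTEM_PID = 4  # the System process runs as PID 4 on Windows


def sample_system_handles() -> int:
    """Return the current handle count of the System process (Windows only).

    Assumes PowerShell is on the PATH; this helper is illustrative.
    """
    out = subprocess.check_output(
        ["powershell", "-NoProfile", "-Command",
         f"(Get-Process -Id {SYSTEM_PID}).HandleCount"],
        text=True,
    )
    return int(out.strip())


def looks_like_leak(samples: Sequence[int], min_growth: int = 1000) -> bool:
    """Heuristic: True if the count never drops between samples and the
    total growth is at least min_growth (a hypothetical threshold)."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_growth
```

In practice you would call sample_system_handles() from a scheduled task (say, once an hour), keep the results, and pass the recent history to looks_like_leak(). The threshold needs tuning against what a healthy server in your environment looks like.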
Here are some of the errors and warnings that were logged:
Log Name: System
Source: NETLOGON
Date: 8/1/2010 11:47:51 PM
Event ID: 5775
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: -
Description:
The dynamic deletion of the DNS record '_ldap._tcp.-._sites.DomainDnsZones.-. 600 IN SRV 0 100 389 -' failed on the following DNS server:
DNS server IP address: 192.168.1.18
Returned Response Code (RCODE): 5
Returned Status Code: 10055
USER ACTION
To prevent remote computers from connecting unnecessarily to the domain controller, delete the record manually or troubleshoot the failure to dynamically delete the record. To learn more about debugging DNS, see Help and Support Center.
ADDITIONAL DATA
Error Value: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.

Log Name: System
Source: Microsoft-Windows-GroupPolicy
Date: 7/30/2010 10:17:00 PM
Event ID: 1054
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: -
Description:
The processing of Group Policy failed. Windows could not obtain the name of a domain controller. This could be caused by a name resolution failure. Verify your Domain Name Sysytem (DNS) is configured and working correctly.

Log Name: System
Source: NETLOGON
Date: 7/30/2010 6:35:41 PM
Event ID: 5807
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: -
Description:
During the past 4.25 hours there have been 6627 connections to this Domain Controller from client machines whose IP addresses don't map to any of the existing sites in the enterprise. Those clients, therefore, have undefined sites and may connect to any Domain Controller including those that are in far distant locations from the clients. A client's site is determined by the mapping of its subnet to one of the existing sites. To move the above clients to one of the sites, please consider creating subnet object(s) covering the above IP addresses with mapping to one of the existing sites.
The names and IP addresses of the clients in question have been logged on this computer in the following log file '%SystemRoot%\debug\netlogon.log' and, potentially, in the log file '%SystemRoot%\debug\netlogon.bak' created if the former log becomes full. The log(s) may contain additional unrelated debugging information. To filter out the needed information, please search for lines which contain text 'NO_CLIENT_SITE:'. The first word after this string is the client name and the second word is the client IP address.
The maximum size of the log(s) is controlled by the following registry DWORD value 'HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters\LogFileMaxSize'; the default is 20000000 bytes. The current maximum size is 20000000 bytes. To set a different maximum size, create the above registry value and set the desired maximum size in bytes.

Log Name: Directory Service
Source: Microsoft-Windows-ActiveDirectory_DomainService
Date: 7/31/2010 12:55:41 AM
Event ID: 1308
Task Category: Knowledge Consistency Checker
Level: Warning
Keywords: Classic
User: ANONYMOUS LOGON
Computer: -
Description:
The Knowledge Consistency Checker (KCC) has detected that successive attempts to replicate with the following directory service has consistently failed.
Attempts: 16
Directory service: CN= -
Period of time (minutes): 125
The Connection object for this directory service will be ignored, and a new temporary connection will be established to ensure that replication continues. Once replication with this directory service resumes, the temporary connection will be removed.
Additional Data
Error value: 1256 The remote system is not available. For information about network troubleshooting, see Windows Help.
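The most telling detail in these events is the status code 10055 in the NETLOGON 5775 entry. This is the Winsock error WSAENOBUFS, and it points at socket resource exhaustion rather than a DNS misconfiguration. If you have event text exported from several servers, a small script can pull that code out for you. This is a rough Python sketch only: the function name and the lookup table are my own, and the table deliberately lists just the one code relevant to this problem.

```python
import re
from typing import Optional

# Only the code relevant to this issue; 10055 is the Winsock error WSAENOBUFS.
WINSOCK_ERRORS = {10055: "WSAENOBUFS"}

# Matches the "Returned Status Code" field of a NETLOGON 5775 event.
STATUS_RE = re.compile(r"Returned Status Code:\s*(\d+)")


def winsock_status(event_text: str) -> Optional[str]:
    """Return the symbolic name for the status code in the event text,
    "unknown (<code>)" for codes not in the table, or None if the
    event has no status code field."""
    match = STATUS_RE.search(event_text)
    if match is None:
        return None
    code = int(match.group(1))
    return WINSOCK_ERRORS.get(code, f"unknown ({code})")
```

Seeing WSAENOBUFS on a server with a TDI-based antivirus product installed is a good hint to check the System process handle count before spending time on DNS itself.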
Solution
Microsoft has released a hotfix to rectify this issue, which is available here: http://support.microsoft.com/kb/961775