Smart IP (Formerly called Virtual IP or Cluster IP) Issues on Windows Server 2008 R2

Change in Gratuitous ARP Behavior Under Windows 2008 Breaks Fail-Over of the Smart IP Address


Microsoft has released the following hotfix for this issue:

http://support.microsoft.com/kb/2582281 


Please obtain the fix from Microsoft directly, as the hotfix package appears to change as newer security fixes are made to the OS.

As of May 7, 2013, this issue can be fixed with the hotfix. Please be sure that your Windows 2008 R2 installation is patched to SP1 and has the most recent security updates applied.


We have found that the change in Gratuitous ARP behavior under Windows 2008 prevents some routers (typically Cisco) from recognizing the Smart IP fail-over from one machine to another.

The key symptom is that the IP address fail-over DOES occur: we can ping and access the VIP address from other servers and machines that are on the same network, but from another subnet, across a router, the change appears not to have been detected. It will eventually be detected, but it can take minutes or hours before the router notices the change.

What is GARP?

A Gratuitous ARP (GARP) request is an Address Resolution Protocol (ARP) request packet in which the source and destination IP addresses are both set to the address of the machine issuing the packet, and the destination MAC is the broadcast address ff:ff:ff:ff:ff:ff. It causes other devices on the network to refresh their ARP caches.

Up to Windows Server 2003, this worked instantly because, when the address of a network interface changed, the server sent a Gratuitous ARP request so that other devices on the network, most importantly the router, would detect the change. When a router becomes aware of the change, it binds the IP address to the new MAC address and routes new traffic accordingly. This behavior was changed in Windows Server 2008, most likely as a security measure.

When Windows Vista or Windows Server 2008 sends a gratuitous ARP, the Sender Protocol Address (SPA) field in the initial request is set to 0.0.0.0. When Windows Vista or Windows Server 2008 receives such a request, it deliberately does not update its cache with the incorrect address 0.0.0.0. This way, the ARP or neighbor caches of systems receiving the request are not updated if the IP address turns out to be duplicated. Routers that treat the request the same way do not learn the new MAC address from it, which would explain the delayed fail-over described above.
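To make the packet layout described in this section concrete, here is a minimal Python sketch that builds (but does not send) the bytes of an ARP request frame. The function name build_arp_request and the example MAC address are hypothetical and chosen only for illustration. Leaving the target hardware address as zeros is a common convention and an assumption here, not something stated above. Actually putting such a frame on the wire requires a raw socket or a driver, which is not shown.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6   # Ethernet destination ff:ff:ff:ff:ff:ff
ETHERTYPE_ARP = 0x0806


def build_arp_request(sender_mac, target_ip, sender_ip=None):
    """Return a 42-byte Ethernet frame carrying an ARP request.

    With sender_ip left as None, the sender (SPA) and target (TPA)
    addresses are identical, which is the classic gratuitous ARP.
    Passing sender_ip="0.0.0.0" gives the SPA=0.0.0.0 form described
    for Windows Vista / Server 2008.
    """
    if len(sender_mac) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    tpa = socket.inet_aton(target_ip)
    spa = tpa if sender_ip is None else socket.inet_aton(sender_ip)

    # Ethernet header: destination MAC, source MAC, EtherType
    ethernet = struct.pack("!6s6sH", BROADCAST_MAC, sender_mac, ETHERTYPE_ARP)

    # ARP body: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, operation 1 (request),
    # then sender MAC, SPA, target MAC (zeros, assumed), TPA
    arp = struct.pack("!HHBBH6s4s6s4s",
                      1, 0x0800, 6, 4, 1,
                      sender_mac, spa, b"\x00" * 6, tpa)
    return ethernet + arp


mac = bytes.fromhex("001122334455")          # made-up example MAC
classic = build_arp_request(mac, "172.16.16.123")
zero_spa = build_arp_request(mac, "172.16.16.123", sender_ip="0.0.0.0")
```

In the resulting frame, bytes 28 to 31 hold the SPA and bytes 38 to 41 hold the TPA, so comparing classic[28:32] with zero_spa[28:32] shows the one field that differs between the two behaviors.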

Explanations of the GARP Changes in Windows 2008 and Vista

A possible code example for "home brewing" a network driver that solves this issue by sending GARP in Windows is here: http://msdn.microsoft.com/en-us/library/aa366358(VS.85).aspx

Immediate Workaround Information


At least in our Cisco-based environment, the following were found to work without the use of Gratuitous ARP:
  • Manually move the IP address from one system to another using the Network Adapter configuration UI, or use the netsh IP configuration command (see later in this document).

    Same Subnet? What's That?

    This is a very simple and incomplete description, but it gives the basic idea of what a "subnet mask" is all about.
    • You see a network address 192.6.158.8
    • You then see a network address 192.6.158.112
    These are likely on the same subnet because 192, 6, and 158 all match and only the last number is different.

    A system at 192.170.1.12 is on a different subnet because the second number is different, and one at 192.6.155.18 is on a different subnet because the third number is different. Likewise, 192.6.159.18 is likely on a different subnet because 159 is not the same as 158.

    This basic idea is based on the class C subnet mask (255.255.255.0). Different networks use different netmasks, so this explanation is not always true, but most networks use the class C mask. For the complete explanation, you will need to study the subnet mask concept.
  • Ping the Smart IP continuously from a machine that is on the same subnet but has never hosted the Smart IP. For example, using a witness server, issue a series of pings to the Smart IP. This has not been completely reliable.
Note that ARP operates at the data link layer (Layer 2) of the network, and ARP packets do not cross from one network to another. This is why this works only on the same subnet.
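
The "same subnet" rule of thumb described above can be checked exactly once the netmask is known. The following Python sketch uses the standard ipaddress module. The helper name same_subnet is hypothetical, and the default mask of 255.255.255.0 is only the class C assumption from the explanation above; substitute the real netmask of your network.

```python
import ipaddress


def same_subnet(addr_a, addr_b, netmask="255.255.255.0"):
    """Return True if both addresses fall in the same network for the given mask."""
    # strict=False lets us pass a host address and have the host bits masked off
    net_a = ipaddress.ip_network(f"{addr_a}/{netmask}", strict=False)
    net_b = ipaddress.ip_network(f"{addr_b}/{netmask}", strict=False)
    return net_a == net_b


# Addresses taken from the explanation above
print(same_subnet("192.6.158.8", "192.6.158.112"))   # only the last number differs
print(same_subnet("192.6.158.8", "192.6.159.18"))    # third number differs
# With a wider mask, 158 and 159 land in the same network after all
print(same_subnet("192.6.158.8", "192.6.159.18", netmask="255.255.254.0"))
```

The last call illustrates why the simple "compare the first three numbers" rule is not always true: with a netmask of 255.255.254.0, the addresses ending in 158.8 and 159.18 belong to the same subnet.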

CISCO Router and ASA IOS Information


For most Cisco routers, the default ARP cache timeout (out of the box) appears to be 4 hours (14400 seconds). This means that if we wait 4 hours, the Smart IP will eventually be routed correctly. It is possible to clear the ARP cache immediately by entering the following IOS command on the router:

clear arp

To change the default timeout value the IOS command is:

arp timeout <number of seconds>

For example, this sets the timeout to 30 seconds:

arp timeout 30


Alternative to Using the Smart (Cluster) IP


Before considering the alternatives, please check whether applying the hotfix (http://support.microsoft.com/kb/2582281) resolves the issue.
After applying the hotfix on both servers, please test the fail-over and send data to the active server after each fail-over.

If this technique fails, here are the alternatives:
  • Program two destinations with the formal IP addresses of each server in the ultrasound machine and train the sonographers to try sending to either address. Note that if both servers are up and running, the data can be sent to either one. Thus, if the sonographers notice a problem, they can try sending to the alternate address at that time. If the change is known in advance, you can of course advise them of it. The advantage of this is that, so long as the servers are functional, the send will succeed.

    Note also that the Smart IP destination can still be kept, in effect giving three destinations: Smart IP, Server1, and Server2.

  • If you have router programming expertise and your router can perform Software Load Balancing (SLB), then Imorgon can install a web-based SLB agent that tells the router the status of the active server. In this situation, the Network Address Translation from the active server's formal IP address to the Smart IP address is handled in the router. The advantages of this approach are that sonographers do not need to change destinations and that transparent, automatic server fail-over is well supported.

    Note that Imorgon does not provide router programming consulting.

Script Based Addition and Removal of Smart IP Address


Windows provides the netsh command for configuring the Ethernet interface. Note that this command requires an elevated permission level (i.e., "Run as Administrator").

Example To Add a Smart IP Address of 172.16.16.123 to the Local Area Connection Interface.


netsh interface ip add address name="Local Area Connection" addr=172.16.16.123 mask=255.255.255.0 gateway=172.16.16.1 gwmetric=1

Example To Remove a Smart IP Address of 172.16.16.123


netsh interface ip delete address name="Local Area Connection" addr=172.16.16.123
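
For scripting the two commands above, here is a small Python sketch that only builds the netsh command lines in the same form as the examples; it does not execute anything. The function names are hypothetical, and the address checks merely confirm that the values are well-formed dotted-quad strings.

```python
import ipaddress


def add_smart_ip_command(interface, addr, mask, gateway=None, gwmetric=1):
    """Build the 'netsh ... add address' line for a Smart IP."""
    # Reject malformed values before they end up in a command line
    ipaddress.IPv4Address(addr)
    ipaddress.IPv4Address(mask)
    cmd = (f'netsh interface ip add address name="{interface}" '
           f'addr={addr} mask={mask}')
    if gateway is not None:
        ipaddress.IPv4Address(gateway)
        cmd += f" gateway={gateway} gwmetric={gwmetric}"
    return cmd


def remove_smart_ip_command(interface, addr):
    """Build the 'netsh ... delete address' line for a Smart IP."""
    ipaddress.IPv4Address(addr)
    return f'netsh interface ip delete address name="{interface}" addr={addr}'


# Reproduce the two examples from this section
print(add_smart_ip_command("Local Area Connection", "172.16.16.123",
                           "255.255.255.0", gateway="172.16.16.1"))
print(remove_smart_ip_command("Local Area Connection", "172.16.16.123"))
```

For a manual fail-over, one would typically run the delete line on the server giving up the Smart IP and then the add line on the server taking it over, each from an elevated prompt.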

Link Layer Topology Discovery Mapper Affecting Smart IP Automatic Configurations


The Imorgon Server has a cluster management service called Imorgon Server Monitor Service. It automatically detects a cluster fail-over and assigns the Cluster IP address of the DICOM Storage SCP endpoint. This address is often called a "Smart IP", "Cluster IP", or "Floating IP" address in various Imorgon literature and customer communications.

Why Is This Needed?


Imorgon hosts fully mirrored servers for its high availability solution (HAS). With modern software, a client accessing the mirrored servers can automatically detect the available server and establish communication. In most modern Internet computing scenarios, this type of fail-over can be handled either by a load-balancing router or by an intelligent DNS server directing traffic transparently to whichever server is alive. However, most modalities, especially older systems, can neither detect this type of condition automatically nor use DNS.

How Does This Work?


To address this issue, each Imorgon server hosts multiple IP addresses, one of which is used as the DICOM endpoint address that DICOM modalities point to. Should one server need to be taken down, or otherwise fail to operate, the other servers can automatically detect the down condition, and the DICOM endpoint address is activated on the surviving server.

When this condition occurs, the Imorgon Server Monitor Service uses the Microsoft WMI object to add or remove the IP address from the server's network interface (e.g., the Ethernet interface or a virtual machine's network interface).
This IP address is configured in an Imorgon database table.

Why This Does Not Work on Some Windows 2008 R2


On some installations, the Link-Layer Topology Discovery features are turned on for the TCP/IP interfaces. We have found that when the Link-Layer Topology Discovery Mapper I/O Driver is enabled, the WMI command is ignored.

To fix this issue, open the TCP/IP control panel properties of the network connection and disable the Link-Layer Topology Discovery Mapper I/O Driver feature.


