Hi All,
We have an Exchange 2010 SP2 setup with an active/passive datacenter model. The primary site has 5 Mailbox servers and the DR site has 3 Mailbox servers.
All the servers are configured as a single DAG. Each server has two NICs, one for MAPI and one for replication.
We frequently face node network communication issues between the primary and DR sites. Please find below the events that occurred during the cluster failover issue.
Event 1177:
==========
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Event 1135:
==========
Cluster node 'TCSCLPRMBX03' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
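For context on the "quorum was lost" message in Event 1177, here is a rough sketch of the majority arithmetic. It assumes all 8 Mailbox servers hold a vote and that the DAG uses a file share witness reachable from the primary site; neither is confirmed in this post, so the vote counts may differ in the actual cluster. The function names are only illustrative.

```python
def votes_needed(total_votes):
    """Smallest strict majority of the configured votes."""
    return total_votes // 2 + 1


def has_quorum(reachable_votes, total_votes):
    """True if the reachable voters form a majority."""
    return reachable_votes >= votes_needed(total_votes)


# Assumption: 8 voting nodes (5 primary + 3 DR) plus 1 witness vote.
TOTAL_VOTES = 8 + 1

print(votes_needed(TOTAL_VOTES))       # 5
print(has_quorum(3, TOTAL_VOTES))      # False: the 3 DR nodes on their own
print(has_quorum(5 + 1, TOTAL_VOTES))  # True: 5 primary nodes plus the witness
```

Under these assumptions, a node cut off from the majority would stop its Cluster service with Event 1177, while the surviving side would log Event 1135 for each node it can no longer reach.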
When I dig into the cluster log, I see what looks like a network-level issue between the nodes across the datacenters.
Can you please help me rectify this? The cluster log excerpt is below.
00000b64.0000093c::2012/11/22-05:14:38.987 INFO [CORE] Node 3: Proposed View is <viewchanged joiner="false" form="false" newView="47306(1 2 3 5 6 7)" oldView="47203(1 2 3 5 6 7)" joiners="()" downers="()"/>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO [RGP] Node 3: Stable_`0 => Opening`1
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <class mscs::detail::ConsensusMessage>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <senderid>3 </senderid>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <bestepochseen>473</bestepochseen>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <laststableview>47203(1 2 3 5 6 7) </laststableview>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <proposedview>47306(1 2 3 5 6 7) </proposedview>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <stage>Opening`1 </stage>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <instage>() </instage>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <instageprev>(3) </instageprev>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <joiners>() </joiners>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <trimmednodes>() </trimmednodes>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <innerscreen>(1 2 3 5 6 7) </innerscreen>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <pruningresult>(1 2 3 5 6 7) </pruningresult>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <matrix>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <connectivitymatrix>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <row id="0" 00000000000000000000000000000000/>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO </connectivitymatrix>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO </matrix>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO <gemblob><counted_ptr p="nullptr"></counted_ptr></gemblob>
00000b64.0000093c::2012/11/22-05:14:38.987 INFO </class>
00000b64.00001434::2012/11/22-05:14:38.987 INFO [RGP] Node 3: Timer Tick Started
00000b64.00001434::2012/11/22-05:14:38.987 INFO [RGP] Node 3: stage Opening`1 missing nodes (1 2 5 6 7)
00000b64.00001434::2012/11/22-05:14:38.987 INFO [RGP] Node 3: Timer Tick Ended
00000b64.000010cc::2012/11/22-05:14:38.987 INFO [RGP] Node 3: incoming node 1 is in stage Opening`1(1) oldView:47203(1 2 3 5 6 7) proposed:47306(1 2 3 5 6 7)
00000b64.000010cc::2012/11/22-05:14:38.987 INFO [RGP] Node 3: merging inStage (3) + (1) = (1 3) (sender 1 in stage Opening`1)
00000b64.000010cc::2012/11/22-05:14:38.987 INFO [RGP] Node 3: stage Opening`1 missing nodes (2 5 6 7)
00000b64.00000468::2012/11/22-05:14:38.987 INFO [RGP] Node 3: incoming node 3 is in stage Opening`1(3) oldView:47203(1 2 3 5 6 7) proposed:47306(1 2 3 5 6 7)
00000b64.00000468::2012/11/22-05:14:38.987 INFO [RGP] Node 3: merging inStage (1 3) + (3) = (1 3) (sender 3 in stage Opening`1)
00000b64.00000468::2012/11/22-05:14:38.987 INFO [RGP] Node 3: stage Opening`1 missing nodes (2 5 6 7)
00000b64.00000ec0::2012/11/22-05:14:38.987 INFO [RGP] Node 3: incoming node 5 is in stage Opening`1(1 2 5 7) oldView:47203(1 2 3 5 6 7) proposed:47306(1 2 3 5 6 7)
00000b64.00000ec0::2012/11/22-05:14:38.987 INFO [RGP] Node 3: merging inStage (1 3) + (1 2 5 7) = (1 2 3 5 7) (sender 5 in stage Opening`1)
00000b64.00000ec0::2012/11/22-05:14:38.987 INFO [RGP] Node 3: stage Opening`1 missing nodes (6)
00000b64.00001460::2012/11/22-05:14:38.987 INFO [RGP] Node 3: incoming node 7 is in stage Opening`1(1 2 7) oldView:47203(1 2 3 5 6 7) proposed:47306(1 2 3 5 6 7)
00000b64.00001460::2012/11/22-05:14:38.987 INFO [RGP] Node 3: merging inStage (1 2 3 5 7) + (1 2 7) = (1 2 3 5 7) (sender 7 in stage Opening`1)
00000b64.00001460::2012/11/22-05:14:38.987 INFO [RGP] Node 3: stage Opening`1 missing nodes (6)
00000b64.00001100::2012/11/22-05:14:39.205 DBG [NETFTAPI] Signaled NetftRemoteUnreachable event, local address 10.11.x.x:003853 remote address 10.10.x.x:003853
00000b64.00001444::2012/11/22-05:14:39.205 INFO [IM] got event: Remote endpoint 10.10.x.x:~3343~ unreachable from 10.11.x.x:~3343~
00000b64.00001444::2012/11/22-05:14:39.205 INFO [IM] Marking Route from 10.11.x.x:~3343~ to 10.10.x.x:~3343~ as down
00000b64.00001444::2012/11/22-05:14:39.205 INFO [NDP] Checking to see if all routes for route (virtual) local fe80::9d8a:3616:e639:7812:~0~ to remote fe80::93f:e8a5:7d4f:32e4:~0~ are down
00000b64.00001444::2012/11/22-05:14:39.205 INFO [NDP] All routes for route (virtual) local fe80::9d8a:3616:e639:7812:~0~ to remote fe80::93f:e8a5:7d4f:32e4:~0~ are down
00000b64.00001444::2012/11/22-05:14:39.205 INFO [IM] Adding information for route Route from local 10.11.x.x:~0~ to remote 10.10.x.x:~0~, status: true, attributes: 0
00000b64.00001444::2012/11/22-05:14:39.205 INFO [IM] Adding information for route Route from local 10.11.x.x:~0~ to remote 10.10.x.x:~0~, status: true, attributes: 0
00000b64.00001444::2012/11/22-05:14:39.205 INFO [IM] Adding information for route Route from local 10.11.x.x:~0~ to remote 10.10.x.x:~0~, status: true, attributes: 0
00000b64.00001444::2012/11/22-05:14:39.205 INFO [IM] Adding information for route Route from local 10.11.x.x:~0~ to remote 10.10.x.x:~0~, status: false, attributes: 0
00000b64.00001444::2012/11/22-05:14:39.205 INFO [IM] Adding information for route Route from local 10.11.x.x:~0~ to remote 10.10.x.x:~0~, status: true, attributes: 0
00000b64.00001444::2012/11/22-05:14:39.205 INFO [IM] Sending connectivity report to leader (node 1): <class mscs::InterfaceReport>
00000b64.00001444::2012/11/22-05:14:39.205 INFO <frominterface>6466cff0-1e6e-4b58-ab61-17016b7651ea</frominterface>
00000b64.00001444::2012/11/22-05:14:39.205 INFO <upinterfaces><vector len="5">
00000b64.00001444::2012/11/22-05:14:39.205 INFO <item>6466cff0-1e6e-4b58-ab61-17016b7651ea</item>
00000b64.00001444::2012/11/22-05:14:39.205 INFO <item>79a2757c-81d6-4017-b105-03a60a99c693</item>
00000b64.00001444::2012/11/22-05:14:39.205 INFO <item>958754e4-0daa-431a-96e4-b4c16ff82a1e</item>
00000b64.00001444::2012/11/22-05:14:39.205 INFO <item>78e4e592-ea2f-407c-bd65-043ea9177366</item>
00000b64.00001444::2012/11/22-05:14:39.205 INFO <item>7c32b991-6d6b-4b54-96c8-da247fde3147</item>
00000b64.00001444::2012/11/22-05:14:39.205 INFO </vector>
00000b64.00001444::2012/11/22-05:14:39.205 INFO </upinterfaces>
00000b64.00001444::2012/11/22-05:14:39.205 INFO <downinterfaces><vector len="1">
00000b64.00001444::2012/11/22-05:14:39.205 INFO <item>53e90450-55d3-4099-ad3f-635feab9ec40</item>
00000b64.00001444::2012/11/22-05:14:39.205 INFO </vector>
00000b64.00001444::2012/11/22-05:14:39.205 INFO </downinterfaces>
00000b64.00001444::2012/11/22-05:14:39.205 INFO <viewid>47203</viewid>
00000b64.00001444::2012/11/22-05:14:39.205 INFO </class>
00000b64.00001428::2012/11/22-05:14:39.205 INFO [CORE] Node 3: executing node 6 failed handlers on a dedicated thread
00000b64.00001428::2012/11/22-05:14:39.205 INFO [NODE] Node 3: Cleaning up connections for n6.
00000b64.00001428::2012/11/22-05:14:39.205 INFO [MQ-MBX02] Clearing 0 unsent and 2 unacknowledged messages.
00000b64.00001428::2012/11/22-05:14:39.205 INFO [NODE] Node 3: n6 node object is closing its connections
00000b64.00001428::2012/11/22-05:14:39.205 INFO [NODE] Node 3: closing n6 node object channels
00000b64.000012bc::2012/11/22-05:14:39.205 INFO [CHANNEL fe80::93f:e8a5:7d4f:32e4%16:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_IO_PENDING(997)
00000b64.000012bc::2012/11/22-05:14:39.205 WARN [PULLER MBX02] ReadObject failed with GracefulClose(1226)' because of 'channel to remote endpoint fe80::93f:e8a5:7d4f:32e4%16:~3343~ is closed'
00000b64.000012bc::2012/11/22-05:14:39.205 ERR [NODE] Node 3:Connection to Node 6 is broken. Reason GracefulClose(1226)' because of 'channel to remote endpoint fe80::93f:e8a5:7d4f:32e4%16:~3343~ is closed'
00000b64.00001428::2012/11/22-05:14:39.205 INFO [CORE] Node 3: Clearing cookie d50143c9-54fb-4196-8647-dca73fc2828c
00000b64.00001428::2012/11/22-05:14:39.205 INFO [CORE] Node 7f1345c8-5ccc-40a8-b58a-a18f932b8bd7: Cookie Cache 7f1345c8-5ccc-40a8-b58a-a18f932b8bd7
00000b64.00001428::2012/11/22-05:14:39.205 INFO [CORE] Node fd545d14-31a8-41a7-995d-7184f91bd310: Cookie Cache fd545d14-31a8-41a7-995d-7184f91bd310
00000b64.00001428::2012/11/22-05:14:39.205 INFO [NETFT] Route <struct mscs::FaultTolerantRoute>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <reallocal>10.11.x.x:~3343~</reallocal>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <realremote>10.10.x.x:~3343~</realremote>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <virtuallocal>fe80::9d8a:3616:e639:7812:~0~</virtuallocal>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <virtualremote>fe80::93f:e8a5:7d4f:32e4:~0~</virtualremote>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <delay>1000</delay>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <threshold>5</threshold>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <priority>20100</priority>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <attributes>2147483649</attributes>
00000b64.00001428::2012/11/22-05:14:39.205 INFO </struct>
00000b64.00001428::2012/11/22-05:14:39.205 INFO removed
00000b64.00001428::2012/11/22-05:14:39.205 INFO [NETFT] Route <struct mscs::FaultTolerantRoute>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <reallocal>192.168.x.x:~3343~</reallocal>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <realremote>192.168.1.x:~3343~</realremote>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <virtuallocal>fe80::9d8a:3616:e639:7812:~0~</virtuallocal>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <virtualremote>fe80::93f:e8a5:7d4f:32e4:~0~</virtualremote>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <delay>61000</delay>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <threshold>5</threshold>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <priority>2100</priority>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <attributes>2147483651</attributes>
00000b64.00001428::2012/11/22-05:14:39.205 INFO </struct>
00000b64.00001428::2012/11/22-05:14:39.205 INFO removed
00000b64.00001428::2012/11/22-05:14:39.205 INFO [NETFT] Route <struct mscs::FaultTolerantRoute>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <reallocal>192.168.x.x:~3343~</reallocal>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <realremote>10.10.x.x:~3343~</realremote>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <virtuallocal>fe80::9d8a:3616:e639:7812:~0~</virtuallocal>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <virtualremote>fe80::93f:e8a5:7d4f:32e4:~0~</virtualremote>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <delay>61000</delay>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <threshold>5</threshold>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <priority>11100</priority>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <attributes>2147483651</attributes>
00000b64.00001428::2012/11/22-05:14:39.205 INFO </struct>
00000b64.00001428::2012/11/22-05:14:39.205 INFO removed
00000b64.00001428::2012/11/22-05:14:39.205 INFO [NETFT] Route <struct mscs::FaultTolerantRoute>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <reallocal>10.11.x.x:~3343~</reallocal>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <realremote>192.168.1.x:~3343~</realremote>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <virtuallocal>fe80::9d8a:3616:e639:7812:~0~</virtuallocal>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <virtualremote>fe80::93f:e8a5:7d4f:32e4:~0~</virtualremote>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <delay>61000</delay>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <threshold>5</threshold>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <priority>11100</priority>
00000b64.00001428::2012/11/22-05:14:39.205 INFO <attributes>2147483651</attributes>
00000b64.00001428::2012/11/22-05:14:39.205 INFO </struct>
00000b64.00001428::2012/11/22-05:14:39.205 INFO removed
00000b64.00001428::2012/11/22-05:14:39.205 INFO [NODE] Node 3: Pausing queue sending for n6.
00000b64.00001428::2012/11/22-05:14:39.205 INFO [RGP] Node 3: suspecting nodes (6) [+ (6)]
00000b64.00001428::2012/11/22-05:14:39.205 INFO [RGP] Node 3: will suspect nodes (6) [+ (6)] after regroup is over
00000b64.00001460::2012/11/22-05:14:39.268 INFO [RGP] Node 3: incoming node 7 is in stage Opening`1(1 2 3 5 7) oldView:47203(1 2 3 5 6 7) proposed:47306(1 2 3 5 6 7)
00000b64.00001460::2012/11/22-05:14:39.268 INFO [RGP] Node 3: merging inStage (1 2 3 5 7) + (1 2 3 5 7) = (1 2 3 5 7) (sender 7 in stage Opening`1)
00000b64.00001460::2012/11/22-05:14:39.268 WARN [RGP] Node 3: only local suspects are missing (6). moving to the next stage (shortcut compensation time 04.718)
00000b64.00001460::2012/11/22-05:14:39.268 INFO [RGP] Node 3: Opening`1 => Closing`2
00000b64.00001460::2012/11/22-05:14:39.268 INFO <class mscs::detail::ConsensusMessage>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <senderid>3 </senderid>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <bestepochseen>473 </bestepochseen>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <laststableview>47203(1 2 3 5 6 7) </laststableview>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <proposedview>47306(1 2 3 5 6 7) </proposedview>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <stage>Closing`2 </stage>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <instage>() </instage>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <instageprev>(1 2 3 5 7) </instageprev>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <joiners>() </joiners>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <trimmednodes>() </trimmednodes>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <innerscreen>(1 2 3 5 7) </innerscreen>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <pruningresult>() </pruningresult>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <matrix>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <connectivitymatrix>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <row id="0" 00000000000000000000000000000000/>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <row id="1" 00000000000000000000000000000000/>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <row id="2" 00000000000000000000000000000000/>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <row id="3" 00000000000000000000000010101010/>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <row id="4" 00000000000000000000000000000000/>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <row id="5" 00000000000000000000000010000110/>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <row id="6" 00000000000000000000000000000000/>
00000b64.00001460::2012/11/22-05:14:39.268 INFO <row id="7" 00000000000000000000000010101110/>
</connectivitymatrix></matrix></class>
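To see how often each route drops across a full cluster log, the unreachable-endpoint and broken-connection messages can be tallied with a short script. This is only a minimal sketch: the patterns are derived from the message shapes visible in the excerpt above, the function name `summarize` is made up for illustration, and other log versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns based only on the message shapes seen in the excerpt above.
UNREACHABLE = re.compile(
    r"Remote endpoint (\S+?):~\d+~ unreachable from (\S+?):~\d+~"
)
BROKEN = re.compile(r"Node (\d+):\s*Connection to Node (\d+) is broken")


def summarize(log_text):
    """Count unreachable (local, remote) address pairs and broken node links."""
    routes = Counter(
        (m.group(2), m.group(1)) for m in UNREACHABLE.finditer(log_text)
    )
    links = Counter(
        (int(m.group(1)), int(m.group(2))) for m in BROKEN.finditer(log_text)
    )
    return routes, links


# Two messages copied from the excerpt; matching works whether or not
# each log entry sits on its own line.
sample = (
    "00000b64.00001444::2012/11/22-05:14:39.205 INFO [IM] got event: "
    "Remote endpoint 10.10.x.x:~3343~ unreachable from 10.11.x.x:~3343~ "
    "00000b64.000012bc::2012/11/22-05:14:39.205 ERR [NODE] Node 3:"
    "Connection to Node 6 is broken."
)

routes, links = summarize(sample)
for (local, remote), count in routes.items():
    print(f"{local} -> {remote}: unreachable {count} time(s)")
for (node, peer), count in links.items():
    print(f"node {node} lost node {peer}: {count} time(s)")
```

Running it against the cluster logs from several nodes, read in as text, should show whether the drops always involve the same address pair (here the 10.11.x.x to 10.10.x.x route) or are spread across networks.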