Replication writer failing

I have a performance issue that I believe is related to NetApp SnapManager for Exchange, though I don’t want to prejudice feedback. In case it isn’t related to this backup software, I wanted to get input from other Exchange experts. At the Mailbox level, our environment consists of two Mailbox servers in a production datacentre and one Mailbox server in an offsite datacentre, forming a three-node DAG. All databases are mounted on the two production servers, with the active copies split roughly evenly between them. The server OS is Windows Server 2008 R2 Enterprise SP1, and Exchange is 2010 SP2 RU3.

With no distinct pattern, the performance of either production Mailbox server drops to the point where the end-user experience is impacted. On investigation, the databases mounted on the affected server show as healthy and mounted. The passive copies of those databases, held on the other two servers, have high queue lengths. Navigating around the affected server is painfully slow. If I try to put the affected server into maintenance mode (to move the active databases gracefully), it fails with “…WARNING: An error occurred while communicating with the Microsoft Exchange Replication service…”. Running “vssadmin list writers” shows that the writer isn’t healthy. If I try to activate the databases on another server via the EMC, this fails with a similar error. The only option I have is to force the server to shut down.
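For anyone wanting to catch the writer going bad before users notice, here is a minimal Python sketch that parses “vssadmin list writers” output and flags any writer that isn’t stable. The sample text, the writer names in it, and the function names (parse_writers, unhealthy_writers, run_vssadmin) are my own assumptions for illustration; the sample is shaped like typical output, not captured from the affected server.

```python
import re
import subprocess

# Hypothetical sample shaped like typical `vssadmin list writers` output.
# Names, states, and errors are illustrative, not from the affected server.
SAMPLE_OUTPUT = """\
Writer name: 'Microsoft Exchange Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'Microsoft Exchange Replica Writer'
   State: [8] Failed
   Last error: Retryable error
"""


def parse_writers(text):
    """Turn vssadmin-style text into a list of writer dicts."""
    writers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Writer name: '(.*)'$", line)
        if match:
            current = {"name": match.group(1), "state": None, "last_error": None}
            writers.append(current)
        elif current is not None and line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Last error:"):
            current["last_error"] = line.split(":", 1)[1].strip()
    return writers


def unhealthy_writers(writers):
    """Return writers whose state is not Stable or that report an error."""
    return [
        w for w in writers
        if w["state"] is None
        or not w["state"].endswith("Stable")
        or w["last_error"] != "No error"
    ]


def run_vssadmin():
    """Run the real command; needs an elevated prompt on the Windows server."""
    result = subprocess.run(
        ["vssadmin", "list", "writers"], capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    for writer in unhealthy_writers(parse_writers(SAMPLE_OUTPUT)):
        print(writer["name"], writer["state"], writer["last_error"])
```

On the server itself, pass run_vssadmin() to parse_writers instead of the sample text. Note that a writer can legitimately sit in a non-stable state while a backup is running, so a check like this is only meaningful outside the backup window.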

I can see the errors outlined below, which I have researched. They don’t always coincide with a performance hit. The articles I found aren’t necessarily related to NetApp but do indicate backup issues.

Log Name:      Application
Source:        ESE BACKUP
Date:          07/11/2012 07:07:56
Event ID:      914
Task Category: General
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      XXXXXX
Description: Information Store (1508) The surrogate backup by XXXXXX has stopped with error 0xFFFFFFFF.

http://social.technet.microsoft.com/Forums/lt/exchange2010/thread/5f57f48c-9e65-4252-afd2-67c7ebd75a3c

“…Problem solved we switched to stream backup instead of a back with one session with net backup…”

Log Name:      Application
Source:        ESE
Date:          07/11/2012 07:07:56
Event ID:      215
Task Category: Logging/Recovery
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      XXXXXXXXXXX
Description: Information Store (1508) *********: The backup has been stopped because it was halted by the client or the connection with the client failed.

http://www.microsoft.com/technet/support/ee/transform.aspx?ProdName=Exchange&ProdVer=6.5.6940.0&EvtID=215&EvtSrc=ESE&LCID=1033

“…The most frequent cause of the error is that a 3rd party online Exchange-aware backup software program has problems…”

The third-party backup software uses the Microsoft Exchange Replication service. Backup failures may or may not be related to the failure of the writer; I can’t pinpoint an exact cause-and-effect relationship. What I believe is happening is that the backup software uses the writer and occasionally leaves it in a state that Exchange can’t recover from. As a result, this service can’t ship logs to the other DAG members, which accounts for the high queue lengths on the other two servers.
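One rough way to test the cause-and-effect question would be to line up the backup error timestamps against the times the performance drop was noticed. The Python sketch below does that under stated assumptions: the two events are the 914 and 215 entries quoted above, the incident times are made up purely for illustration, the one-hour window is arbitrary, and the dates are assumed to be day/month/year.

```python
from datetime import datetime, timedelta

# Assumes day/month/year, as the event log excerpts appear to use.
FMT = "%d/%m/%Y %H:%M:%S"


def parse(ts):
    """Parse an event-log style timestamp string."""
    return datetime.strptime(ts, FMT)


# (timestamp, event ID) pairs taken from the excerpts in this post.
EVENTS = [
    (parse("07/11/2012 07:07:56"), 914),  # ESE BACKUP warning
    (parse("07/11/2012 07:07:56"), 215),  # ESE error
]

# Hypothetical incident times, invented for illustration only.
INCIDENTS = [
    parse("07/11/2012 07:30:00"),
    parse("09/11/2012 14:00:00"),
]


def events_near(incident, events, window=timedelta(hours=1)):
    """Return events logged within `window` before the incident."""
    return [e for e in events if timedelta(0) <= incident - e[0] <= window]


if __name__ == "__main__":
    for incident in INCIDENTS:
        ids = [event_id for _, event_id in events_near(incident, EVENTS)]
        print(incident.strftime(FMT), "preceded by events:", ids)
```

If most incidents have a backup error shortly before them and few backup errors occur without an incident, that would support the theory; the opposite pattern would point elsewhere.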

I have had this issue before at my current company, and it was resolved by an update to NetApp SnapManager for Exchange. I also had similar issues in the past while working for a managed service company and ended up playing vendor tennis between Microsoft and NetApp. I am mindful, though, that a future RU or SP could resolve this. I recall numerous problems there that were resolved by Exchange 2010 SP1 RU4.

Any thoughts? The storage team who look after the NetApp and backup side of things don’t think there are any issues.
