/var/log/metasplo.it: April 2013

Tuesday, 30 April 2013

Microsoft Failover Cluster CSV Volume Disappear

We recently began experiencing some problems with the 3rd member of our Windows Failover Cluster. Our cluster consists of 3 servers running 2008 R2 SP1, running Failover cluster manager with a SAN backend. This SAN presents a number of Cluster Shared Volumes (CSV) to the servers, all of our data sits on these CSV's.

One afternoon our primary CSV went into redirected mode, this is a normal occurrence during backup operations, but no backup was scheduled and we were not able to turn off redirected mode. We had to schedule a short outage, fully power off the Hyper-V hosts and power back on. After a full investigation turned up nothing we put this down to an anomaly. Another 3 weeks later and the problem happened again, this time we scheduled a longer outage so we could investigate the problem more thoroughly.

During testing we discovered that one of the 3 hosts was causing the issue, when it was removed from the cluster, no problems, when it was in the cluster the CSV in question would randomly go into redirected mode. Logs of the SAN and Hyper-V hosts turned up nothing and all the cluster tests passed perfectly.

Unfortunately during our testing, we encountered a bigger problem. When bringing the faulty host back online for the 3rd time, the CSV itself disappeared on the 2 healthy hosts, the CSV was still visible on the 3rd host. We promptly removed the 3rd host from the cluster but the CSV did not reappear on the 2 healthy hosts.

What didn't work

We tried a number of processes to get the volume to re-appear.

Rescanning/refreshing in Disk Manager
Deleting and re-adding the CSV
Repairing the CSV
Restarting the Hyper-V hosts
Removing the faulty host from the SAN LUN zones

At this point we were a little worried, our primary CSV was displayed in windows as an empty disk (as above) With the Failover Cluster tools we checked out the DiskSignature and we were greeted with a grim 0.

Command: cluster resource VMs /priv

D VMs DiskSignature 0 (0x0)

Scanning the FailoverClustering event logs we turned up the following events:

Event ID: 1568

Source: FailoverClustering

Cluster physical disk resource 'VMs' cannot be brought online because the associated disk could not be found. The expected signature of the disk was 'F62FC592'. If the disk was replaced or restored, in the Failover Cluster Manager snap-in, you can use the Repair function (in the properties sheet for the disk) to repair the new or restored disk. If the disk will not be replaced, delete the associated disk resource.

and

Event ID: 1568

Source: FailoverClustering

Cluster disk resource 'VMs' found the disk identifier to be stale. This may be expected if a restore operation was just performed or if this cluster uses replicated storage. The DiskSignature or DiskUniqueIds property for the disk resource has been corrected.

This was repeated over and over, the Cluster was trying to repair the problem but not having any success.

The Solution

After reading this thread we noticed in the last post a user mentioned "a Microsoft tech fixed the problem, the disk first sector was corrupted" We decided a partition table scan and re-write were worth a shot.

Using testdisk I was able to successfully recover the volume by first analyzing the disk for partitions then writing the changes.

I then re-wrote the disk signature (which I found in the FailoverClustering logs, as per above) to the volume using the below command.

CLUSTER RESOURCE VMs DiskSignature F62FC592

The volume then successfully came online, phew and all within my outage window!

Monday, 8 April 2013

DPM 2012 SP1 replica inconsistent - datasource is owned by a different DPM server

Recently we took the leap of faith to DPM 2012 SP1 + Update Rollup 1. SP1 offers proper compatibility with SQL 2012 and Server 2012, products we have begun using within our organization The initial install and management went without a hitch, in fact it went eerily too well.

The follow month however wasn't such smooth sailing, within a few days a number of SQL data sources belonging to two different protection groups began failing with "DPM could not run the backup/recovery job for the data source because it is owned by a different DPM server.". The error description went on to say the "Owner DPM Server: ." claiming "." owned the DPM job.

This was an unusual error to receive as there has only ever been a single DPM server within the organization so the possibility of another DPM server owning the job was highly unlikely.

The problem in detail

The 5 data sources that were failing were all SharePoint data sources. We are using a Sharepoint 2010 Farm protection group (PG) and backing up any SQL resources that arn't covered in this PG with a simple SQL PG. "Sharepoint_Config" was one of the failing resources, as well as 4 SQL jobs "Application_Registry_Service", "Bdc_Service_DB", "Managed Metadata Service" and "PerformancePoint Service Application".

DPM complained that the "Replica is inconsistent" and attached the following detailed error description:

"The replica of SQL Server 2008 database SERVER\Application_Registry_Service on server.contoso.internal is inconsistent with the protected data source. All protection activies for data source will fail until the replica is synchronized with consistency check. You can recover data from existing recovery points, but new recovery points cannot be created until the replica is consistent.
For SharePoint farm, recovery points will continue getting created with the databases that are consistent. To backup inconsistent databases, run a consistency check on the farm. (ID 3106)
DPM could not run the backup/recovery job for the data source because it is owned by a different DPM server.
Data Source: SERVER\Application_Registry_Service
Owner DPM Server: . (ID 3184 Details: The file or directory is corrupted and unreadable (0x80070570))"

DPM suggests to take ownership, I attempted this, re-ran the consistency check and within 5 minutes received the errors messages again.

Our logs were also complaining of communication problems, which we initially put down to network issues, but this theory was quickly debunked as other data sources on the same server were successfully backing up.

FsmBlock.cs(178) 2DE6593E-B086-4002-9205-0A57B65BDC8E WARNING Backup.DeltaDataTransferLoop.CommonLoop : RAReadDatasetDelta, StatusReason = Error (StatusCode = -2147023671, ErrorCode = CommunicationProblem, workitem = 70aeaa93-b090-4a3c-bea0-c6fd1a1b4625)

01 TaskExecutor.cs(843) 2DE6593E-B086-4002-9205-0A57B65BDC8E FATAL Task stopped (state=Failed, error=RmAgentCommunicationError; -2147023671; WindowsHResult),

We also tried removing the Protect Group and re-adding it, checking SQL permissions, repairing the DPM agent, un-installing and re-installing the Agent, all of which failed.

The Solution

This was a very difficult one to troubleshoot, as SP1 was so new, no one had published details of experiencing similar problems.

Eventually we tracked down the issue to be ActiveOwner problems on the SQL server. The ActiveOwner files are located in "c:\program files\Microsoft Data Protection Manage\dpm\activeowner" on the server hosting the databases (SQL server being backed up). These ActiveOwner are used to manage ownership of databases, important for ensuring multiple DPM servers aren't attempting to backup/restore resources contemporaneously.

After opening the directory and locating the ActiveOwner files for the failing databases, we noticed they were all 0 KB, while healthy ActiveOwner files were 1 KB and contained the name of the owner DPM server.

1. Open "c:\program files\Microsoft Data Protection Manage\dpm\activeowner" on the database server.

2. Rename any 0 KB files to <name>.old

3. Run SetDpmServer from ""c:\program files\Microsoft Data Protection Manage\dpm\bin"Syntax: SetDpmServer.exe -DpmServerName <SERVER>
Replace server with the computer name of your DPM server.

4. Re-run your synchronization

This fix literally takes 3 minutes, yet it took us an entire week of investigation to come to this conclusion.