One afternoon our primary CSV (Cluster Shared Volume) went into redirected mode. This is normal during backup operations, but no backup was scheduled and we could not turn redirected mode off. We had to schedule a short outage, fully power off the Hyper-V hosts, and power them back on. After a full investigation turned up nothing, we put this down to an anomaly. Three weeks later the problem happened again, so this time we scheduled a longer outage to investigate it more thoroughly.
During testing we discovered that one of the three hosts was causing the issue. With that host removed from the cluster there were no problems; with it in the cluster, the CSV in question would randomly go into redirected mode. The SAN and Hyper-V host logs turned up nothing, and all the cluster tests passed.
Unfortunately, during our testing we ran into a bigger problem. When we brought the faulty host back online for the third time, the CSV itself disappeared from the two healthy hosts, although it was still visible on the third host. We promptly removed the third host from the cluster, but the CSV did not reappear on the two healthy hosts.
What didn't work
We tried a number of things to get the volume to reappear:
- Rescanning/refreshing in Disk Management
- Deleting and re-adding the CSV
- Repairing the CSV
- Restarting the Hyper-V hosts
- Removing the faulty host from the SAN LUN zones
At this point we were a little worried: our primary CSV was displayed in Windows as an empty disk. Using the Failover Cluster tools we checked the DiskSignature and were greeted with a grim 0.
Command: cluster resource VMs /priv
D VMs DiskSignature 0 (0x0)
Scanning the FailoverClustering event logs we turned up the following events:
Event ID: 1034
Source: FailoverClustering
Cluster physical disk resource 'VMs' cannot be brought online because the associated disk could not be found. The expected signature of the disk was 'F62FC592'. If the disk was replaced or restored, in the Failover Cluster Manager snap-in, you can use the Repair function (in the properties sheet for the disk) to repair the new or restored disk. If the disk will not be replaced, delete the associated disk resource.
and
Event ID: 1568
Source: FailoverClustering
Cluster disk resource 'VMs' found the disk identifier to be stale. This may be expected if a restore operation was just performed or if this cluster uses replicated storage. The DiskSignature or DiskUniqueIds property for the disk resource has been corrected.
These two events repeated over and over: the cluster kept trying to repair the problem but without any success.
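The expected signature quoted in the first event is the value worth noting down. As a minimal sketch, assuming the event message has already been exported as plain text, it could be pulled out with a short Python snippet. The function name `expected_signature` is hypothetical, and the pattern only matches the wording quoted above.

```python
import re
from typing import Optional

# Matches the wording of the "disk could not be found" event quoted above.
# Assumption: the signature is printed as up to 8 hex digits in single quotes.
_SIGNATURE_RE = re.compile(
    r"expected signature of the disk was '([0-9A-Fa-f]{1,8})'"
)


def expected_signature(message: str) -> Optional[int]:
    """Return the expected disk signature as an integer, or None if absent."""
    match = _SIGNATURE_RE.search(message)
    return int(match.group(1), 16) if match else None


event_text = (
    "Cluster physical disk resource 'VMs' cannot be brought online because "
    "the associated disk could not be found. The expected signature of the "
    "disk was 'F62FC592'."
)

signature = expected_signature(event_text)
print(f"0x{signature:08X}")  # prints 0xF62FC592
```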
The Solution
After reading this thread, we noticed that in the last post a user mentioned "a Microsoft tech fixed the problem, the disk first sector was corrupted". We decided a partition table scan and rewrite was worth a shot.
Using TestDisk, I was able to recover the volume by first analyzing the disk for partitions and then writing the recovered partition table back to the disk.
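To make the "first sector" remark concrete: on an MBR-partitioned disk, the first 512-byte sector holds both the partition table and the 4-byte disk signature that the cluster tracks. The Python sketch below decodes those fields from a sector that has already been read into memory. It assumes an MBR disk (not GPT), and the function name `inspect_mbr` is hypothetical. It does not read or write any real disk.

```python
import struct

SECTOR_SIZE = 512
SIGNATURE_OFFSET = 0x1B8   # 4-byte disk signature, little-endian
TABLE_OFFSET = 0x1BE       # four 16-byte partition entries
BOOT_MARKER = b"\x55\xaa"  # last two bytes of a valid MBR


def inspect_mbr(sector: bytes) -> dict:
    """Decode the disk signature and partition entries from an MBR sector."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least one 512-byte sector")
    (signature,) = struct.unpack_from("<I", sector, SIGNATURE_OFFSET)
    partitions = []
    for index in range(4):
        base = TABLE_OFFSET + 16 * index
        partition_type = sector[base + 4]
        start_lba, sector_count = struct.unpack_from("<II", sector, base + 8)
        if partition_type != 0:  # type 0 marks an unused entry
            partitions.append((partition_type, start_lba, sector_count))
    return {
        "signature": signature,
        "valid_marker": sector[510:512] == BOOT_MARKER,
        "partitions": partitions,
    }


# A zeroed sector decodes to signature 0 and no partitions.
print(inspect_mbr(bytes(SECTOR_SIZE)))
```

A wiped or damaged first sector decoding this way would be consistent with what we saw (an empty disk and a DiskSignature of 0), though we never confirmed exactly what had been overwritten.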
I then re-wrote the disk signature (which I found in the FailoverClustering logs, as per above) to the volume using the below command.
CLUSTER RESOURCE VMs /priv DiskSignature=0xF62FC592
The volume then came online successfully. Phew, and all within my outage window!