Monday, 8 April 2013

DPM 2012 SP1 replica inconsistent - datasource is owned by a different DPM server

Recently we took the leap of faith to DPM 2012 SP1 + Update Rollup 1. SP1 offers proper compatibility with SQL 2012 and Server 2012, products we have begun using within our organization. The initial install and management went without a hitch; in fact, it went eerily well.

The following month, however, wasn't such smooth sailing. Within a few days a number of SQL data sources belonging to two different protection groups began failing with "DPM could not run the backup/recovery job for the data source because it is owned by a different DPM server.". The error description went on to say "Owner DPM Server: .", claiming that "." owned the DPM job.

This was an unusual error to receive, as there has only ever been a single DPM server within the organization, so the possibility of another DPM server owning the job was highly unlikely.


The problem in detail

The 5 failing data sources were all SharePoint related. We are using a SharePoint 2010 Farm protection group (PG) and backing up any SQL resources that aren't covered by this PG with a simple SQL PG. "Sharepoint_Config" was one of the failing resources, along with 4 other SQL databases: "Application_Registry_Service", "Bdc_Service_DB", "Managed Metadata Service" and "PerformancePoint Service Application".

DPM complained that the "Replica is inconsistent" and attached the following detailed error description:
"The replica of SQL Server 2008 database SERVER\Application_Registry_Service on server.contoso.internal is inconsistent with the protected data source. All protection activities for data source will fail until the replica is synchronized with consistency check. You can recover data from existing recovery points, but new recovery points cannot be created until the replica is consistent.
For SharePoint farm, recovery points will continue getting created with the databases that are consistent. To backup inconsistent databases, run a consistency check on the farm. (ID 3106)
DPM could not run the backup/recovery job for the data source because it is owned by a different DPM server.
Data Source: SERVER\Application_Registry_Service
Owner DPM Server: . (ID 3184 Details: The file or directory is corrupted and unreadable (0x80070570))"

DPM suggests taking ownership. I attempted this and re-ran the consistency check, and within 5 minutes received the error messages again.

Our logs were also complaining of communication problems, which we initially put down to network issues. That theory was quickly debunked, as other data sources on the same server were backing up successfully:
FsmBlock.cs(178)        2DE6593E-B086-4002-9205-0A57B65BDC8E    WARNING    Backup.DeltaDataTransferLoop.CommonLoop : RAReadDatasetDelta, StatusReason = Error (StatusCode = -2147023671, ErrorCode = CommunicationProblem, workitem = 70aeaa93-b090-4a3c-bea0-c6fd1a1b4625)
01    TaskExecutor.cs(843)        2DE6593E-B086-4002-9205-0A57B65BDC8E    FATAL    Task stopped (state=Failed, error=RmAgentCommunicationError; -2147023671; WindowsHResult),
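As an aside, the StatusCode in those log lines is a signed 32-bit HRESULT, which is easier to search for in hex. The snippet below is a minimal Python sketch for decoding one; it is not part of DPM, and decode_hresult is just a hypothetical helper name.

```python
def decode_hresult(value: int) -> dict:
    """Split a signed or unsigned 32-bit HRESULT into its standard fields."""
    unsigned = value & 0xFFFFFFFF  # reinterpret a negative value as unsigned 32-bit
    return {
        "hex": f"0x{unsigned:08X}",
        "failure": bool(unsigned >> 31),        # top bit set means failure
        "facility": (unsigned >> 16) & 0x1FFF,  # 7 is FACILITY_WIN32
        "code": unsigned & 0xFFFF,              # Win32 error code when facility is 7
    }


if __name__ == "__main__":
    # StatusCode from the agent log, then the code from the DPM alert (ID 3184).
    for raw in (-2147023671, 0x80070570):
        print(raw, decode_hresult(raw))
```

For the log value -2147023671 this gives 0x800704C9 (facility 7, Win32 code 1225, which as far as we can tell is the "connection refused" error), and for 0x80070570 it gives Win32 code 1392, the code behind "The file or directory is corrupted and unreadable" in the alert text.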

We also tried removing the protection group and re-adding it, checking SQL permissions, repairing the DPM agent, and uninstalling and reinstalling the agent. None of these fixed the problem.



The Solution

This was a very difficult one to troubleshoot. SP1 was so new that no one had published details of similar problems.

Eventually we tracked the issue down to ActiveOwner problems on the SQL server. The ActiveOwner files are located in "c:\program files\Microsoft Data Protection Manager\dpm\activeowner" on the server hosting the databases (the SQL server being backed up). These ActiveOwner files are used to manage ownership of databases, which is important for ensuring multiple DPM servers aren't attempting to back up or restore the same resources at the same time.

After opening the directory and locating the ActiveOwner files for the failing databases, we noticed they were all 0 KB, while healthy ActiveOwner files were 1 KB and contained the name of the owner DPM server. To fix this:
1. Open "c:\program files\Microsoft Data Protection Manager\dpm\activeowner" on the database server.
2. Rename any 0 KB files to <name>.old
3. Run SetDpmServer from "c:\program files\Microsoft Data Protection Manager\dpm\bin".
   Syntax: SetDpmServer.exe -DpmServerName <SERVER>
   Replace <SERVER> with the computer name of your DPM server.
4. Re-run your synchronization.
This fix literally takes 3 minutes, yet it took us an entire week of investigation to come to this conclusion.
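
If there are many affected databases, steps 1 and 2 can be scripted. What follows is only a rough Python sketch, assuming the default agent install path and that Python is available on the database server; the function names are our own, not anything shipped with DPM. It defaults to a dry run that only lists what it would rename, and SetDpmServer still has to be run by hand afterwards.

```python
from pathlib import Path

# Assumed default location; adjust if the DPM agent is installed elsewhere.
ACTIVE_OWNER_DIR = Path(
    r"C:\Program Files\Microsoft Data Protection Manager\DPM\ActiveOwner"
)


def find_empty_owner_files(directory):
    """Return the 0-byte files in the directory, skipping ones already renamed."""
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_size == 0 and p.suffix != ".old"
    )


def rename_empty_owner_files(directory, dry_run=True):
    """Rename each 0-byte file to <name>.old; with dry_run, only report."""
    planned = []
    for path in find_empty_owner_files(directory):
        target = path.with_name(path.name + ".old")
        if not dry_run:
            path.rename(target)
        planned.append((path, target))
    return planned


if __name__ == "__main__":
    if ACTIVE_OWNER_DIR.is_dir():
        # Switch dry_run to False once the listed files look right.
        for old, new in rename_empty_owner_files(ACTIVE_OWNER_DIR, dry_run=True):
            print(f"{old.name} -> {new.name}")
    else:
        print(f"Directory not found: {ACTIVE_OWNER_DIR}")
```

Renaming rather than deleting keeps the original files around in case something needs to be rolled back.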

9 comments:

  1. James,

    Awesome. Thank you so much for posting this. I too have spent almost two weeks trying to get my AdminContent database to back up. You found the solution. Great job.

    -CR

  2. Thank you.. I have been looking at this problem, off and on, for over two weeks now... I should have known better and searched Google earlier

  3. Thanks, there were a few SP DBs I was having this trouble with. Looks like a challenging one to find.

    Cheers

  4. Thanks man! you saved me SO much time...

  5. Worked like a charm - spent a whole day trying to figure it out, until I ran across your article