Skype for Business – WARNING: Standard Edition Pool Failover Disaster

I have been setting up a Standard Edition pool pair for disaster recovery for a customer and wanted to share my experiences around failover. The deployment and migration of services, users and data from the legacy Lync installation went absolutely fine and without issues. I successfully paired the two Skype for Business Standard Edition servers together, both the backup and replication services were happily synchronising data.

A few prudent PowerShell commands to prove correct replication returned all values as expected. Once I was happy with the configuration, I wanted to perform a controlled failover from the primary to the backup pool, including users and the CMS, to prove the failover process worked as expected. At this point I would like to thank Chris Hayward (@WeakestLync) for warning me of a potential issue during failover that screws up your CMS.

It turns out that when performing the failover, Skype for Business leaves the CMS active on both servers! However, this is not immediately apparent, and I wanted to detail my experience in identifying that this is the case and what I had to do to resolve the issue. I don’t have any screenshots of the problem because I was too busy trying to resolve it, so I will do my best to explain.

Failing Over

Performing the failover, I followed the steps listed on TechNet (https://technet.microsoft.com/en-us/library/jj204678(v=ocs.15).aspx), as they worked fine in previous versions and the process is the same for Skype for Business.

When running the Invoke-CsManagementServerFailover cmdlet with the -WhatIf parameter, the results showed correctly that the CMS was on the primary server and would be failed over to the backup pool server.

Running Get-CsManagementStoreReplicationStatus returned TRUE for every server in the topology.

Running Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus returned the primary server as the Active Master and Active File Transfer Agent, with the backup server listed in the Active Replicas list, as expected.

Running Get-CsService -CentralManagement showed the Active flag as True for the primary server’s CMS and False for the backup server, as expected.

Downloading the current topology showed the primary server as the active CMS.

Running Get-CsBackupServiceStatus -PoolFqdn fe1.domain.local returned the server in a Normal state, and the same for the backup server.

To ensure that the CMS was properly up to date on both servers, I then ran Invoke-CsBackupServiceSync -PoolFqdn fe1.domain.local and checked for any replication issues in Event Viewer and with CLS logging using the HADR scenario. Everything looked positive.

One last invocation to ensure the servers were up to date was to force replication to the RTCLOCAL databases on each server by running the Invoke-CsManagementStoreReplication command.
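For reference, the checks described above can be gathered into one pre-failover health check. This is only a sketch using the example FQDNs from this post, run from the Skype for Business Server Management Shell:

```powershell
# Pre-failover health check sketch (example FQDNs from this post; run from
# the Skype for Business Server Management Shell)

# Replication should return TRUE for every server in the topology
Get-CsManagementStoreReplicationStatus

# The primary should be both Active Master and Active File Transfer Agent
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus

# The Active flag should be True only on the primary's CMS service
Get-CsService -CentralManagement

# The backup service should report a Normal state on both pools
Get-CsBackupServiceStatus -PoolFqdn fe1.domain.local
Get-CsBackupServiceStatus -PoolFqdn fe2.domain.local

# Dry run of the failover itself before committing to it
Invoke-CsManagementServerFailover -BackupSqlServerFqdn fe2.domain.local -BackupSqlInstanceName RTC -WhatIf
```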

Once I was absolutely sure everything was in order, having re-run the Get commands above to triple-check, I decided on Chris’s advice to take a backup of the XDS and LIS databases, just in case.

Export-CsConfiguration -FileName c:\cms.zip

Export-CsLisConfiguration -FileName c:\lis.zip

Now I went ahead and followed the TechNet procedure by setting the Edge server next hop to the backup server using the Set-CsEdgeServer -Identity edgepool.domain.local -Registrar fe2.domain.local command.

Next, I ran Invoke-CsManagementServerFailover -BackupSqlServerFqdn fe2.domain.local -BackupSqlInstanceName RTC -Force

Here is where the problems started…

During the failover, the verification process failed to verify the CMS on the backup server, with the following error:

“Backup Central Management Store state is Active, the expected status is Backup. Note that if the local replica is out of date, the topology document may be obsolete. Ensure that the local replica is up to date, and run Test Management Server Cmdlet. Central management server verification failed. Verification execution will be retried once a minute for 14 more minutes. Since Failover has already finished, the user can press Ctrl + c to end the current verification task at any time, and Failover will not be affected”

I let all the retries complete, but none succeeded.

I then ran the following commands to see what had actually happened and what state the CMS was in at that moment.

Running Get-CsManagementStoreReplicationStatus did not return any values at all.

Running Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus did not return any values at all.

Running Get-CsService -CentralManagement showed that the backup server was the ACTIVE server for the CMS

Running Get-CsManagementConnection returned the primary server as the ACTIVE CMS

Downloading the current topology showed the primary server STILL as the ACTIVE CMS.

Running Get-CsBackupServiceStatus -PoolFqdn fe1.domain.local returned the server in an Error state, and the same for the backup server.

So I double-checked the properties of the Active Directory Service Connection Point (SCP) for Skype for Business using ADSI Edit, under the Configuration context:

CN=<topology guid>,CN=Topology Settings,CN=RTC Service,CN=Services,CN=Configuration,DC=domain,DC=local

The msRTCSIP-BackEndServer attribute was set to the primary server fe1.domain.local/RTC
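Rather than opening ADSI Edit, the same attribute can be read from PowerShell. This is a sketch assuming the ActiveDirectory RSAT module is available; the filter simply looks for any object under the Topology Settings container (the DN path from above) that carries the msRTCSIP-BackEndServer attribute:

```powershell
# Read the CMS SCP from AD without ADSI Edit (assumes the ActiveDirectory
# RSAT module is installed; this is a read-only query)
Import-Module ActiveDirectory
$configDN = (Get-ADRootDSE).configurationNamingContext
Get-ADObject -SearchBase "CN=Topology Settings,CN=RTC Service,CN=Services,$configDN" `
             -LDAPFilter "(msRTCSIP-BackEndServer=*)" `
             -Properties msRTCSIP-BackEndServer |
    Select-Object Name, msRTCSIP-BackEndServer
```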

At this point I did a lot of panicking and head scratching, using various commands, restarting services, etc., trying to get the Active server to show the backup server and restart replication. By restarting the Replica Replicator Agent and File Transfer Agent services on both Front End servers, I managed to get some results back from the following commands:

Running Get-CsManagementStoreReplicationStatus returned every server’s replication status as FALSE

Running Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus returned values for the Active Replicas, but nothing for the Active Master Fqdn or Active File Transfer Agent Fqdn, so replication was never going to work.

Attempting to set the SCP using Set-CsManagementServer -Identity fe2.domain.local, although it did update the SCP in AD, did not set this server as the Active Master or Active File Transfer Agent.

At this point there were no errors being reported in the Lync application log and users had full feature access.

I decided then to take a look at the XDS database in SQL Server Management Studio to see what it was reporting as the master server. So I opened the database and the table dbo.Component.

In this table there were 3 entries. I was expecting only 2, as I have only 2 CMS servers! The entries showed the following:

Fqdn              Component  Registered
fe1.domain.local  Master     0
fe2.domain.local  Master     1
fe1.domain.local  Fta        1

How it should have looked:

Fqdn              Component  Registered
fe1.domain.local  Fta        1
fe2.domain.local  Master     1
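For anyone wanting to check this table without opening Management Studio, a read-only query along these lines should show the same data (assuming the sqlcmd client tools are installed on the Front End; look, don’t touch):

```powershell
# Read-only peek at the CMS component table on the primary's RTC instance
# (example FQDN from this post; do NOT modify this table by hand!)
sqlcmd -S "fe1.domain.local\RTC" -d xds -E -Q "SELECT Fqdn, Component, Registered FROM dbo.Component"
```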

So at this point it looked as though the XDS database was active on both nodes. Knowing I had a backup of this already, I decided to try to manipulate this table back into the expected state. What a bad move that was; it only made things worse, adding a new line entry like so:

Fqdn              Component  Registered
fe1.domain.local  Master     0
fe2.domain.local  Master     1
fe1.domain.local  Fta        1
fe1.domain.local  Master     1

Now faced with the total loss of the CMS database, I had no choice but to revert my changes and restore the CMS from the backup. The process below details my recovery steps:

  1. On the primary server, ran Set-CsManagementServer -Identity fe1.domain.local to update the SCP back to the primary server
  2. On the primary server, ran Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn fe1.domain.local -ForInstance RTC -Clean
  3. On the backup server, ran Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn fe2.domain.local -ForInstance RTC -Clean
  4. Stopped the replication services and backup service on both servers
  5. On the primary server, ran Import-CsConfiguration -FileName c:\cms.zip to import the CMS data from my backup
  6. On the primary server, ran Import-CsLisConfiguration -FileName c:\lis.zip to import the LIS data from my backup
  7. Ran Enable-CsTopology
  8. Launched the Skype for Business Deployment Wizard and then ran Step 1 to reinstall the Local Configuration store using the data from the CMS on the Primary Server
  9. Ran Step 2 Install / Remove components on the primary server
  10. Launched the Skype for Business Deployment Wizard and then ran Step 1 to reinstall the Local Configuration store using the data from the CMS on the backup Server
  11. Ran Step 2 Install / Remove components on the backup server
  12. Ran Get-CsManagementConnection, which showed the primary server as the active node
  13. Ran Get-CsService -CentralManagement, which showed the primary server as active for the CMS and the backup as false (expected)
  14. Started the backup and replica services on both front end servers
  15. Ran Invoke-CsBackupServiceSync -PoolFqdn fe1.domain.local
  16. Ran Invoke-CsManagementStoreReplication
  17. Ran Get-CsManagementStoreReplicationStatus and the results returned TRUE
  18. Ran Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus and the Active Master and Active File Transfer Agent were now set to the primary server
  19. Event Viewer showed no errors and replication was now happening OK
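For reference, the PowerShell portion of the recovery steps above looks like this when strung together (example FQDNs and backup paths from this post; the Deployment Wizard steps and service stop/starts still have to be done by hand on each server):

```powershell
# CMS recovery sketch - PowerShell portion only; Deployment Wizard steps 1
# and 2 must still be run locally on both Front End servers

# Point the SCP back at the primary, then rebuild a clean CMS database on each
Set-CsManagementServer -Identity fe1.domain.local
Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn fe1.domain.local -ForInstance RTC -Clean
Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn fe2.domain.local -ForInstance RTC -Clean

# With the replica and backup services stopped, re-import the saved config
Import-CsConfiguration -FileName c:\cms.zip
Import-CsLisConfiguration -FileName c:\lis.zip
Enable-CsTopology

# ...run Deployment Wizard steps 1 and 2 on each server, restart the
# services, then resynchronise and verify replication
Invoke-CsBackupServiceSync -PoolFqdn fe1.domain.local
Invoke-CsManagementStoreReplication
Get-CsManagementStoreReplicationStatus
```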

So the biggest lesson learned here: take a backup of the CMS before failing over the pool, just in case this happens to you. Without it, I am not sure I would still have been in a job!

Workaround Theory

As I am not the only one who has experienced this issue, it could be a problem with Skype for Business itself. I suspect that if I try to fail over the CMS again, the same problem will occur. So I have come up with a theory that I am going to attempt to qualify in a lab, but I welcome any suggestions.

1. Create a daily backup of the XDS and LIS databases and store them on the backup pool server (done with PowerShell), something like the script below, which gives me 5 points of recovery

# CMS backup script workaround
Import-Module SkypeForBusiness
# Set backup location (UNC paths must be quoted)
$backupfolder = "\\fe2.domain.local\CMS_BACKUP"
# Days to keep
$retention = 5
# Backup file names
$date = Get-Date -Format dd-MM-yy
$cmsfilename = "CMS-$date.zip"
$lisfilename = "lis-$date.zip"
# Backup store cleanup - remove backups older than the retention period
$limit = (Get-Date).AddDays(-$retention)
Get-ChildItem -Path $backupfolder -Recurse -Force |
    Where-Object { !$_.PSIsContainer -and $_.CreationTime -lt $limit } |
    Remove-Item -Force
# Export the CMS and LIS configuration to the backup share
Export-CsConfiguration -FileName "$backupfolder\$cmsfilename" -ErrorAction SilentlyContinue
Export-CsLisConfiguration -FileName "$backupfolder\$lisfilename" -ErrorAction SilentlyContinue

(Run Export-RgsConfiguration too if you have Response Groups set up.)
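To actually run this daily, a scheduled task is the obvious vehicle. A sketch, assuming the backup script above is saved to a hypothetical C:\Scripts\Backup-Cms.ps1 and the server has the ScheduledTasks module (Windows Server 2012 or later):

```powershell
# Register a daily 02:00 run of the CMS backup script as SYSTEM
# (C:\Scripts\Backup-Cms.ps1 is a hypothetical path - adjust to suit)
$action  = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\Scripts\Backup-Cms.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At 2am
Register-ScheduledTask -TaskName "CMS Daily Backup" -Action $action -Trigger $trigger `
    -User "SYSTEM" -RunLevel Highest
```

Note that SYSTEM reaches the UNC share as the computer account, so that account needs write access to the backup folder.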

2. When failing over to the backup pool, set the Edge server(s) next hop and run Invoke-CsPoolFailover to fail the users across.

3. Then repeat the steps to reinstall the CMS on the backup server in a clean state, and then reset the SCP. At this point Skype for Business should (in my mind) treat the backup server as the master.

4. When failing back, repeat the process on the primary.

I guess the best method here is to move the CMS database to a SQL cluster away from the Standard Edition servers, and that is probably going to be my recommendation to customers moving forward.

Anyway, the moral of this story: make sure you have a backup, and make sure you test failover (but be aware of this issue) in a controlled manner before having to rely on it for real. If anyone has any suggestions, wants to share their experiences or receives information from Microsoft about this, please share in the comments section below.


  1. Glad to hear I’m not the only one that had this happen. I was able to resolve it with this:
    invoke-csmanagementserverfailover -backupsqlserverFQDN -backupsqlinstancename rtc -force
    YMMV, but it’s what fixed it for me, after a very panicked two hours of banging my head against it.

  2. Same here on an EE pool pair – we have an always on availability group in site A and a single SQL server in site B (DR) with our CMS DB. Same type of issues happened during testing failover. This might not just be SE pools, and likely needs a patch somewhere IMO. I am surprised there isn’t more chatter about this given how long the product has been out.

  3. I’m thinking that we aren’t seeing more chatter about it because people just aren’t using Pool Pairing and those who are haven’t tested the failover.

  4. I had the exact same issue yesterday. First time I set up SE pool pairing on skype4b so far, everything else I have done was enterprise edition. Your article allowed me to recover it to a working state, since I could invoke-csmanagementserverfailover until I was blue in the face but it’ll always end up saying that the SCP in AD for the active CMS didn’t match what it expected, and that both copies were active. Thankfully I had taken a full backup in the morning of everything before I attempted failover (by pure chance, not because I thought this would fail).

  5. You shouldn’t have used the -force paramater in the invoke of the CMS, because the Primary CMS was still active and reachable. -Force must only be used when the primary pool that hosts the CMS is unavailable. (TechNet: Invoke-CsManagementServerFailover -Force: You should not use the Force parameter if you are running the cmdlet for purposes other than disaster recovery, as it will not account for replication during the failover. When the parameter is not used, the cmdlet will first make sure all replications are done, then set the source DB to read-only mode.)

    • Even when the primary is dead, it fails over to the secondary OK; bring the primary back online and fail back over, and without the -force parameter it fails. For me anyway.

  6. Has anyone actually logged a case with MS on this? I thought I was off my nut when I hit this about 4 months ago. Assuming my memory is working correctly, my resolution was to 1) shut down the FE nodes of the failed over pool, and 2) shut down failed pool SQL. Wait a bit and then bring SQL back up again, and then the FE’s, and normality restored. Was hoping CU1 was going to have resolved this major road block, but nothing about it in the KB’s, and as if I’m going to retest my process, whatever it was, I may have just been lucky.

    • I reported to the product team. But fell on deaf ears. People say not to use the -backupsqlserver parameter when both servers are online, but there are a couple of issues with that. One the command wants us to specify the parameter (mandatory), 2) even killing the primary server and failing over, sometimes worked for me, most of the time not. Failing back is the issue. when the backup (now primary) and backup (was primary) are both online. That fails 100% of the time for me.

      • That’s also my experience, I’ve been able to reproduce the same issue on different pools 100% of the time. As things stand I’m not even recommending pool pairing anymore, as it’s a total hit and miss to recover from a failover at the moment. Funny thing is that it used to work fine in Lync 2013…

    • Hi

      Yes, I checked the TechNet article, but TechNet is not always correct. When trying the command, it would not execute without specifying the backend SQL destination (as in, it prompted you for it and would not continue without it); that’s the point. Haven’t tried in CU1 yet; hopefully it is fixed.


  7. I performed an invoke-csmanagementserverfailover in a controlled failover (not sure why you were using the -sql switches; they’re only supposed to be used in the event of a DR where the primary is down), and it was successful. Woohoo. So I’m thinking something was fixed in CU1. After failing over (technically, failing back) I also had to rerun the “Setup or Remove Skype for Business Server Components” step in the Deployment Wizard. I tempted fate yet again and invoked back, and was successful again; it still took 4-6 mins with warnings that the CMS in AD doesn’t match the one in the topology, and that it will retry every minute for another 14 minutes. You can Ctrl+C to end this “verification” process; I chose not to do that…

    I treat this as a good sign. Just wished they mentioned it in the CU KB’s somewhere.

  8. Pushed my luck, tried it again in order to test actual user functionality. Failing over to the DR site was fine, but when I was done testing, the failback of the CMS failed like above. I tried your process unsuccessfully a couple of times before figuring out that I needed to break the SQL mirror for the XDS and LIS databases. (XDS for sure; I couldn’t run the -Clean job with it in a mirrored state.) Too much fun for 4 am…

    So it’s hit or miss whether a controlled failover will work or not, or the failback. The only thing I maybe did differently is that after I failed over, I didn’t run an Enable-CsTopology, which I had in the past for giggles.

  9. I had something similar on an Enterprise paired pool where for some reason I ended in a similar state for FTA and Master agents.

    I noticed that the Pool State wasn’t active for my pool that I brought up after the DR test so I just tried it this way and it worked:

    Set-CsRegistrarConfiguration -Identity XXXXX -PoolState active

    The Master and FTA agents picked it up from there and properly marked the pools as primary and backup afterwards.

  10. Hello Guys,

    This is pretty great and I think I’m in the same situation. However, my issue is closer to the instance name, as I have the default instance instead of RTC.

    Let’s assume you are using the default instance in this scenario: what would your -BackupSqlInstanceName value be? Is it “default”, leave it blank, or MSSQLSERVER? It doesn’t work either way.

    Invoke-CsManagementServerFailover -BackupSqlServerFqdn fe2.domain.local -BackupSqlInstanceName “” -Force

    Many Thanks

  11. I hit the same problem. I’ve used your process to restore from backup (fortunately we have a script that makes nightly backups using the export commands on all customer servers).

    Once restored I had a look in the working databases and I think I could have recovered if I had a working one to compare with:

    dbo.Component — ‘how it should have looked’ — rtc on your fe2 (what is now active as it’s failed over to it) should have the two entries but *both pointing to itself*. On fe1 that table has *no* entries.

    That might have been enough, don’t know, but I also got messages about it being in migrating state so checked some of the other tables. dbo.DbConfigInt has a value CurrentState. That needs to be 0 on the active server (FE2 in your scenario) and 3 on the backup server. In mine it was 1 on one and 3 on the other. Wish I’d tried 0!

    Also dbo.Batch might be relevant, the PartialVersion value had one entry at 2 and the other at 3.

    I edited the SCP manually via ADSIEdit, that’s easy enough.

    Thanks for your article.


  12. MSFT really do need to sort this out! I could not believe it when I ran into this issue myself; it’s absolutely crazy this was not tested by MSFT, or not robustly enough if it was, I may add…

    • I managed to work around the issue:

      1. Active CMS is down, failed over to Backup CMS using -backupsql params with -force
      2. Completed failover… all good

      Graceful Failback:
      Now the part where we need to fail back is where we have all these issues:
      1. Bring up the failed pool; check pool01 health, and check that CMS replication and Backup Service sync status are healthy
      2. Graceful failover of CMS from Pool02 to Pool01 – hit the same issue as described in this blog!!
      At this point I had to do other stuff, so returned to this after about an hour, and still the same state…
      3. Performed a graceful CMS failover back to Pool02 (from Pool01 to Pool02), left it for 30 minutes and checked the health of CMS replication and Backup sync status of both pools = all good
      4. Performed a graceful CMS failover from Pool02 to Pool01 – all good! CMS replication good, Backup sync status good…

      I will repeat these steps to ensure it was not a fluke…it would be good to see if the steps above helps any one else ?

      • Hi Anon, We are also experiencing this issue and has caused major problems. We have now recovered but have little faith. Did you get any consistency in your testing?

  13. I had this issue and it seemed to be because the SQL Express on the two Standard Edition servers wasn’t accessible from the opposing server.
    Once I allowed both the SQL Server Browser service and the sqlserver service for the RTC instance through the Windows firewall on both servers the failover worked ok – it still came back with the same warning initially but after 5 minutes of retrying it succeeded.

  14. After having a stable 3x server Lync 2013 Standard Edition environment for several years, we did an in-place upgrade to Skype for Business, which was successful; but testing our failover procedure has completely destroyed any faith in the reliability of the system. We have had Microsoft working on this issue, as we could not get our CMS to fail back to the primary. Microsoft literally spent many hours trying to recover the problem, which just got worse, causing complete disruption; in the end we had to resort to disaster recovery procedures and recovery of the whole infrastructure, which at least got us back operational. We still have the case open with Microsoft who, by the way, do not acknowledge this as a known issue, despite us highlighting this article and several other references. I am now at a loss as to where to go next, other than to try again and hope for the best. I certainly would not recommend moving from Lync 2013 until this is resolved.

  15. Had this problem (sort of) again.
    Latest CU and Windows Firewall is off so it’s not Edward’s solution.
    Symptoms this time:
    – This is two Standard Edition Servers Pool Paired
    – CMS Failover worked
    – CMS Failback failed
    – Get-CsManagementStoreReplicationStatus shows Failed for all servers
    – Get-CsManagementStoreReplicationStatus -CentralManagementStore shows all the replicas but no master
    – Get-CsManagementConnection returns server 1 from server 1 or server 2 from server 2
    – Get-CsService -CentralManagement returns server 2, as does the Topology for the CMS location

    However the SCP in ADSIEdit (msRTCSIP-BackEndServer value, see the original post above) was pointing to server 1.

    Fix: go into ADSIEdit and point the SCP (see Mark’s original post) to server 2. Ensure all Skype services on both servers are started.

    Now trying Invoke-CsManagementServerFailover -BackupSqlServerFqdn -BackupSqlInstanceName RTC -Force:$false

  16. Another variety!

    Failover OK. (FE1 to FE2)

    Failback — Invoke-CsManagementServerFailover (with no parameters) fails because ‘FE1 is still active’ it says. But all the commands referred to above are fine, on both servers (they all say FE2 is active).

    The fix is to use SQL Server Management Studio and connect to FE1\RTCLOCAL. Open xds – dbo.DbConfigInt and change CurrentState from 0 to 3.

    Running Invoke-CsManagementServerFailover and waiting (don’t hit CTRL+C) then worked fine — verification succeeded in this case after 5 minutes.
