Friday, 17 August 2012

Techniques to avoid Citrix XenDesktop boot storms

In any environment running Citrix XenDesktop with a PVS configuration, sooner or later you are likely to come across a boot storm, or at least a mini one.

A boot storm is essentially a denial of service. It occurs when multiple XenDesktop or XenApp servers reboot simultaneously and use all the available resources (normally CPU), causing extreme slowness in the rest of the environment. In some cases the effects can last for the rest of the day, as your virtual infrastructure never recovers from the initial resource demand.

As you can imagine, in an educational environment this is amplified: a large number of users log off at the end of each lesson, then expect to log in 5 minutes later when their next lesson starts.

An easy fix would be to simply disable any "reboot on logoff" functionality, but that can have its own implications.

For example, your PVS environment may redirect the write cache to local storage. This gives improved performance because it is less reliant on network infrastructure, but the local cache is usually smaller in size. If your systems are not rebooting on every logoff, there is increased potential for the write cache to become full, and with XenDesktop 6 the write cache overflow is on the PVS HDD itself.



Battling the infamous boot storm without changing write cache settings
After analysing our environment we decided we wanted to disable "reboot on logoff" but definitely wanted to keep our "Cache on device hard drive" write cache configuration. These two don't really work together by default, but the following changes allowed us to make them work together perfectly.

Part 1 - Daily reboot

First we configured a daily reboot. We found the easiest way to do this was through a combination of a script and Citrix policy.

Through our Desktop Group properties we have configured our Power management schedule to slowly start turning machines off around 12AM, coming to a low of 0 between 3 and 4AM. Then machines start turning on again, allowing us to peak back at our 100 machines at 7AM, ready for staff and students to log in at 8AM. By slowly turning these machines on and off we ensure we don't trigger a boot storm.
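To make the shape of that schedule concrete, here is a minimal Python sketch of the ramp. It is only an illustration: the real schedule is set in the Desktop Group's Power management settings, not in code. The linear ramp, the function name target_machines and the per-hour granularity are assumptions; only the 12AM, 3-4AM and 7AM points and the pool size of 100 come from the description above.

```python
def target_machines(hour: int, peak: int = 100) -> int:
    """Return the assumed number of powered-on desktops at the top of `hour` (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if hour < 3:
        # Ramp down: full pool at 12AM, shrinking evenly until 3AM.
        return peak * (3 - hour) // 3
    if hour <= 4:
        # Overnight low between 3AM and 4AM.
        return 0
    if hour < 7:
        # Ramp up: grow evenly from 4AM so the pool is full again at 7AM.
        return peak * (hour - 4) // 3
    return peak


# Build the whole day and find the largest hour-to-hour change.
schedule = [target_machines(h) for h in range(24)]
largest_step = max(abs(schedule[h + 1] - schedule[h]) for h in range(23))
```

With these assumed numbers, no more than about a third of the pool changes power state in any one hour, which is the point of ramping rather than switching everything at once.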

To complement our Desktop Group Power management configuration we also set up a simple scheduled task on the virtual desktops themselves to trigger a reboot at 3:45AM. At 3:45AM there should be no more than around 5 machines still running, ensuring we are only rebooting a few machines at this time. This might not suit all environments, but in ours there is never going to be anyone using Citrix at 3:45AM.
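As a rough sketch of what such a scheduled task could look like, the Python snippet below builds a schtasks command line for a daily 3:45AM reboot. The task name NightlyReboot, running as SYSTEM and the exact shutdown flags are assumptions for illustration, not our actual configuration. The function only builds the argument list and does not run anything.

```python
def build_reboot_task_command(task_name: str = "NightlyReboot",
                              start_time: str = "03:45") -> list[str]:
    """Build a schtasks command that registers a daily forced reboot.

    task_name is a hypothetical name; start_time uses 24-hour HH:MM format.
    """
    return [
        "schtasks", "/Create",
        "/TN", task_name,             # name of the scheduled task
        "/TR", "shutdown /r /f /t 0", # action: forced, immediate restart
        "/SC", "DAILY",               # run once per day
        "/ST", start_time,            # start time
        "/RU", "SYSTEM",              # run as the local SYSTEM account
        "/F",                         # overwrite the task if it already exists
    ]


command = build_reboot_task_command()
```

On a Windows desktop the list could be passed to subprocess.run(command, check=True) from an elevated session to register the task.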

This ensures that at least one reboot a day clears the write cache.


Part 2 - Write cache evaluation on logoff

Part 2 is slightly more complex but just as important in ensuring the write cache has space available.

Using the KiXtart scripting language and a logoff script, we run an evaluation of the available write cache space. Depending on the outcome of that evaluation we either trigger a reboot or simply allow the system to log off.

This ensures that any system below a specified threshold of available write cache will reboot, while the rest will be immediately ready to serve the next user. I have attached the evaluation script below, written in the KiXtart language.

writecachemonitor.kix

This script can be triggered by a simple Windows domain user logoff script, targeted at the virtual desktop OU with loopback processing enabled.
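For readers who don't use KiXtart, the logic of the evaluation can be sketched in Python as follows. This is not the attached script, just an illustration of the same idea. The D:\ cache drive, the 25% free-space threshold and all function names are assumptions; pick values that match your own write cache disk.

```python
import shutil


def needs_reboot(free_bytes: int, total_bytes: int,
                 min_free_fraction: float = 0.25) -> bool:
    """Return True when free write cache space is below the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def logoff_action(cache_path: str = "D:\\",
                  min_free_fraction: float = 0.25) -> str:
    """Decide what to do at logoff: 'reboot' or 'logoff'.

    cache_path is the (assumed) volume that holds the local write cache.
    """
    usage = shutil.disk_usage(cache_path)
    if needs_reboot(usage.free, usage.total, min_free_fraction):
        return "reboot"
    return "logoff"
```

A logoff script would call logoff_action() and, if it returns "reboot", run something like shutdown /r /t 0; otherwise it exits and lets the normal logoff continue.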


These two simple techniques have worked wonders for us: not a single boot storm, noticeably better performance, and much happier end users.