Tuesday 12 November 2013

Citrix, Windows 7 Thin PC, Thinkiosk and Wireless limitations

We have decided to go with a Xenapp model with predominantly wireless devices. The challenge is getting Windows 7 Embedded, write-filtered, non-domain-joined devices to connect to a wireless network and work smoothly. Our solution was to provision an open wireless network and present only a pair of DNS servers and a Citrix VPX cluster into the wireless network. This is effectively an internet-style DMZ network that allows our internal wireless and BYOD machines to all connect in the same manner.

Thinkiosk is a great free product from Andrew Morgan that provides a locked down environment for launching VDI. As part of our adoption of Xenapp we wanted to modify our existing Windows 7 Thin PC image to support Thinkiosk in wired and wireless configurations.

Thinkiosk works great in wired deployments, but there are some limitations with wireless. If you are using Thinkiosk in an auto login, non-domain joined scenario, Thinkiosk launches very quickly, often before the wireless connection has initialized.



Making Thinkiosk work smoothly with wireless

To get around these limitations we have created a VBScript that waits for the Thinkiosk URL to become available before launching Thinkiosk. If a connection isn't established after 24 seconds, the wireless control panel applet is launched so the user can join a network. If a connection to the URL still can't be established after 60 seconds, Thinkiosk is launched regardless.

You can modify these thresholds and URLs very easily in the script below.

The script is available from my pastebin here
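
If the pastebin link ever dies, the logic looks roughly like the sketch below. This is a minimal reconstruction, not the full script; the Citrix URL, the ThinKiosk install path and the control panel command are placeholders to adjust for your environment.

' thinkiosklauncher.vbs - minimal sketch of the launch logic
Option Explicit
Dim objShell, objHTTP, elapsed, connected, appletShown
Const CITRIX_URL   = "https://citrix.contoso.info"  ' placeholder URL
Const APPLET_AFTER = 24  ' seconds before showing the wireless applet
Const LAUNCH_AFTER = 60  ' seconds before launching Thinkiosk regardless

Set objShell = CreateObject("WScript.Shell")
elapsed = 0
connected = False
appletShown = False

Do While Not connected And elapsed < LAUNCH_AFTER
    On Error Resume Next
    Err.Clear
    Set objHTTP = CreateObject("MSXML2.XMLHTTP")
    objHTTP.Open "GET", CITRIX_URL, False
    objHTTP.Send
    connected = (Err.Number = 0)
    On Error GoTo 0
    If Not connected Then
        WScript.Sleep 3000
        elapsed = elapsed + 3
        If elapsed >= APPLET_AFTER And Not appletShown Then
            ' let the user join a wireless network while we keep polling
            objShell.Run "control.exe /name Microsoft.NetworkAndSharingCenter", 1, False
            appletShown = True
        End If
    End If
Loop

' launch Thinkiosk whether or not the URL ever responded
objShell.Run """C:\Program Files\ThinKiosk\ThinKiosk.exe""", 1, False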

To launch the script on logon, simply replace your Windows shell with cscript and the script path, as per below:
reg add "HKEY_local_machine\Software\Microsoft\Windows NT\CurrentVersion\Winlogon" /v shell /t reg_sz /d "cscript c:\windows\thinkiosklauncher.vbs" /f



Supporting wireless persistence with write filters

The other major limitation with wireless and Windows 7 Embedded is the ability to retain wireless configurations when write filters are enabled. Write filters do a great job of keeping devices consistent and lowering support overhead, but they can also be restrictive when local changes need to persist.

We want users to take the devices home and connect in the same manner as they would if they were on site. We have configured split-brain DNS so the Citrix URL is accessible both internally and externally. For this to work smoothly our users need to be able to connect to their home wireless networks and have the settings remembered. Could you imagine typing in your uber-secure 64 character WPA key every time you turn the device on? Yuck, no thanks.

We have determined the following exclusions need to be added to support wireless persistence on write filter enabled machines.
File: c:\programdata\microsoft\wlansvc\profiles
Registry: HKLM\software\microsoft\windows nt\currentversion\networklist\profiles
Registry: HKLM\software\microsoft\wlansvc\interfaces
Adding the above exclusions in conjunction with Thinkiosk wireless support gives your users the ability to connect to and remember wireless networks on their thin client. Depending on the write filter method you are using you may need a different command, but for file based write filters you can add the file exclusion as per below.

fbwfmgr /addexclusion c: "\programdata\microsoft\wlansvc\profiles"

The registry settings are a little more involved, we suggest reading the following blog to get some more insight on how that's achieved http://geekswithblogs.net/WallabyFan/archive/2008/12/24/everything-you-wanted-to-know-about-fbwf-but-were-afraid.aspx

Thinkiosk wireless client support can be enabled with the following registry key.

reg add "HKEY_local_machine\Software\THINKIOSK" /v ShowWifi /t reg_dword /d "1" /f

We have also added a line to the above Thinkiosk wireless launcher script to re-import our internal wireless network after each boot. This is just to ensure our users don't accidentally (or intentionally) delete our wireless network.

objShell.exec("Netsh wlan add profile filename=c:\windows\wireless-open.xml user=all")

The above wireless-open.xml configuration can be exported with the netsh wlan export tool and then re-imported on each boot or login to ensure your network is never permanently removed.
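
For example, to capture the open network profile once it is configured (the profile name and output folder are placeholders; for secured networks add key=clear so the key is exported in plain text rather than encrypted to the exporting machine):

netsh wlan export profile name="wireless-open" folder=c:\windows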

Thursday 31 October 2013

Publishing Server 2012 R2 Work Folders with UAG 2010 SP3 Reverse Proxy

We already knew before 2012 R2 came out that we were going to use Work Folders in a trial later in the year. As soon as the ISO dropped we began organizing our server environment to handle Work Folders synchronization. Of course the biggest part of this is allowing synchronization from home, so reverse proxying a remote access solution was a must.

We use UAG 2010 and while Microsoft may include a wizard in the upcoming SP4, unfortunately it is not available yet for early adopters. Here are the steps we used to make it work.



Prerequisites

Before you start, it is best to use split-brain DNS for a smooth and speedy Work Folders experience. Create a new DNS record on your internal DNS servers using the external DNS FQDN with a low time-to-live, maybe 15 minutes, and point it to your internal Work Folders server IP. Then create the same DNS record on the DNS servers authoritative for your external records and point it to your UAG trunk.

The records will resolve something like:
Internal: workfolders.contoso.info 192.168.1.1 (internal Work Folders server)
External: workfolders.contoso.info 180.0.0.1 (external UAG trunk)
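
If your internal zone lives on Microsoft DNS, the internal record can be created from the command line; a quick example, where the server name, zone and TTL are placeholders:

dnscmd dns01 /recordadd contoso.info workfolders 900 A 192.168.1.1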



Step by Step
1. Open UAG and take an existing trunk (or create a new trunk) that has Trunk authentication disabled. The trunk we used had a wildcard certificate and it worked perfectly. 
2. Add a new application. 
3. Select "Other Web Application (application specific hostname)", click OK. 
4. Name the application "workfolders" and set the application type to "workfolders", click OK. 
5. Select "Configure an application server", click OK. 
6. In "addresses" enter your workspaces URL, we are using split-brain DNS so the internal and external address will be the same. 
For paths enter "/sync/1.0/" as this is the only part of the Work Spaces server that the reverse proxy needs to forward. 
In public hostname enter workspaces, you will need to create the corresponding external A record (or CNAME record for existing trunks), click OK. 

7. Leave "use SSO" unticked, click OK. 
8. Un-tick "add a portal and toolbar link", click OK. 
9. Leave "Authorize all users" ticked, click OK. 
10. Click finish to create the application. 
11. Now we must make some modifications to the URL set; to do this click "Configure trunk settings" under the trunk. 
12. Click the URL set tab and find the "workfolders_Rule1" rule. The rule already will show the URL of "/sync/1.0/.*", we need to modify the methods. 
The default methods are POST and GET; add in DELETE, PUT and HEAD.
Click OK to save the settings. 
13. Click the "workfolders" application and select edit. 
14. Select the Web Settings tab and click "Allow POST requests without a content-type header", then click OK to accept the changes. 
15. Save your changes and activate your configuration

Monday 21 October 2013

Aastra 6725ip POE problems with HP Procurve switching

We recently received a number of Aastra 6725ip handsets; the handsets immediately powered up via POE and connected to our Lync 2013 deployment. When I returned to work on Monday morning none of the phones were connected to the Lync server, and after rebooting them they no longer booted from POE.

The problem appeared after the handsets updated to newer OC Phone firmware editions, 4.0.7577.4397 (CU9) and above, although we can't confirm whether the issue also exists in firmware versions CU8 and below.

We tried a number of different Procurve switch models including other 5400 series switches (all fully populated with 900w POE+ power supplies) and 2900 series models; all experienced the same no-power problem. The switches complained of not detecting an MPS signature, and occasionally the switch also gave over-current detection errors. If we used a POE injector or DC power supply the phone started without an issue. We also verified this was not an issue with POE power limits, pre-std-detection or LLDP.

When we contacted Aastra they asked us to take a photo of the label underneath the phone. After viewing the label, the technician immediately offered to RMA all my phones but didn't give any reason as to why they needed to be RMA'd.

After receiving the new phone I compared the labels and, to no surprise, the replacement unit is a Revision B compared to the previous Revision A unit.

While this is in no way confirmed by Aastra, we theorize that later editions of Lync 2013 phone edition made some changes to the POE signature that caused incompatibilities between the Revision A Aastra 6725ip handsets and HP Procurve switches. If you are experiencing similar issues, Aastra are very efficient at dealing with them and promptly shipped me new handsets without any hassle.

The old handset





The new handset

Tuesday 8 October 2013

Lync 2013 client popup credentials are required for Outlook

When we started migrating users to Exchange 2013 and Lync 2013, users began complaining of being prompted to enter their credentials for Outlook when starting Lync.

The exact error message is:
Credentials are required
Lync needs your username and password to connect for retrieving calendar data from Outlook


We did some Wiresharking and found this to occur when Lync was connecting to the EWS and Autodiscover Exchange URLs.



The Solution

1. Ensure your internal and external EWS/Autodiscover URLs are in the Internet Explorer local intranet zone. This will be extremely helpful in assisting you to troubleshoot the problem.

2. Logon to your Exchange server and open IIS.

3. Click the EWS directory under "Default Web Site".

4. Open Authentication, click Windows Authentication and then Providers.

5. Remove all entries except for NTLM (a scripted equivalent is sketched after these steps).



6. Repeat the same for the Autodiscover directory.

7. Restart IIS.

Now restart your client and you should no longer be prompted for Outlook credentials.
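
If you would rather script the provider change from steps 5 and 6, appcmd can do it. This is a rough sketch (run from %windir%\system32\inetsrv); the provider names present on your server may differ, so verify the result in IIS Manager afterwards:

appcmd set config "Default Web Site/EWS" -section:system.webServer/security/authentication/windowsAuthentication /-"providers.[value='Negotiate']" /commit:apphost
appcmd set config "Default Web Site/Autodiscover" -section:system.webServer/security/authentication/windowsAuthentication /-"providers.[value='Negotiate']" /commit:apphost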



Additional troubleshooting tips

Internet Explorer debug tools are great for troubleshooting authentication issues, but first ensure your EWS/Autodiscover URLs are in the Internet Explorer local intranet zone.

A simple test is to go to the EWS URL (http://yourdomain/EWS/exchange.asmx) in Internet Explorer; it should automatically negotiate authentication and show you the service page. If you don't receive the "You have created a service" page then your single sign on is not correctly configured.

1. Open Internet Explorer and press F12 to launch the debug tools.

2. Click Network and "Start Capturing".

3. Now open your EWS URL: http://yourdomain/EWS/exchange.asmx

4. Click "Stop Capturing" and click "Go to detailed view"

You can see the order of the authentication providers and any issues that might have occurred during authentication. You can then repeat the same for Autodiscover; remember, if Autodiscover can't negotiate pass-through authentication it will prompt the Lync user for credentials.



Wednesday 2 October 2013

Integrating Office Web Apps 2013 with Exchange 2013 OWA and UAG gateway

Office Web Apps 2013 (OWA) is a great way to extend the functionality of Exchange 2013 Outlook Web App and give full PowerPoint viewing functionality in the browser. When we built our new Exchange 2013 environment we wanted to offer Office Web Apps internally and externally.

After building our environment we had no problems making OWA work internally with Outlook Web App, but externally we always received error messages from OWA.

"Sorry, we couldn't open this presentation because we ran into a problem. Please try again."



Our external Outlook Web App is accessed via a Forefront UAG trunk, however there were no error messages on the OWA, Exchange or UAG servers indicating a problem.

We ran a Wireshark packet trace on the OWA server and found that when the request for the PowerPoint web app came in externally, the OWA server tried to establish a connection to UAG. This makes sense, as the PowerPoint file needs to be transferred from Outlook Web App to OWA somehow and UAG stands in its way.



The Resolution

1. Create a simple A record in the hosts file of the OWA server. The record should map the FQDN of your UAG trunk to the internal IP address of your Outlook Web App server; a load balancer address is fine here.
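
Using hypothetical names and addresses, the entry looks like this (the UAG trunk FQDN and internal IP are placeholders):

# c:\windows\system32\drivers\etc\hosts on the OWA server
192.168.1.1    mail.contoso.info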

In our circumstance we needed to re-issue the Exchange certificates to include the UAG FQDN in the subject alternative name of the certificate. We tried without re-issuing the certificate but received the same "Please try again." error messages as above. As you would expect, the OWA server rejects the certificate if it doesn't contain the correct FQDN.

You also need to ensure your OWA server has network connectivity to the HTTPS port of your Outlook Web App server.

While this is a simple fix, it took us a while to even consider trying it. Now we have lots of happy users able to preview PowerPoint and Word documents externally.

Tuesday 1 October 2013

Outlook & Lync 2013 prompting for authentication with Exchange 2013 Outlook Anywhere

We are currently undergoing a massive migration from Exchange, Lync and Sharepoint 2010 to 2013. The first service being migrated is Exchange, making way for Lync and later Sharepoint. Our Exchange migration hasn't all been smooth sailing; before we even got started we had to wipe our DAG and start again. However with a little persistence we got a trial DAG up and running and migrated a couple of test mailboxes.

As we got into the trial we found some users with Outlook 2013 had single sign on (NTLM) while others were prompted for authentication. This also affected Lync, as the UCMAPI service connects to Exchange via Outlook, so Lync was also prompting for credentials when the user logged in. While a password prompt is acceptable for external use, we were not happy with this for internal devices.

Clients that were prompted for authentication were able to single sign on to rpcproxy and EWS with Internet Explorer, so the problem didn't seem server based.

Hours were invested trying to resolve this problem; some of the diagnostics and fixes we attempted were:

  • Packet tracing
  • IIS log tracing
  • Re-creating RPC and EWS directories
  • Changing authentication in the Outlook client
  • Changing authentication on exchange virtual directories
  • Changing IIS authentication
  • Adding exchange domains to local intranet zone in Internet Explorer
  • Bypassing load balancers
  • Client based registry hacks
  • Client application re-installation


The Resolution

We were nearly at the point of building a second Exchange environment for testing when we made a breakthrough. I almost feel embarrassed admitting the problem: group policy.

Our legacy Exchange 2010 group policy was set up for RPC internally and HTTPS Outlook Anywhere externally, however we never set the MSSTD field; in fact we removed it with GPO.

As soon as we set the MSSTD field to one of the SAN names of the certificate (wildcard certificates are also fine), Outlook single sign on worked perfectly and Lync stopped asking for authentication on sign-in.
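
For reference, the field uses the msstd: format; with a hypothetical wildcard certificate, the Outlook setting ("Only connect to proxy servers that have this principal name in their certificate") would be:

msstd:*.contoso.info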

Hopefully you don't waste as much time as we did on this problem.

Friday 6 September 2013

Exchange 2013 mail stuck in draft - Mailbox Transport Submission service failing to start

We recently began building an Exchange 2013 DAG to support our email environment. This is part of an internal shift to Lync 2013 and Exchange 2013 for unified communications. We followed the Microsoft instructions on building our first 2013 machine, migrated a single test mailbox and started testing.

We found no mail flow between the Exchange 2013 and Exchange 2010 environments; all email sent from 2013 OWA or Outlook just sat in the drafts folder. In fact, even email between mailboxes on the same 2013 mail database stayed in drafts.

Upon investigating the problem we found the "Microsoft Exchange Mailbox Transport Submission", "Microsoft Exchange Transport" and "Microsoft Exchange Frontend Transport" services were all failing.



The Problem

Windows application event logs showed the following errors when trying to restart the "Microsoft Exchange Mailbox Transport Submission" service.
Log Name: Application
Source: Application Error
Event ID: 1000 
Faulting application name: MSExchangeSubmission.exe, version: 15.0.712.12, time stamp: 0x51aff4c9
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc00000fd
Fault offset: 0x000007fb0b3cc301
Faulting process id: 0x2894
Faulting application start time: 0x01ceaa8fd8cefada
Faulting application path: C:\Program Files\Microsoft\Exchange Server\V15\Bin\MSExchangeSubmission.exe
Faulting module path: unknown
Report Id: 179beb57-1683-11e3-93f5-00155dd80a77
Faulting package full name:
Faulting package-relative application ID: 

There is also another event log with some more telling information.
Log Name: Application
Source: MSExchangeTransportSubmission
Event ID: 7010 
The activation of modules are taking longer than expected to complete. Current state of components:<LoadTimings>
....
  <Component Name="Dns" Elapsed="00:00:23.1781923" IsRunning="true" />
</LoadTimings>
<StartTimings />
<StopTimings />
<UnloadTimings />
This indicates the DNS component was failing to start in a timely manner.



The Resolution

We checked and double-checked our DNS, no issues. We also followed a number of blogs that say you need to disable IPv6 and manually select the network adapter under DNS lookups in the Exchange server properties. We did this to no avail, however the actual resolution wasn't far off.
1. Open Exchange 2013 ECP
2. Go to "Servers" tab
3. Select "servers" from across the top menu
4. Double click the problematic server from the list of Exchange servers
5. Select "DNS lookups"
6. Instead of selecting your network adapter, select "Custom settings" under "External DNS lookups" (a shell equivalent is sketched after these steps)
7. Add all your DNS servers
8. Repeat the same for "Internal DNS lookups"
9. Restart the above failing transport services
10. Rejoice
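
If you'd rather script it, the Exchange Management Shell equivalent of steps 6 to 8 looks roughly like this; the server name and DNS addresses are placeholders:

Set-TransportService EX01 -InternalDNSAdapterEnabled $false -InternalDNSServers 192.168.1.1,192.168.1.2 -ExternalDNSAdapterEnabled $false -ExternalDNSServers 192.168.1.1,192.168.1.2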


This seems to be a bug with a .NET 4 module Microsoft.Exchange.Net.ni.dll. Selecting "Custom settings" bypasses this .NET bug.

If you are still having issues with mail flow, ensure your new Exchange 2013 server has a system mailbox and consider removing the Exchange 2013 server, wiping AD clean and starting a fresh install.

Monday 19 August 2013

eVc I2C PWM controller

I recently received an exciting package in the mail: Jon Elmor Sandstrom sent me his latest creation, eVc. eVc is a USB-connected I2C controller, capable of interfacing directly with PWM controllers that support I2C. In short, the device gives the ability to change a number of PWM settings on the fly. This can help overclockers push insane clocks and troubleshoot problems with motherboards and graphics cards.

I have taken a closer look at eVc with a Gigabyte 7790 OC card, first with ambient air cooling and then at -40 degrees Celsius on my single stage cooler.

The video below goes into detail explaining how eVc works, with a little demonstration of the clocks gained by a combination of extra voltage, load line and subzero cooling.

I am really looking forward to seeing how the future of this device pans out.


Wednesday 14 August 2013

James Trevaskis and G.Skill break the memory frequency world record with 4404 MHz

Since the launch of Z87 and Haswell there has been a big focus on memory clocking. We first saw memory frequency taken past the 4000 MHz mark by Gigabyte pre-launch. The record then bounced back and forward between Gigabyte, ASROCK and ASUS over the last few months.

I decided to give the record a crack with the petite ASUS Maximus 6 Impact board and a GSKILL 3000 MHz MFR kit, and came up with some good numbers.


Setup
ASUS Maximus 6 Impact
GSKILL 3000 MHz TridentX
Intel 4770K CPU
Corsair AX1200
Kingpincooling ram heatspreaders from Ney Pro and custom dominance copper pot
Liquid Nitrogen Cooling

The end result was a validated clock of 4404 MHz, smashing the previous world record by over 100 MHz. Normally RAM world records are decided by 5-10 MHz, so this was a big achievement by the pint sized Impact.


Most overclocking records are respected within the OC community but ignored in the wider tech community, however this record gained widespread attention from the tech world. Here is a link to the ASUS ROG post about the record http://rog.asus.com/258552013/overclocking/4404mhz-ddr3-world-record/

Furthermore I was able to run SuperPI 1M, a very light stability test, past the previous world record at around ~4300 MHz.

Below are links to some of the Tech media that covered the record.
http://hexus.net/tech/news/ram/58797-gskill-reclaims-worlds-fastest-ddr3-memory-title-44ghz/
http://techgage.com/news/teamaus-youngpro-achieves-crazy-ddr3-4400-overclock-with-g-skill-tridentx-memory/
http://benchmarkreviews.com/3822/g-skill-tridentx-ddr3-overclocked-by-youngpro-to-4400mhz-with-ln2/
http://www.legitreviews.com/g-skill-tridentx-ddr3-memory-reaches-4400mhz-new-world-record_15854

Friday 2 August 2013

Windows 8 network drives not mapping via logon script with UAC enabled

Many enterprises attempting to tackle Windows 8 may find themselves in the same predicament we did, whereby existing logon scripts fail to map network drives when UAC is enabled. Unfortunately this isn't as simple to fix as it was in Windows 7. Well, that isn't strictly true: it is easy to fix by disabling UAC, but then you introduce another problem, metro apps no longer work.

For the uninitiated, metro apps require UAC to be enabled to work, and in most circumstances UAC breaks drive mappings.

The root cause of this problem is that drive mappings are occurring under the wrong permission token.

When UAC is enabled, metro apps work, but drives are mapped with the privileged token, so you can't see them in Windows Explorer; if you open a command prompt as administrator you will be able to access those mapped drives.

When UAC is disabled, metro apps don't work; drives are still mapped with the privileged token but Windows Explorer is also started with the privileged token, so the mappings are visible.

The trick is leaving UAC enabled so metro apps work, while still mapping the network drives in the non-privileged context so they appear in Explorer. Chances are, if you have a complex logon script with HKCU registry edits, you want some of your logon script to run as the privileged user (registry) and some as the non-privileged user (drive mappings).

We tested a number of solutions that all had some downfalls:

  • Disabling UAC and forfeiting the use of metro apps
  • Playing around with different combinations of UAC group policy options
  • Using the EnableLUA registry setting
None of the above could deliver mapped drives and working metro apps, however there is a solution.



The Solution

We use Kixtart scripting internally, so ideally we would like to keep these scripts when moving to Windows 8. I stumbled upon a function called RunAsInteractiveUser that was designed for Vista and Windows 7. The RunAsInteractiveUser function leverages the Windows Task Scheduler's ability to run a task as the interactive user, therefore inheriting the unprivileged token when UAC is enabled.

This function allows programs to be launched as the non-privileged user when being launched from a privileged process on a system where UAC is enabled. 

We made a simple modification to our Kixtart script: when a Windows 8 system launches the logon script, the drive mapping section is skipped and instead spawned under the non-privileged account. The code goes something like this.

;**** Windows 8 override/skips
if instr(@producttype,"Windows 8")
 ;if UAC is specified, only remap drives, skip rest of script
  if ($input = "uac")
    goto remap
  endif
  $RC=RunAsInteractiveUser("\\contoso.info\dfs\scripts\logon\logonuac.cmd", "", "c:\windows\temp\",1)
  ;exit
endif
We then use a batch file called logonuac.cmd to relaunch the logon script with a $input=uac argument; when this argument is specified the script only remaps the network drives and ignores the rest of the script (a sketch of logonuac.cmd follows the summary below).

1. GPO triggers the user logon script with the privileged token.
2. The logon script runs; if Windows 8 is detected it performs all actions with the privileged token but skips drive mappings. The logon script is then relaunched with the $input=uac argument under the non-privileged token.
3. The logon script detects $input=uac and only maps network drives. After the network drives are mapped, the logon script exits.
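
logonuac.cmd itself can be tiny; a sketch, where the kix32.exe location and script name are assumptions:

@echo off
rem relaunch the Kixtart logon script under the interactive (non-privileged)
rem token, passing $input=uac so only the drive mappings run
\\contoso.info\dfs\scripts\logon\kix32.exe \\contoso.info\dfs\scripts\logon\logon.kix $input=uac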

If you wanted to simplify, you could move the drive mappings out of the main logon script into a separate script and use code something like:
if instr(@producttype,"Windows 8")
  $RC=RunAsInteractiveUser("\\contoso.info\scripts\logon\logonuac.cmd", "", "c:\windows\temp\",1)
else
  RUN '"\\contoso.info\scripts\logon\logonuac.cmd"'
endif


This would map drives on Windows 8 systems under the non-privileged token and everything else as normal.

We spent days on this problem, hopefully this solution saves you time.

Alternatively, if you are not running Kixtart and don't want to, you can look at the LaunchApp workaround script. You could also do something as simple as moving the logon script away from GPO and running it as a scheduled task on the local machine as the interactive user; this should also work.

Wednesday 24 July 2013

Recovering corrupt Adobe Flash CS6 FLA files

Adobe Flash CS6 is an application widely used in both secondary and tertiary design classes. However Adobe doesn't support saving files to the network; in fact they don't even support installing CS6 on domain-joined computers. This is short-sighted of Adobe, as surely a large component of their CS business is driven by education sales, all of which would implement some form of domain or networked environment. After continued errors with saving Photoshop and Flash files to the network we implemented a policy stating that users should save to the local drive and then back up to the network.

Flash FLA files are zips containing a number individual files inside. Often during the drawn out process of saving individual files and then zipping them, flash has a tendency to corrupt saves, especially when saving to network storage.

We recently spent time trying to recover a FLA file and came up with a process that can often recover corrupt files.


The Solution
1. Install and open Winrar.  
2. Take a copy of your corrupted FLA file and open it with Winrar. File > Open Archive 
3. Repair the archive. This option is available from the Tools menu.

4. Choose a location to save the repaired archive to.
 

5. Use Winrar to open the file you saved in step 4.
File > Open Archive > select the location you saved the repaired archive to > select the archive 
6. Click "Extract To" and choose a location to extract the repaired FLA file to. This will unzip the FLA zip file into the individual files that are contained inside. 
Accept any errors, it may complain of CRC errors or say some files can't be extracted.  
7. Go to the location you extracted the archive to in step 6. Open the XFL file with Flash, it should have the same name as your original FLA file.  
On a computer with flash you should be able to double click it.



Your file should open damage free! Depending on how damaged the archive is this may not work, however we found on most occasions it does and saves you a heap of time.

Thursday 18 July 2013

TMG 2010 Certificate Enrollment Fails - RPC server is unavailable

During some recent TMG 2010 maintenance we noticed the TMG computer certificate had expired a few months earlier; this should never occur, as our Active Directory CA should re-issue a certificate when expiration is near.

We first attempted to manually re-request the certificate via the Certificates MMC snap-in and were presented with an error saying "The RPC server is unavailable."

Our eventlogs suggested that a DCOM problem may have occurred. After checking DCOM and the RPC Service we were unable to uncover any issues.




The Solution

It turns out TMG itself introduces this error intentionally. The "Enforce strict RPC compliance" setting, which is enabled by default, blocks the RPC functionality required for AD-based certificate enrollment to work. Fortunately the fix is straightforward.
1. Open your TMG console 
2. Navigate to "Firewall Policy" 
3. Right click "Firewall Policy", select "All Tasks", select "System Policy" and finally select "Edit System Policy"
 

4. Under "Authentication Services" select "Active Directory 
5. Untick "Enforce strict RPC compliance
 

6. Click "OK" 
7. Apply the policy changes

You should of course do the appropriate research before disabling this setting; in some super high security environments you may not wish to disable strict RPC compliance, however in our environment it made no difference.

After waiting a few minutes for the policy changes to occur our problem was resolved; certificate enrollment once again worked perfectly.

Newegg TV interview for GSKILL at Computex 2013

I just thought I'd throw this Youtube link up. During Computex 2013 this year I spent a week doing liquid nitrogen overclocking demonstrations at the GSKILL booth with Hiwa Piori and Christian Ney. Our goal was to break world records and get the crowd having fun with liquid nitrogen and high end computer components.

Newegg TV dropped by to see what we were doing and ask a few questions about Haswell and overclocking. My (Youngpro) interview starts around 9:15.


Thursday 20 June 2013

Lync 2013 prompting for credentials after successful sign in

We recently migrated our environment to Microsoft Office 2013, which of course comes with Lync 2013. Through all our testing and validation groups we saw a number of problems and fixed them as they appeared; unfortunately one was missed.



The Problem

After our initial roll out we had a number of users reporting Lync 2013 kept prompting them for credentials at random intervals. Our testing found the following facts regarding this situation.

  1. Lync seemed to be working perfectly minus the prompting for credentials
  2. The problem would only occur if Outlook had been opened at least once, so the prompt is related to Exchange/Outlook integration.
  3. If "Personal information manager" is set to None, under the personal options menu, the problem goes away.
  4. The problem is related to authentication with our internal proxy. If proxy is disabled in IE the problem goes away, it re-appears when proxy is re-enabled.

Even though this didn't affect Lync performance or functionality it was annoying for users, and disabling Outlook integration wasn't an option as presence is one of our key uses for Lync.



The Solution

We tried any number of fixes before we resolved this problem. If anyone tells you Lync 2013 CU1 fixes this problem, tell them to test it again; it doesn't, at least not when proxies are involved.

In our circumstance this error was related to Lync hitting an internal proxy server. We tried adding the URLs we found (listed below) to the IE proxy exclusion list, but Lync 2013 seems to ignore this list.

Unlike Lync 2010, Lync 2013 doesn't use pass-through authentication; this is related to the use of WinHTTP as an authentication conduit. We also tried playing with the WinHTTP proxy settings via netsh, but had no success. It is said that this problem will be resolved in a future update.
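
For reference, this is the sort of WinHTTP change we experimented with (the proxy name and bypass list are placeholders):

netsh winhttp set proxy proxy-server="proxy.contoso.info:8080" bypass-list="*.contoso.info;<local>"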

Initially we tried passing our OWA/Exchange related URLs through the proxy without authentication, but still the credentials box appeared.

Out came Wireshark and the proxy logs, and we finally found all the culprit URLs, listed below.
  • All exchange OWA/EWS URLs
  • login.microsoftonline.com
  • clientconfig.microsoftonline-p.net
We allowed these to traverse our internal proxy without requiring authentication and the problem was solved: no more authorization prompts, no more frustrated users. I added both http and https versions of the Microsoft URLs for good measure.

Wednesday 19 June 2013

Hiatus

Between the launch of Z87, multiple projects at work and recently starting an IT degree, I haven't had much time to put into blogging.

Lately I've also had some personal projects such as building a reasonably sized Litecoin/Bitcoin farm; so far we are up to 7000 KH/s with another 2500 coming within a few weeks. I have an upcoming article for ABC Tech covering cryptocurrency; I can hopefully come at it from a different perspective than most other writers, looking at the community from the inside out.

I have a backlog of blog posts to write, the next up within a few weeks (I hope).

Cheers

Tuesday 30 April 2013

Microsoft Failover Cluster CSV Volume Disappear

We recently began experiencing some problems with the 3rd member of our Windows Failover Cluster. Our cluster consists of 3 servers running 2008 R2 SP1 with Failover Cluster Manager and a SAN backend. The SAN presents a number of Cluster Shared Volumes (CSVs) to the servers; all of our data sits on these CSVs.

One afternoon our primary CSV went into redirected mode. This is a normal occurrence during backup operations, but no backup was scheduled and we were not able to turn off redirected mode. We had to schedule a short outage, fully power off the Hyper-V hosts and power them back on. After a full investigation turned up nothing we put it down to an anomaly. Three weeks later the problem happened again; this time we scheduled a longer outage so we could investigate more thoroughly.

During testing we discovered that one of the 3 hosts was causing the issue: when it was removed from the cluster, no problems; when it was in the cluster, the CSV in question would randomly go into redirected mode. Logs on the SAN and Hyper-V hosts turned up nothing and all the cluster tests passed perfectly.

Unfortunately, during our testing we encountered a bigger problem. When bringing the faulty host back online for the 3rd time, the CSV itself disappeared on the 2 healthy hosts; the CSV was still visible on the 3rd host. We promptly removed the 3rd host from the cluster but the CSV did not reappear on the 2 healthy hosts.




What didn't work

We tried a number of processes to get the volume to re-appear.
  • Rescanning/refreshing in Disk Manager
  • Deleting and re-adding the CSV
  • Repairing the CSV
  • Restarting the Hyper-V hosts
  • Removing the faulty host from the SAN LUN zones
At this point we were a little worried: our primary CSV was displayed in Windows as an empty disk (as above). With the Failover Cluster tools we checked the DiskSignature and were greeted with a grim 0.

Command: cluster resource VMs /priv
D  VMs                  DiskSignature                  0 (0x0)

Scanning the FailoverClustering event logs we turned up the following events:

Event ID: 1568
Source: FailoverClustering
Cluster physical disk resource 'VMs' cannot be brought online because the associated disk could not be found. The expected signature of the disk was 'F62FC592'. If the disk was replaced or restored, in the Failover Cluster Manager snap-in, you can use the Repair function (in the properties sheet for the disk) to repair the new or restored disk. If the disk will not be replaced, delete the associated disk resource.

and

Event ID: 1568
Source: FailoverClustering
Cluster disk resource 'VMs' found the disk identifier to be stale. This may be expected if a restore operation was just performed or if this cluster uses replicated storage. The DiskSignature or DiskUniqueIds property for the disk resource has been corrected.

This was repeated over and over; the cluster was trying to repair the problem but not having any success.




The Solution

After reading this thread we noticed in the last post a user mentioned that "a Microsoft tech fixed the problem, the disk first sector was corrupted". We decided a partition table scan and re-write were worth a shot.

Using TestDisk I was able to successfully recover the volume by first analyzing the disk for partitions and then writing the changes.

I then re-wrote the disk signature (which I found in the FailoverClustering logs, as per above) to the volume using the below command.

CLUSTER RESOURCE VMs /priv DiskSignature=0xF62FC592

The volume then successfully came online, phew and all within my outage window!


Monday 8 April 2013

DPM 2012 SP1 replica inconsistent - datasource is owned by a different DPM server

Recently we took the leap of faith to DPM 2012 SP1 + Update Rollup 1. SP1 offers proper compatibility with SQL 2012 and Server 2012, products we have begun using within our organization. The initial install and management went without a hitch; in fact it went eerily too well.

The following month however wasn't such smooth sailing: within a few days a number of SQL data sources belonging to two different protection groups began failing with "DPM could not run the backup/recovery job for the data source because it is owned by a different DPM server.". The error description went on to say "Owner DPM Server: .", claiming "." owned the DPM job.

This was an unusual error to receive, as there has only ever been a single DPM server within the organization, so the possibility of another DPM server owning the job was highly unlikely.


The problem in detail

The 5 data sources that were failing were all SharePoint data sources. We are using a Sharepoint 2010 Farm protection group (PG) and backing up any SQL resources that aren't covered in this PG with a simple SQL PG. "Sharepoint_Config" was one of the failing resources, as well as 4 SQL jobs: "Application_Registry_Service", "Bdc_Service_DB", "Managed Metadata Service" and "PerformancePoint Service Application".

DPM complained that the "Replica is inconsistent" and attached the following detailed error description:
"The replica of SQL Server 2008 database SERVER\Application_Registry_Service on server.contoso.internal is inconsistent with the protected data source. All protection activies for data source will fail until the replica is synchronized with consistency check. You can recover data from existing recovery points, but new recovery points cannot be created until the replica is consistent.
For SharePoint farm, recovery points will continue getting created with the databases that are consistent. To backup inconsistent databases, run a consistency check on the farm. (ID 3106)
DPM could not run the backup/recovery job for the data source because it is owned by a different DPM server.
Data Source: SERVER\Application_Registry_Service
Owner DPM Server: . (ID 3184 Details: The file or directory is corrupted and unreadable (0x80070570))"

DPM suggested taking ownership; I attempted this, re-ran the consistency check, and within 5 minutes received the error messages again.

Our logs were also complaining of communication problems, which we initially put down to network issues, but this theory was quickly debunked as other data sources on the same server were backing up successfully.
FsmBlock.cs(178)        2DE6593E-B086-4002-9205-0A57B65BDC8E    WARNING    Backup.DeltaDataTransferLoop.CommonLoop : RAReadDatasetDelta, StatusReason = Error (StatusCode = -2147023671, ErrorCode = CommunicationProblem, workitem = 70aeaa93-b090-4a3c-bea0-c6fd1a1b4625)
01    TaskExecutor.cs(843)        2DE6593E-B086-4002-9205-0A57B65BDC8E    FATAL    Task stopped (state=Failed, error=RmAgentCommunicationError; -2147023671; WindowsHResult),

We also tried removing the Protect Group and re-adding it, checking SQL permissions, repairing the DPM agent, un-installing and re-installing the Agent, all of which failed.



The Solution

This was a very difficult one to troubleshoot; as SP1 was so new, no one had published details of experiencing similar problems.

Eventually we tracked the issue down to ActiveOwner problems on the SQL server. The ActiveOwner files are located in "c:\program files\Microsoft Data Protection Manager\dpm\activeowner" on the server hosting the databases (the SQL server being backed up). These ActiveOwner files are used to manage ownership of databases, important for ensuring multiple DPM servers aren't attempting to back up/restore resources contemporaneously.

After opening the directory and locating the ActiveOwner files for the failing databases, we noticed they were all 0 KB, while healthy ActiveOwner files were 1 KB and contained the name of the owner DPM server.
1. Open "c:\program files\Microsoft Data Protection Manager\dpm\activeowner" on the database server. 
2. Rename any 0 KB files to <name>.old 
3. Run SetDpmServer from "c:\program files\Microsoft Data Protection Manager\dpm\bin". Syntax: SetDpmServer.exe -DpmServerName <SERVER>
Replace <SERVER> with the computer name of your DPM server. 
4. Re-run your synchronization
This fix literally takes 3 minutes, yet it took us an entire week of investigation to come to this conclusion.

Wednesday 6 March 2013

Distributing Adobe Acrobat 9.x updates in an enterprise

I will save you my Adobe hate rant, but if you ever look at my Twitter it's well documented that I am not a fan of the Adobe update model. It's slow, updates need to be run consecutively, and it uses lots of CPU cycles and bandwidth.

I am sure administrators that are pushed for time just ignore Acrobat updates; after all, end users will never notice the benefits of security patches, right? I prefer to play it safe and stay best practice where possible, so I have chosen a simple Kixtart script to manage the update process.


The script explained

The code is very simple, providing a step-by-step update from 9.0.0 right up to version 9.5.3, the current version in the 9.x stream as of the time this article was written.

At the very least you need to set the $repopath variable to a network location your user/computer accounts can access. You also must populate $repopath with all the Acrobat .msp updates.

I am using an SCCM "Whether or not a user is logged on" deployment, which means the SYSTEM account is used during the installation; consequently I permissioned my update repository to allow the "Domain Computers" group read access. I decided on installing from the network as opposed to downloading all the updates locally due to sheer size: the repo is 1.5 GB and some computers may only need 100 MB of updates, so copying everything would add an unnecessary load onto the network.

I won't paste the whole script here, but below is an example of the update process I am using. It checks the version, installs the next update in line, then checks the version again; repeat, repeat.
Install Update, Check Version
  if ($ver = "9.1.0")
    gosub installAcro911
    gosub acroVerCheck
  endif
Example update install
:installAcro911
? "Installing Acrobat 9.1.1 upgrade"
SHELL '%comspec% /c msiexec /p "$repopath\AcrobatUpd911_all_incr.msp" /qn /norestart REINSTALL=ALL REINSTALLMODE=omus'
Copy ("generic.tch") ("$touchpath\adobeupgrade911.tch") /H
? "Acrobat 9.1.1 upgrade complete"
return

In the above code I use a "copy generic.tch" command; this is just an empty file I copy to the local file system, allowing me to quickly check the current update level of Acrobat 9.x. You can remove this step if you wish.

I'm using the "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall" registry key to check for the current version. I tried reading versions from files in the Acrobat folder and checking the Acrobat registry key but both were unreliable.
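
The version check subroutine is along these lines; note the product GUID below is a placeholder, so look up the actual Acrobat entry under the Uninstall key on one of your machines:

:acroVerCheck
; read the installed Acrobat version from the Uninstall key
$ver = ReadValue("HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\{AC76BA86-1033-0000-7760-000000000004}", "DisplayVersion")
return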

The script is available here from my GitHub, enjoy!

Tuesday 12 February 2013

Decreasing Windows 7 and Xendesktop logon time

A Windows domain environment provides many benefits such as group policy, software deployment and the ability to customize user settings. With the benefits also come increased logon times and the lag associated with automation, GPO and drive/printer mappings.

In a typical situation, when a user logs onto a machine for the first time their profile is created, and during future logons there is no requirement to re-apply settings/group policy unless the policy has changed. Citrix have addressed the profile situation within Xendesktop environments with their Citrix Profile Manager (CPM). CPM can be perfect for some situations, but not all. What if you want your users' settings sanitized after every logon? What if you need a clean slate, or don't want to manage/delete problematic profiles as they arise?

If you go without CPM then logons are invariably slower due to the re-creation of %userprofile% and the collation of policies into HKCU on every logon. This is where creating a custom default profile can be handy. If you pre-create the profile and then remove some of the Windows customization stubs, you can cut valuable seconds off your logon time.

For example, my default Xendesktop logon time was around 1 minute for users without CPM. Once I added a custom default profile it dropped to around 45 seconds, and removing some of Windows' default customization stubs dropped the time even further to 40 seconds. If a 20 second reduction doesn't sound like much, just ask your end users that have to endure an eternity of Windows welcome screens taunting them with a never-ending spinning circle.



Creating a custom default user profile

Microsoft suggest using their CopyProfile unattend.xml method, which does work well and is the only Microsoft-supported method of overriding the default user profile in Windows 7. Unfortunately for those that already have a working and highly customized Citrix vDisk, the thought of sysprepping might not be the most welcome idea.

The other method is to do an old-fashioned override of the default user profile. However there are some caveats with this: it's unsupported by Microsoft and there can be issues such as the My Documents folder being named the same as the account from which you overrode the default user profile. I have found no such issues with my Xendesktop profiles and I used the override method. If you do choose to use this method, please test robustly.

An extremely handy tool for the override method is Windows Enabler; it un-greys (for lack of a better term) the "Copy to..." profile box under the Windows User Profiles control panel applet.



I would however suggest that if you are planning to use a customized default profile with a flat Windows 7 (non-virtualized) deployment, you use the CopyProfile method, perhaps as part of your SCCM/MDT deployment process.



Removing customization stubs to decrease logon time

Even more frustrating than waiting at the welcome screen is getting past it, then realizing you're going to have to wait another 15-30 seconds for Windows to "personalize your settings"; this is where customization stubs come in.

Under the registry path "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Active Setup\Installed Components\" there are a number of listed IDs in the format "{2C7339CF-2B09-4501-B3F3-F3508C9228ED}". Within some of these IDs is a REG_EXPAND_SZ value named "StubPath".

When a user logs on, regardless of whether the default user profile already contains the required settings, any StubPath commands in this registry path are executed, costing you valuable milliseconds during logon. We can speed up the logon simply by removing the stubs we don't need.

You can remove the stubs you want by searching for "stubpath" within the "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Active Setup\Installed Components\" key, the "(Default)" value will tell you what the stub in question is responsible for.


Below is an example of some of the stubs I remove by simply applying the below .reg file to my vDisk.


Windows Registry Editor Version 5.00
;IE9
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Active Setup\Installed Components\>{26923b43-4d38-484f-9b9e-de460746276c}]
"StubPath"=-
;Browser
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Active Setup\Installed Components\>{60B49E34-C7CC-11D0-8953-00A0C90347FF}]
"StubPath"=-
;Themes
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Active Setup\Installed Components\{2C7339CF-2B09-4501-B3F3-F3508C9228ED}]
"StubPath"=-
;MailNews
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Active Setup\Installed Components\{44BBA840-CC51-11CF-AAFA-00AA00B6015C}]
"StubPath"=-
;WMP 12.0
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Active Setup\Installed Components\{6BF52A52-394A-11d3-B153-00C04F79FAA6}]
"StubPath"=-
You should test, test and then test again when removing any of these stubpaths as they can cause unintended consequences. In fact unless you are using a customized default user profile it is probably safest to leave stubpaths alone. 

For example, if you remove the Windows Theme stub without populating the default user profile, you will end up with a classic theme (Windows XP style). It is the Windows theme stub that handles applying the Windows Basic and, where possible, Aero themes during logon.


Let's hope these couple of simple changes can improve your users' experience.

Friday 1 February 2013

IBM AutoLoader TS2900 DPM 2012 setup tips


At the end of 2012 I received a new IBM TS2900 autoloader. I didn't have time to set it up, so for a couple of months it was being used as a glorified tape drive.

Last month I got around to configuring DPM 2012 to use the autoloader and hit a few speed bumps along the way. Hopefully the below information can help you set up your device quicker.

Just for reference my platform is Server 2008 R2 running DPM 2012 SP1 with Update Rollup 1.



I can't see the Autoloader in DPM or Device Manager

Hold on, didn't I just buy an autoloader? So why can I only see the tape drive in Device Manager and DPM? This took me a while to work out, but it's really easy to resolve.

The tape drive must be set to random mode; if you are using sequential mode your autoloader is essentially a manually controlled tape drive. As soon as you set random mode, the autoloader is presented to Windows. You can set random mode by logging into your TS2900 front end and following the below steps.

  1. Click on "Logical", located under the "Configure Library" menu.
  2. Select "Random" from the "Library Mode:" drop down box.
  3. Click submit.
If you are running sequential mode, you might see the below error messages when attempting to install the IBM driver. They are caused because random mode has not been selected and the autoloader is not being presented to Windows, hence there is no device for the driver to install.

DBG:         install_exclusive.c, 1239: InstallVirtualBus: UpdateDriverForPlugAndPlayDevices failed - Update.
EXT: 0 -> -1: install_exclusive.c, 1249: InstallVirtualBus: status 0xe0000235.
DBG:         install_exclusive.c, 885: InstallVirtualBusByType: InstallVirtualBus failed
Program stopped prematurely due to error(s).If the debug flag was set, check debug.txt for details.


Driver Installation

Next you need to install the IBM driver from the IBM download centre. The one I am using at the time of writing is named "IBMTape.x64_w08_6233_WHQL_Cert.zip", but I have had success with non-WHQL versions as well. There is a small tweak required to get the driver working with DPM; follow the below steps to install the driver correctly.

  1. Extract the zip and install the driver by clicking on "install_exclusive.exe"
  2. Wait for the driver installation to complete, then open up Control Panel > Device Manager
  3. You should see the "IBM TotalStorage 3572 Tape Library" under "Medium Changer devices", this is your autoloader. If you don't see it, try uninstalling the driver, rebooting and re-installing the driver. Also ensure you followed the above steps to enable Random library mode on your autoloader.
  4. Right click the "IBM ULTRIUM 5 HH 3580 TAPE DRIVE" (or your equivalent Tape driver), select "Update Driver Software".
  5. Select "Browse my computer for driver software"
  6. Select "Let me pick from a list of device drivers on my computer"
  7. Select "LTO Tape Drive"
  8. Install the driver and close device manager.


This process replaces the recently installed IBM LTO tape driver with the default Microsoft driver. This is required as DPM (as of 2012 SP1 Update Rollup 1) doesn't work with the IBM-provided tape driver. You must still install the IBM driver however, as the autoloader ("IBM TotalStorage 3572 Tape Library") does require the IBM driver package to work.

Failure to replace the IBM drivers with the Microsoft drivers will result in the error message below.
The operation failed because of a protection agent failure. (ID 998 Details: The parameter is incorrect (0x80070057))
There you have it: a working TS2900 autoloader, albeit with a few quirky workarounds. Since following the above steps mine hasn't missed a beat.



Thursday 10 January 2013

Citrix Xenserver 6.1 Xentools installation problems

I really do love the Citrix Xendesktop platform and the products associated with it, but all too often Citrix have launch issues with their products. The latest is the Xentools 6.1 installer being a little dodgy, with the lack of features (no PVS/VSS support) a key problem.
I also experienced a number of issues upgrading from Xentools standard 6.0 to 6.1 on machines that didn't require PVS support.
One of the main issues I had is what Citrix call "continuous reboots with standard tools installation never finishing". Citrix provide the following explanation for the problem:
"This issue occurs when attempting to install the Standard Tools shipped with XenServer 6.1 into a VM that has no virtual network interfaces. A workaround is to create at least one virtual network interface, install the Standard Tools, and then remove the virtual network interface (if so desired)."
In my case there was in fact a virtual network interface and I still had the reboot looping. Eventually, after around 10 reboots, the installer simply hung at the very start of the "installing drivers, installing guest tools" screen and went no further.
Windows event logs gave no clues and the Citrix logs didn't contain any useful information. After following the "Uninstalling the Xenserver 6.1 standard tools" steps from the CTX135099 Xenserver Tools Workaround Guide for 6.1.0, including removing all the Windows driver packages manually, I still had no joy. In fact on some systems I couldn't remove the "Windows Driver Package - Citrix Systems Inc. (xennet) Net" package.
After scouring Windows Programs and Features and removing everything guest/driver related, I fell back to the trusted "wmic product get name" command to get a list of installed products. I found "Citrix Xen Windows x64 PV Drivers", which wasn't listed in the Windows Programs and Features GUI. It seems the "Citrix Xen Windows x64 PV Drivers" package had failed to uninstall or install properly and was holding up the Xentools installation process. Resolving the problem is fairly simple.
  1. Boot into Windows and open a command prompt
  2. Issue the command - wmic product where name='Citrix Xen Windows x64 PV Drivers' call uninstall
  3. Reboot
  4. Rerun the xentools installation process
After following the above steps I finally had a working Xentools standard install, with statistics reporting back to Xencenter.

Remember if you want to install the legacy tools (with support for PVS and volume shadow copy) then your VM must have its platform:device_id set to 0001. You can read more about changing the device_id under the section title "Preparing to Install the XenServer 6.0.2 Hotfix 9 Tools or XenServer 6.1 Legacy Tools in a New Windows Vista, Windows 7, Windows Server 2008, or Windows Server 2008 R2 VM (for PVS or VSS Support)" of CTX135099.

Friday 4 January 2013

Mass local administrator password change tool for Windows servers/desktops

I have been a bad admin for a while, using a single complex password as the local administrator password on a number of my servers. This certainly isn't best practice and has been on my cleanup list for a long time.

I wrote a simple VBScript that utilizes the famous PsTools pspasswd utility; it's fairly dirty but it gets the job done.

The script reads the computer names from a file, picks a random 20 character password, changes the local administrator password with pspasswd and then logs the passwords to a CSV called "passwords.csv".

Syntax: cscript changepass.vbs <computernamelist.txt>
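
If you just want the gist before downloading, the core of the script is along these lines. This is a sketch only; the input handling, the "administrator" account name and the pspasswd.exe location (assumed to be in the PATH) are assumptions.

' changepass.vbs - sketch: random 20 character passwords set via pspasswd
Option Explicit
Dim fso, shell, inFile, outFile, computer, password, i
' alphanumeric only, to keep command-line quoting simple
Const CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"

Set fso = CreateObject("Scripting.FileSystemObject")
Set shell = CreateObject("WScript.Shell")
Set inFile = fso.OpenTextFile(WScript.Arguments(0), 1)       ' computer name list
Set outFile = fso.OpenTextFile("passwords.csv", 8, True)     ' append/create log

Randomize
Do Until inFile.AtEndOfStream
    computer = Trim(inFile.ReadLine)
    If computer <> "" Then
        password = ""
        For i = 1 To 20
            password = password & Mid(CHARS, Int(Rnd * Len(CHARS)) + 1, 1)
        Next
        ' pspasswd.exe \\computer administrator <newpassword>
        shell.Run "pspasswd.exe \\" & computer & " administrator " & password, 0, True
        outFile.WriteLine computer & "," & password
    End If
Loop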

You can download the package from my Github

Thursday 3 January 2013

Monitoring HP switches with Nagios - OIDs and Trap Filtering

I am not the biggest fan of HP PCM; even the paid version, PCM+, is very clunky, has reasonably limited functionality and is expensive for what it is.

We recently had a situation where one of the power supplies in an edge switch died and we didn't receive any notifications, as HP PCM failed to alert us. Luckily it was noticed quickly and there was no associated downtime, but things could have been much worse. This spurred me into action: what other options were available and how could I implement them?

My main goal is to receive all of the warning/critical traps from my switches. I will also do some SNMP polling to log statistics such as CPU usage and chassis temperature. The beauty of traps is that they are real time, so you don't have to wait for a polling interval to find out your chassis is too hot or one of the power supplies just died.
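
As an illustration of the polling side, a stock Nagios check_snmp poll of CPU usage looks something like the below. The OID shown is hpSwitchCpuStat from the HP MIB; verify it against your switch model, and substitute your own switch address and community string.

check_snmp -H 192.168.1.10 -C public -o .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0 -w 70 -c 90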



SCOM 2012 is not quite ready

One of the most anticipated network monitoring features was the native SNMP engine in Microsoft SCOM 2012. I could go on forever about the SCOM 2012 SNMP implementation, but let me point out a few key problems and the reasons why I decided on Nagios over SCOM.

1. Every device you receive traps from must be monitored. This doesn't sound like a big deal, and for most network devices it's not, but what you may not realize is that for SCOM to receive traps from a device, that device must also be pollable by SNMP. So take your average Unix box: you may want to send traps from it without running an SNMP daemon. That's not possible with SCOM 2012.

2. Another limitation is that you can only receive traps in the same version of SNMP that SCOM is monitoring the device with. So if you are monitoring a Unix box with SNMP v3 for security and want to send traps via v1, that's not possible either. Furthermore, some HP switches (5400, 26XX series) send traps in v1, and there's no way I am going to monitor my switches with v1, even in read-only mode.

These are two key deal breakers that make Nagios a much more appealing option.



But isn't Nagios difficult to set up?

While Nagios can be a pain to initially learn and set up, once you have everything in place it really is easy to manage and maintain. I decided to go with Centreon, the Nagios front-end GUI, which not only makes Nagios configuration easier but also adds some extra options. An even easier option is to use the Fully Automated Nagios (FAN) distribution, which as its name suggests is a pre-configured Nagios/Centreon distro.

I am not going to do a step by step guide on setting up FAN, Nagios or Centreon, but I do have some very useful information for those wanting to collect traps from HP switches with FAN.



The difference between trapping and polling

When collecting traps, a single "trap OID" is used.

When polling an HP switch for information with SNMP, you use a separate OID for each statistic you want to collect (CPU usage, chassis temperature, etc.). When receiving information via traps, however, all information arrives on the switch's single trap OID.

This can create challenges around how to trigger alerts based on the text sent with the trap. Chances are, if you are trapping "not-info" events (meaning the switch only sends warning/critical traps), you will want to receive a notification for every trap. However, if you do get repeated traps that you don't wish to be notified about, there is no way to ignore traps based on their text with a default Centreon/Nagios "catch all traps" receiver.

Trapping not-info events on an HP switch can be enabled with one of the following commands, depending on your switch model:

snmp-server host 192.168.1.1 community public trap-level not-info
snmp-server host 192.168.1.1 public not-info
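
On most ProCurve models you can then confirm the trap receiver and trap level took effect with:

show snmp-server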



Receiving and processing traps with Centreon

By default Centreon only catches traps that have MIBs configured; however, there is a simple guide on the Centreon wiki that should get you all set up. That guide helps you create a generic OID that Centreon assigns to any unknown/unmatched trap it receives, allowing you to generate alerts based on those traps.

What you will probably also want to do is configure custom traps for the OIDs that each of your switches uses to send traps. This gives you some granularity in terms of notifications and lets you filter out traps that aren't important to a specific switch or switch model. You can do this in Centreon under Configuration > Services > SNMP Traps. Below is an example of one of the traps I use for HP switches.


  1. Set the "Trap Name" as you wish; I used "CUSTOM-Switch-HP-Traps_1_5400".
  2. Set the "OID" to the OID your switch uses to send traps. I have listed the trap OIDs I have discovered below.
  3. Set a "Vendor name" as you wish; obviously "HP Networks" might be suitable for any HP device.
  4. Set the "Output Message" as $* - This will ensure you receive all of the text in the trap in your notification.
  5. Set your "Default Status" as warning or critical, depending on how important it is in your environment.
  6. Tick "Submit result" so the status is passed to Nagios.
  7. Save the trap, then click Configuration > Nagios > SNMP traps > Generate. (This last step is important; without it the SNMP daemon doesn't receive the updated trap definition.)
That is a very basic trap definition that you can then attach to a passive Centreon service and receive notifications as the traps arrive.


List of HP trap OID's

These are the trap OIDs I found while testing different HP switches; I would welcome any others that I may have missed. They can be used in conjunction with the above Centreon custom trap guide.


.1.3.6.1.4.1.11.2.3.7.11.50.0.2 - 5406 trap OID
.1.3.6.1.4.1.11.2.3.7.11.51.0.2 - 5412 trap OID
.1.3.6.1.2.1.105.0.1 - 5400 some POE trap OID
.1.3.6.1.4.1.11.2.3.7.11.87.0.2 - 2610al trap OID
.1.3.6.1.4.1.11.2.3.7.11.44.0.2 - 2650 trap OID
.1.3.6.1.4.1.11.2.3.7.11.76.0.2 - 2610-24 trap OID
.1.3.6.1.4.1.11.2.3.7.11.23.0.2 - 4180-gl trap OID
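
If you want to test that Centreon matches one of these without waiting for a real event, you can forge a v1 trap from any Linux box with net-snmp installed. The enterprise OID is everything before the trailing .0.2, so for a 5406 something like the below works; 192.168.1.1 is your Nagios box as per the earlier examples, and 192.168.1.50 is a made-up source address.

snmptrap -v 1 -c public 192.168.1.1 .1.3.6.1.4.1.11.2.3.7.11.50 192.168.1.50 6 2 '' .1.3.6.1.2.1.1.1.0 s "test trap"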



Adding filtering capability to Centreon traps

So you have Centreon set up, you are receiving traps from your HP switches, and they are triggering Nagios service changes and notifications. Unfortunately you have this one annoying trap that fires twice a day as a warning and you get a notification every single time. What can you do to stop it? With a default Centreon install, nothing, but since when did we do anything default?

Just so you understand the back end a little better, the traps flow as follows.

1. Your switch or device generates a trap and sends it to the SNMP daemon running on the Centreon/Nagios box.

2.
(a) If the trap is known, the daemon forwards it to /usr/share/centreon/bin/centTrapHandler-2.x.

(b) If the trap is unknown and you have not configured a catch-all trap, the trap is dropped.

(c) If the trap is unknown and you have configured a catch-all trap, the trap is forwarded to the Centreon unknown trap handler (/usr/share/centreon/bin/snmptt2TrapHandler.pl). It is then given the generic Centreon trap OID of .1.3.6.1.4.1.2021.13.990.0.17 and passed back to centTrapHandler-2.x.

3. centTrapHandler-2.x processes the trap and passes the relevant information to any matching Centreon services, and then on to Nagios.
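
In snmptrapd.conf terms the hand-off looks roughly like the below. The actual entries are generated for you when you export the trap definitions from Centreon, so treat this as illustrative only.

# Known trap OID, straight to the Centreon trap handler
traphandle .1.3.6.1.4.1.11.2.3.7.11.50.0.2 /usr/share/centreon/bin/centTrapHandler-2.x
# Everything else goes via the unknown trap handler
traphandle default /usr/share/centreon/bin/snmptt2TrapHandler.pl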

The default centTrapHandler has no way of filtering traps; all it cares about is matching the trap to a service and changing the status of that service.

I have made some minor modifications to /usr/share/centreon/bin/centTrapHandler-2.x, the trap handler Centreon uses before generating Nagios alerts. These modifications allow traps to be discarded under specific circumstances and also allow the discarded traps to be logged.

You will need my centTrapHandler-2.x.patch, available from my github here, before you get started.


Modifying the centTrapHandler-2.x to allow trap filtering
  1. Download the centTrapHandler-2.x.patch
  2. Backup the current centTrapHandler:
    cd /usr/share/centreon/bin ; cp centTrapHandler-2.x centTrapHandler-2.x.bak
  3. Patch the centTrapHandler:
    patch centTrapHandler-2.x < centTrapHandler-2.x.patch
  4. Create a new directory for ignored trap logging:
    mkdir /var/log/snmp
The new trap handler is now ready to use.



Filtering known traps with Centreon

Now that you have added the capability to filter traps, there are a few things you need to know before applying filters. The following all need to be true for a trap to be filtered.
  1. The service must be passive.
  2. The service must contain the keyword "_LOG". You can change the keyword in /usr/share/centreon/bin/centTrapHandler-2.x by searching for /_LOG/ and replacing it with /yourKeyWordHere/ (see the one-liner after this list).
  3. cust_unknownSkipEnable must be set to 1 in /usr/share/centreon/bin/centTrapHandler-2.x. This gives you an easy way to enable and disable the filtering as required.
  4. The trap must trigger as status "Unknown", which we can do using the "Advanced matching" capabilities of Centreon trap definitions.

If you have existing services, you will need to rename them to include _LOG in the title to support trap filtering.
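
For item 2 above, the keyword swap is a one-liner; assuming the keyword appears exactly as /_LOG/ in the handler at the stock path, something like:

sed -i 's|/_LOG/|/yourKeyWordHere/|g' /usr/share/centreon/bin/centTrapHandler-2.x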

To set up the filters you need to do the following:
  1. Go to your Centreon trap definitions:
    Centreon > Configuration > Services > SNMP Traps
  2. Open the trap in question. This should be one of the custom traps associated with the trap OID of your switch that you created.
  3. Enable "Advanced matching mode" and create a new "Advanced matching rule" with the following properties.
    String: @OUTPUT@
    Regexp: /your matching text here/ - For example: /port (.*) is Blocked by STP/ to filter out an STP warning
    Status: Unknown
  4. Save the changes
  5. Export the trap definition to Nagios.
    Configuration > Nagios > SNMP traps > Generate

Now when your matching text is received by the trap handler, it will be matched, set as Unknown and then dropped and logged by the centTrapHandler.

If you need to troubleshoot trap filtering, you can check the logs in /var/log/snmp: snmptrap_ignored.log contains your matched/ignored traps, and snmptrap_logging.log records all traps received by passive services.
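
To watch the filtering in action you can simply follow both logs:

tail -f /var/log/snmp/snmptrap_ignored.log /var/log/snmp/snmptrap_logging.log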