Monday, July 30, 2007

The Myth of the Self-Monitoring WLAN

Recently, as you all probably know by now, Duke University had a WLAN meltdown. The CIO, Tracy Futhey (Comment here), and the assistant IT director, Kevin Miller (Comment here), have put to rest the notion that the Apple iPhone caused it. Cisco has issued an advisory to that effect, and Apple assisted in the effort.

I am not going to go into the details of what happened or why. Suffice it to say that mobile handhelds of all types, not just iPhones, send a lot of ARP traffic, and the Cisco infrastructure was not ready for it. As Network World explains, "The advisory finally makes it clear that the iPhone simply triggered the ARP storms that were made possible by the controller vulnerabilities. Any other wireless client device, moving from one subnet to another apparently could have done the same thing."
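To put a rough picture on what an ARP storm looks like from the wire, here is a minimal sketch in Python using the scapy library. This is purely illustrative and is not what Duke, Cisco, or Apple actually used; the one-second window and the 100-frame threshold are arbitrary example values.

# Minimal sketch: watch for an ARP storm by counting ARP frames per window.
# Requires scapy and privileges to sniff; threshold and window are example values.
import time
from collections import Counter

from scapy.all import sniff, ARP

THRESHOLD = 100   # ARP frames in one window considered suspicious (arbitrary)
WINDOW = 1.0      # seconds per counting window

def watch_for_arp_storm(duration=60):
    """Sniff ARP traffic and report any window that exceeds THRESHOLD."""
    end = time.time() + duration
    while time.time() < end:
        senders = Counter()
        # Capture only ARP frames for one window (BPF filter "arp").
        for pkt in sniff(filter="arp", store=True, timeout=WINDOW):
            if pkt.haslayer(ARP):
                senders[pkt[ARP].hwsrc] += 1
        total = sum(senders.values())
        if total > THRESHOLD:
            print(f"Possible ARP storm: {total} ARP frames in {WINDOW}s; "
                  f"top senders: {senders.most_common(3)}")

if __name__ == "__main__":
    watch_for_arp_storm()

Nothing fancy, but it is the kind of first-order check that a dashboard which only trusts its own controllers will never show you.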

What I will point out, however, is the problem we in the Wi-Fi community have today with the following simple delusion: "Your WLAN infrastructure as a cohesive, integrated, single-vendor solution is all anybody needs. It is self-monitoring and self-healing." I talk to a lot of people about which WLAN solution they are going to purchase and implement, and I am always surprised by how many believe that the AP and controller vendor has all the answers. Don't get me wrong, I am a huge fan of this type of solution. Central management is critical for even medium-sized organizations of 50 or more APs, much less larger ones that may have a few hundred or even thousands. Manually changing the configuration of each AP is not a viable solution in these cases. The admin needs assistance. And the story sounds so great: "Implement our solution and it will fix itself when it breaks and protect itself when security policies are breached." Who wouldn't want that?

But the truth is a little more complicated. As we have seen in previous posts, sometimes the solution doesn't behave the way your business practices need it to. Similarly, sometimes there are security problems within the infrastructure itself. So what to do?

This will sound like an advertisement for the company I work for, and I apologize ahead of time, but there is a very good reason I continue to work there: mainly, I believe in the message.

When the Duke network went down and the assistant IT director looked at his WLAN infrastructure dashboard, what did he see? I have not spoken with him directly, but my guess would be it said, "Hey man, it ain't me. Everything looks good from my end." So what did he do? He pulled out a sniffer and got to work. With packet traces in hand, and with assistance from Cisco and Apple, he solved the problem. Did the infrastructure fix itself? Did it correctly identify the problem and solution? No. A patch is now needed to keep this from happening again.
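"Pulled out a sniffer and got to work" is the part the dashboards skip. For what it's worth, here is a sketch of the kind of first pass you might make over a saved trace, again in Python with scapy (my choice of tool, not his), tallying which MAC addresses are generating the ARP requests. The capture file name is made up.

# Sketch: first pass over a saved capture, tallying ARP request senders.
# The pcap file name is hypothetical; scapy's rdpcap loads the whole trace.
from collections import Counter

from scapy.all import rdpcap, ARP

def top_arp_talkers(pcap_path, top_n=10):
    """Return the MAC addresses sending the most ARP requests in the trace."""
    requests = Counter()
    for pkt in rdpcap(pcap_path):
        if pkt.haslayer(ARP) and pkt[ARP].op == 1:   # op 1 = who-has (request)
            requests[pkt[ARP].hwsrc] += 1
    return requests.most_common(top_n)

if __name__ == "__main__":
    for mac, count in top_arp_talkers("wlan_meltdown.pcap"):
        print(f"{mac}\t{count} ARP requests")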

One should not blame the infrastructure for not getting this right at the outset, nor should one blame Mr. Miller. He was correctly reading what the controllers were telling him. But it shows how important it is to have a separate, 3rd party solution also available to get down to the bits and bytes, or even to spectrum analysis (should the problem be something other than 802.11 protocol madness).

There are a few great WLAN security vendors out there, and they make 3rd party, best-of-breed solutions for monitoring the security of your WLAN (one of which recently got snatched up for pennies on the dollar and will probably be rolled into another integrated, self-healing, self-monitoring role, against my better judgment). There are an even smaller number who monitor both your security and your connectivity and performance, and give you great troubleshooting tools built in (insert shameless plug here). These should be your trusted advisors when things go wrong. I am in no way suggesting that they would have identified the problem at Duke, its cause, and a solution either (although I think they at least would have shown alerts for denial of service and strange traffic behavior). What I am suggesting is that with them in place you now have a set of tools to assist in solving the problem. Remote packet and/or spectrum analysis. Alarm thresholds that can be set by the admin and that will continue surveillance. Reports. System-to-system notifications. Graphs of speed and traffic type. Lists of who is connected to what and how. All the things you would need to get to the bottom of any problem in that invisible Luminiferous Ether.
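To make "alarm thresholds that can be set by the admin" a little more concrete, here is a toy sketch of that logic in Python. It is generic, not any vendor's product (mine included): an admin registers thresholds on named metrics, and the monitor raises an alert whenever a per-client sample crosses one. The metric names and values are purely illustrative.

# Generic sketch of admin-settable alarm thresholds for a WLAN monitor.
# Metric names, thresholds, and the alert action are all illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Threshold:
    metric: str                 # e.g. "arp_frames_per_sec", "retry_rate_pct"
    limit: float                # admin-chosen value
    comparison: str = "above"   # alert when the sample is "above" or "below" the limit

class WlanMonitor:
    def __init__(self, alert: Callable[[str], None] = print):
        self.thresholds: List[Threshold] = []
        self.alert = alert

    def add_threshold(self, threshold: Threshold) -> None:
        """Admins register the conditions they care about."""
        self.thresholds.append(threshold)

    def ingest_sample(self, client: str, metrics: Dict[str, float]) -> None:
        """Called with each polling interval's worth of per-client metrics."""
        for t in self.thresholds:
            value = metrics.get(t.metric)
            if value is None:
                continue
            breached = value > t.limit if t.comparison == "above" else value < t.limit
            if breached:
                self.alert(f"ALERT: {client} {t.metric}={value} ({t.comparison} {t.limit})")

# Example: flag any client sending more than 50 ARP frames per second.
monitor = WlanMonitor()
monitor.add_threshold(Threshold(metric="arp_frames_per_sec", limit=50))
monitor.ingest_sample("00:1b:63:aa:bb:cc", {"arp_frames_per_sec": 240.0})

A real product would persist these thresholds, correlate across APs, and feed the same data into reports and graphs; the point is only that the thresholds belong to the admin, not to the infrastructure's own idea of what "healthy" means.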

Friday, July 27, 2007

Cisco Ripples - DCA and RRM - Help is on the way

Since I first published "The Ripple Effect" back in February, I have heard from many folks who have validated the effect, but to my chagrin, I have had no solution to offer. Well, thankfully, there are smarter people than me out there, and solutions have started to appear.

I was alerted to the fact that Medical Connectivity Consulting recently put Cisco in their sights and quoted my blog with regard to Dynamic Channel Assignment and RRM causing issues. The Web, being the great time-waster that it is, led me on a journey. As I read the article I clicked here and there, and the next thing I knew I was looking at a forum at Cisco that was talking about this exact phenomenon.

One of the forum posters had some great suggestions for eliminating this problem in the future. Bruce Johnson at Partners Healthcare offered this solution:
"We saw the majority of DCA events were triggered by Interference from Rogue APs. After we disabled Foreign AP Avoidance the number of channel changes dropped by an entire order of magnitude (1000s to 100s). We disabled Cisco AP Load Avoidance and this reduced the number of DCAs within an order of magnitude (100s less).

DTPC will power-up APs to max levels to provide a 3-neighbor -65 RSSI coverage "grid" and 7921s will power up to follow suit (up to their max Tx Power). Other clients with higher Tx power may send the APs to max power causing a mismatch with IP phones.

You can decrease the tx-power-threshold so the "grid" won't be as hot (default is -65, change to -71 or -74):

config advanced 802.11a tx-power-control-thresh <-50 to -80>
config advanced 802.11b tx-power-control-thresh <-50 to -80>

and reduce the coverage hole detection threshold (reduce Min SNR level in RRM Thresholds) to suppress the power-up activity."
Bruce seemed on track with this fix. The problem is that it isn't really a fix: it shuts off RRM and DCA so that the WLAN remains stable. So where is the benefit of a controller-based system?
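If you want to quantify this on your own WLAN the way Bruce's numbers suggest (thousands of channel changes dropping to hundreds), the job is mostly bookkeeping: export the controller's trap or syslog history and count channel-change events per day, before and after the tuning. Below is a rough Python sketch. The regular expression is a placeholder of my own, not Cisco's actual message format, so adjust it to whatever your controller really logs.

# Rough sketch: count channel-change (DCA) events per day from an exported
# controller log. The regular expression is a placeholder, NOT Cisco's actual
# message format; adapt it to match your controller's trap/syslog output.
import re
from collections import Counter

# Hypothetical pattern: a "Mon DD" timestamp followed by "channel changed" text.
DCA_PATTERN = re.compile(r"^(?P<date>\w{3}\s+\d+)\s+.*channel changed", re.IGNORECASE)

def dca_events_per_day(log_path):
    per_day = Counter()
    with open(log_path) as log:
        for line in log:
            match = DCA_PATTERN.match(line)
            if match:
                per_day[match.group("date")] += 1
    return per_day

if __name__ == "__main__":
    for day, count in sorted(dca_events_per_day("wlc_syslog.txt").items()):
        print(f"{day}: {count} channel changes")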

He does note that a fix is forthcoming from Cisco: "They are revamping the behavior of RRM in the WLC 4.1 Maintenance release." This was later confirmed by a Cisco employee, Saurabh Bhasin, a TME:
"With the 4.1 Maintenance Release(MR) due out on cisco.com shorly, many improvements based on such feedback have been brought into RRM's algorithms ? improvements aimed at allowing administrators to fine-tune their RRM-run WLANs where desired. These enhancements will allow for greater control over both the channel and power output selection algorithms, so administrators may assist RRM in being either more or less aggressive in such decisions, depending on application and network needs. Additionally, enhancements have been made to the management and reporting of all RRM information and configuration alterations to allow for better tracking of RF environmental fluctuations and to assist in keeping track of RRM activity. Further technical detail on the inner workings of these enhancements will be available very soon in an update to the above-mentioned RRM Whitepaper."
The paper he references can be found at http://www.cisco.com/warp/public/114/rrm.html and explains a lot of what we are all seeing. (Here is the PDF version.)

(NOTE: Since publishing this post, Cisco has moved the link. Here is a more recent version. Please double-check with Cisco that you have their latest information.)

So here's hoping that the WLC 4.1 Maintenance Release fixes it. As an aside, Bruce Johnson is skeptical:
"Its all well and good to make things work for Intel and the CCX/CCKM compliant crew, but if you have any of the other brands of WLAN NICs (like those made by medical device manufacturers, who won't subscribe to fast roaming features until they're adopted by the IEEE) you are best keeping RRM disabled until it delivers on its promise as stated in the following 802.11TGv Objectives draft:

Service and Function Objectives

Solutions shall define mechanisms to provide the service listed below.

[Req2000] TGv shall support Dynamic Channel Selection, to allow STAs to avoid interference. Solution shall be able to change the operating channel (and/or band) for the entire BSS during live system operation and be done seamlessly with no intermittent loss of connectivity from the perspective of an associated STA. Solution shall not define algorithm for channel selection."