Saturday, February 3, 2007

The Ripple Effect - Problems with Cisco’s Radio Resource Management (RRM)

(NOTE: This information is out of date. Cisco has changed its RRM calculations and has removed much of the information linked to in this post.)


Introduction:
In its Unified Wireless Network architecture, Cisco has developed patent-pending technology for interference detection and avoidance, dynamic channel assignment, dynamic power adjustment, coverage-hole detection and correction, rogue detection and client load balancing. This system is known as RRM, or Radio Resource Management. Its stated goal is to avoid problems in the fixed ISM band used by 802.11b/g, where only 11 channels are available to U.S. WLANs. The system, though sound in theory, has problems when applied to large WLANs in urban areas or in locales with heavily deployed WLANs, such as metro WiFi, skyscrapers, hospitals, universities and businesses near residential neighborhoods.
Background on Channel Overlap:
Anyone who has configured their own home access point (AP) knows they are allowed to choose a channel for the AP to transmit on. Since APs use Direct Sequence Spread Spectrum (DSSS) technology, each AP's signal actually spreads across about 5 adjacent channels.
If an admin were to configure APs to use all channels in the 802.11b/g spectrum, a serious decrease in available bandwidth would occur and users would experience severe throughput loss. Thus an admin is restricted to configuring his/her APs on 3 non-overlapping channels: 1, 6 and 11. In some cases an admin may opt, out of necessity, to accept a slight overlap and configure a 4-channel plan consisting of channels 1, 4, 7 and 11.
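
To make the overlap concrete, here is a minimal sketch that compares the 3-channel and 4-channel plans. It assumes the usual 2.4 GHz numbering (channel centers at 2407 + 5n MHz) and a signal roughly 22 MHz wide; neither figure comes from this post, and the function names are just for illustration.

```python
# Assumptions (not from the post): channel n is centered at 2407 + 5*n MHz
# and a DSSS signal occupies roughly 22 MHz.
CHANNEL_WIDTH_MHZ = 22


def center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz channel (channels 1-11)."""
    return 2407 + 5 * channel


def overlap_mhz(a: int, b: int) -> int:
    """Spectrum shared by two channels, in MHz (0 means no overlap)."""
    separation = abs(center_mhz(a) - center_mhz(b))
    return max(0, CHANNEL_WIDTH_MHZ - separation)


def worst_overlap(plan) -> int:
    """Largest overlap between neighboring channels in a channel plan."""
    ordered = sorted(plan)
    return max(overlap_mhz(a, b) for a, b in zip(ordered, ordered[1:]))


print(worst_overlap([1, 6, 11]))     # 0
print(worst_overlap([1, 4, 7, 11]))  # 7
```

Under these assumptions the 1/6/11 plan has no shared spectrum at all, while the 1/4/7/11 plan gives up about 7 MHz between its closest channels in exchange for a fourth channel.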

WLAN planning and Site Surveying:
Administrators then need to plan out their deployment so that each AP avoids overlapping its coverage with another AP on the same channel. APs must have their power adjusted to compensate for walls and for coverage gaps that may arise when a building is not a standard rectangular shape, or when neighbors move in and configure their AP on a channel already used by the admin's organization. Each adjustment in power increases or decreases the size of that AP's cell, so further adjustments to the other APs will then be needed. Lastly, the admin must plan for areas where usage may change very dynamically, such as conference rooms and auditoriums. As one can see, this is really an art, and a whole industry has evolved around designing wireless networks. Usually a site survey is needed to map out the existing neighbor APs as well as to plan where to place the new APs. Surveys are also recommended from time to time to adjust to changes that may happen around the organization as well as within it.
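
As a rough illustration of how a power change resizes a cell, the sketch below uses the textbook log-distance path-loss model. The model and the exponent values are assumptions on my part, not something a site survey tool or Cisco prescribes, and real buildings will behave less neatly.

```python
def radius_scale(delta_db: float, path_loss_exponent: float = 2.0) -> float:
    """Factor by which the cell radius changes for a transmit power change.

    Assumes a log-distance path-loss model: received power falls off as
    10 * n * log10(distance), with n = 2 in free space and larger indoors.
    """
    return 10 ** (delta_db / (10 * path_loss_exponent))


# Raising power by 6 dB in free space roughly doubles the radius.
print(round(radius_scale(6), 2))

# Lowering power by 3 dB with an indoor-like exponent (3.5 is an assumed value).
print(round(radius_scale(-3, 3.5), 2))
```

The point is only that a few dB either way moves the cell edge noticeably, which is why one power change tends to force changes on the neighboring APs.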

Cisco's Solution:
The Cisco Unified Wireless Network (UWN) architecture hopes to avoid this problem by sensing the types of problems that occur in WLANs and automatically compensating. Problems such as:
  • A neighbor moving in next door or upstairs and implementing APs that overlap yours
  • Coverage gaps that occur when walls, cubicles and other furniture are moved, added or removed
  • Loss in throughput when people, who are roughly 60% water, move around in a company and group together in conference rooms or other areas (water attenuates or "blocks" radio waves)
Cisco has (had) a brief description on its website HERE and a much more in-depth description HERE.
On that second page Cisco describes how this works under the section entitled "Radio Resource Monitoring":


Management of an RF network requires strong visibility into the factors affecting the air space. Cisco lightweight access points are specially designed to not only offer service, but to also monitor all channels at the same time. This is a result of the extensive development work Cisco has performed on the 802.11 MAC layer as part of its split MAC architecture.
In addition to offering service, Cisco lightweight access points can simultaneously scan all valid 802.11a/b/g channels for the country of operation, as well as for channels valid in other geographies. This provides the highest level of protection-the system will discover rogue access points that might be imported from other countries, or a hacker that knows how to change the country of operation such that the rogue would be out of band and not detected by most WLAN intrusion detection systems (IDSs).
The Cisco lightweight access point goes "off-channel" for a period not greater than 60 ms to listen to these channels. Packets collected during this time are sent to the Cisco Wireless LAN Controller, where they are analyzed to detect rogue access points (whether service set identifiers [SSIDs] are broadcast or not), rogue clients, ad-hoc clients, and interfering access points.
By default, each access point spends only 0.2 percent of its time off-channel. This is statistically distributed across all access points so that adjacent access points are not scanning at the same time, which could adversely affect WLAN performance. This enables administrators to build a picture of what is happening in their WLANs from the perspective of every access point, and increases network visibility beyond what an overlay network can provide, eliminating the "hidden node" problem that can result when air monitors are deployed for every three to five access points.
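
Some back-of-the-envelope arithmetic on the figures quoted above: if each off-channel visit uses the full 60 ms and the AP spends 0.2 percent of its time off-channel, the numbers imply one visit about every 30 seconds. The sketch below works that out; the one-channel-per-visit rule and the 11-channel count are my assumptions, not something the Cisco text states.

```python
from fractions import Fraction

DWELL_MS = Fraction(60)                    # quoted upper bound per off-channel visit
OFF_CHANNEL_SHARE = Fraction("0.2") / 100  # quoted default: 0.2 percent of the time


def ms_between_visits(dwell_ms=DWELL_MS, share=OFF_CHANNEL_SHARE):
    """Average spacing of off-channel visits implied by the two quoted figures."""
    return dwell_ms / share


def sweep_seconds(num_channels: int):
    """Time to listen once on every channel, assuming one channel per visit."""
    return num_channels * ms_between_visits() / 1000


print(float(ms_between_visits() / 1000))  # 30.0
print(float(sweep_seconds(11)))           # 330.0
```

Under those assumptions a single AP would need several minutes to hear every 2.4 GHz channel once, which gives a feel for how stale any one AP's view of a given channel can be.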

I will not debate the issues around part time scanning in this article; many others have addressed that already. But I will address the next issue which is how Cisco responds once it has discovered any of the aforementioned problems.

When a station has something to say, it announces it to the media. An access point will allow the station to send its data if the medium is open. If not, the station will be told to wait to transmit until other stations using that medium are finished with it. This prevents two clients from transmitting on the same channel at the same time, which would result in corrupted frames.
With CSMA/CA, two access points on the same channel (in the same vicinity) will get half the capacity of two access points on different channels. This becomes an issue, for example, when someone reading e-mail in a café affects the performance of the access point in a neighboring business. Even though these are completely separate networks, someone sending traffic to the café on Channel 1 can cause data corruption in an enterprise using the same channel. Cisco wireless LAN controllers address this problem and other co-channel interference issues by dynamically allocating access point channel assignments to avoid conflict. Since the Cisco lightweight solution has enterprisewide visibility with its RRM tools, channels are "reused" to avoid wasting scarce RF resources. In other words, Channel 1 will be allocated to a different access point far from the café. This is much more effective than not using Channel 1 altogether, which is what other WLAN systems often do.
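
The "half the capacity" claim and the idea of channel reuse can be pictured with a very small model. This is a sketch under my own simplifying assumptions (APs on a flat plane, a single hearing radius, airtime split evenly among co-channel APs in range); it is not how Cisco's controllers compute anything.

```python
import math


def capacity_share(aps, radius):
    """Fraction of airtime each AP gets.

    aps: dict mapping a name to (x, y, channel).
    An AP shares its channel equally with every co-channel AP within `radius`
    (a simplifying assumption; real CSMA/CA contention is messier).
    """
    shares = {}
    for name, (x, y, channel) in aps.items():
        contenders = sum(
            1
            for (ox, oy, other_channel) in aps.values()
            if other_channel == channel and math.hypot(x - ox, y - oy) <= radius
        )  # counts the AP itself, so this is always at least 1
        shares[name] = 1 / contenders
    return shares


# Hypothetical layout: an office AP next to a cafe AP, plus a far wing.
aps = {"office": (0, 0, 1), "cafe": (10, 0, 1), "far_wing": (100, 0, 1)}
print(capacity_share(aps, radius=30))

# Moving only the office AP off Channel 1 restores full airtime everywhere,
# while the far wing keeps "reusing" Channel 1.
aps["office"] = (0, 0, 6)
print(capacity_share(aps, radius=30))
```

In the first layout the office and cafe APs each get half the airtime while the far wing is unaffected; after the move every AP has the channel to itself.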

Later in the same document, Cisco describes a similar situation under the heading of interference:


"Interference" is defined as any 802.11 traffic that is not part of the Cisco WLAN system, including a rogue access point, a Bluetooth device, or a neighboring WLAN. Cisco lightweight access points are constantly scanning all channels looking for major sources of interference (Figure 3).
If the amount of 802.11 interference exceeds a predefined threshold (the default is 10 percent), a trap is sent to the Cisco Wireless Control System (WCS). The Cisco Wireless LAN Controller will attempt to rearrange channel assignments to increase system performance in the presence of the interference.
Again, I will refrain from diving too deep into interference sources, as Cisco does not even have a way to detect, much less respond to, such non-802.11 interferers as cordless phones, baby monitors, wireless cameras, DECT phones and headsets.


The Problem:
When you have a large number of APs implemented and you are covering a large area, the Cisco system will adjust almost continuously to compensate for rogues, neighbors and interferers. As you add more and more interferers in and around the WLAN, more and more adjustments must be made to compensate for them. As the compensations take place, they run into adjustments coming the other direction from the other side of the building, and you get a huge ripple effect that will in some cases cancel out adjustments and in others build up into over-adjustments. The WLAN starts to behave like a wave interference experiment.


Example:
Let us say that we are in a hospital in San Francisco where the average number of APs per block is around a hundred. The hospital has 20 APs per floor and 10 floors in the main building. That's 200 APs, which is quite a large number. This hospital, since it is in an urban setting has many neighbors, many of whom also have APs.
In a typical situation a neighbor to the hospital puts an AP on Channel 1. The Cisco architecture senses this and adjusts to compensate, moving APs from adjacent channels to ones farther away. At or around the same time, but on the other side of the hospital, another neighbor appears, this time with an AP on Channel 11. A similar adjustment occurs there. At some point the two waves of adjustments meet or cross in the middle. This is possible because the split MAC architecture of the Cisco UWN has many decisions made in its WLAN controllers. These controllers are distributed and can act semi-independently. By the time the wave reaches the other side of the hospital, the system realizes it is again being interfered with and readjusts.
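
The two-sided scenario can be sketched as a toy model. To be clear, this is not Cisco's algorithm: the decision rule (`pick`), the two-AP layout and the idea that "semi-independent" means both APs decide from the same stale snapshot are all assumptions made for illustration.

```python
CHANNELS = (1, 6, 11)


def pick(current, heard):
    """Stay put if nothing heard is on our channel, else take the lowest clear one."""
    if current not in heard:
        return current
    clear = [c for c in CHANNELS if c not in heard]
    return clear[0] if clear else current


def heard_by(i, plan, links, foreign):
    """Channels AP i hears: the APs it is linked to plus its outside neighbors."""
    return {plan[j] for j in links[i]} | set(foreign[i])


def run(plan, links, foreign, simultaneous, max_rounds=20):
    """Return (final plan, total channel changes).

    simultaneous=True: every AP decides from the same snapshot of the old plan.
    simultaneous=False: APs decide one at a time and see each other's moves.
    """
    plan = list(plan)
    changes = 0
    for _ in range(max_rounds):
        before = list(plan)
        for i in range(len(plan)):
            view = before if simultaneous else plan
            plan[i] = pick(plan[i], heard_by(i, view, links, foreign))
        moved = sum(a != b for a, b in zip(before, plan))
        if moved == 0:
            break
        changes += moved
    return plan, changes


# Two hospital APs that hear each other. A neighbor on Channel 1 appears next
# to AP 0 and a neighbor on Channel 11 appears next to AP 1.
links = {0: [1], 1: [0]}
foreign = {0: [1], 1: [11]}

print(run([1, 11], links, foreign, simultaneous=True))   # ([11, 1], 4)
print(run([1, 11], links, foreign, simultaneous=False))  # ([6, 1], 2)
```

With simultaneous decisions both APs jump to Channel 6, collide with each other, and have to move again, which doubles the number of channel changes. This small case settles after that extra round, so it only illustrates the over-adjustment, not a ripple that continues indefinitely.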

This wave or ripple action, because it moves across floors and up and down stories, may go on forever. As more neighbors or interferers come online, more waves are sent out. The larger the implementation, the worse the problem gets. The effect is readily visible and measurable to anyone with a WLAN analyzer. They will see MAC addresses hopping from one channel to the next on a second-by-second basis. The APs will also be changing output power continuously, so the signal will be rising and falling.
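
If you can export what your analyzer sees, counting the hops is straightforward. The sketch below assumes a hypothetical export of (timestamp, BSSID, channel) observations, for example one row per beacon; the format, the function names and the threshold are illustrative and not tied to any particular analyzer.

```python
from collections import defaultdict


def channel_changes(observations):
    """Count channel changes per BSSID.

    observations: iterable of (timestamp_seconds, bssid, channel) tuples,
    a hypothetical export format.
    """
    last_channel = {}
    changes = defaultdict(int)
    for _, bssid, channel in sorted(observations):
        if bssid in last_channel and last_channel[bssid] != channel:
            changes[bssid] += 1
        last_channel[bssid] = channel
    return dict(changes)


def flapping(observations, max_changes):
    """BSSIDs that changed channel more than max_changes times."""
    counts = channel_changes(observations)
    return sorted(bssid for bssid, n in counts.items() if n > max_changes)


# Made-up sample data: one AP hopping, one AP sitting still.
obs = [
    (0, "00:00:5e:00:53:01", 1),
    (10, "00:00:5e:00:53:01", 6),
    (20, "00:00:5e:00:53:01", 11),
    (30, "00:00:5e:00:53:01", 6),
    (0, "00:00:5e:00:53:02", 11),
    (30, "00:00:5e:00:53:02", 11),
]
print(channel_changes(obs))          # {'00:00:5e:00:53:01': 3}
print(flapping(obs, max_changes=2))  # ['00:00:5e:00:53:01']
```

Run over a day of captures, a count like this gives you a number to put in front of a vendor rather than an impression.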


Effects of the "Ripple":
The net effect of this phenomenon is a serious decrease in throughput and a large increase in latency. If you use your WLAN for applications that need low latency or high throughput such as VOIP over a WLAN (known as VoWLAN or VoFi) or you have low power handhelds such as the kind used for barcode scanning, this network is unusable. The VoFi traffic will be filled with jitter and conversations will be choppy at best. The handhelds will never be able to sleep or go to low power as they will always be probing for changes to the environment. If the system had been statically mapped to specific channels that do not change, the WLAN would have had problems, for certain, but these problems would be affecting just the few APs that face the neighbors. Now that all the APs are reconfiguring continuously, the whole WLAN is affected all the time.
WLAN STAs that are associated and attempting to pass data will continuously be probing for new channels and APs to associate with. The amount of roaming will go up dramatically. Roaming takes a few seconds to complete so the problem will be very serious for the end user.

Cisco even mentions this problem in one of its release notes for the CB21AG card, found HERE:


CSCse49324-CB21AG retransmission mechanism has problems with RRM in LWAPP network
A CB21AG client that is operating in an LWAPP infrastructure loses connection for small periods of time. When the AP is performing radio resource management (RRM), the AP goes off channel. During these periods, the AP cannot hear and answer ACK and RTS frames from the client. The client card initiates a scan for another AP, and network traffic for the client is affected.
Workaround: Increase the HwTxRetries value from 4 to 14 (registry entry) so that the client card continues to retry for the 20 to 30 milliseconds that the AP is off channel.

SpectraLink and other VoWLAN vendors specifically warn their customers not to deploy the Cisco UWN architecture with RRM enabled. When a WLAN needs to support voice, the requirements for stability increase dramatically.


Conclusion:
The idea behind automatically adjusting and configuring networks is a good one. Maybe sometime in the near future Cisco will program its controllers to avoid this type of effect. In the meantime, unless you have a fairly small network or are located far from interference sources and neighbors, you are urged to complete a thorough site survey, statically map each AP to a channel, and resurvey from time to time.

4 comments:

  1. Bruce:

    Before Cisco purchased the technology from Airespace, they had already put dampeners in the RRM so the hysteresis you describe wouldn't occur.

    Besides the rumors that I have also heard, do you have any personal testimonials backing up your claims concerning this ripple effect? When that happened, what did Cisco recommend to their customer?

    BTW, DECT runs at 1920-1930 MHz in the U.S., of no concern for 2.4 GHz WLANs.

    Regards,

    Frank

  2. Frank,

    First off, thanks for visiting. Always good to chat with the press. Secondly, I need to make clear that all comments on this blog are mine and mine alone and do not represent the official policy of my employers or staff.

    That being said, I have also heard the claim that some sort of dampener was used, but my experience has been different. As I state in the blog, the effect is easily measurable with an analyzer, and I have seen it at various locations myself. These include several hospitals. Additionally, I have spoken directly with people in charge of deploying VoIP over WiFi for SpectraLink, and it is their stated policy to disable RRM and map the devices statically. They seem to have had too many negative experiences.


    Lastly, regarding DECT: again I agree that DECT as found in cordless handsets in Europe is not at risk of stomping on the 2.4GHz band, but I have seen what Cognio classifies as DECT at almost every office I go to. Cognio's FAQ at http://www.cognio.com/index.php?option=com_content&task=view&id=46&Itemid=32 shows that DECT can be found here in the States in the 2.4GHz range as well.


    DECT
    DECT (Digital Enhanced Cordless Telecommunications) is a ETSI (European) standard for cordless Telephones. There are many variants of this standard implemented in several frequency bands. These classification templates match different DECT variants, and distinguish between Broadcast traffic sent by the Base station (analogous to Beacons in an 802.11 network), and off hook (per Handset) streams used to carry voice data. This protocol is based on a combination of Frequency Hopping, and a TDM (Time Division Multiplexing). Examples include the Panasonic KXTG2357


    Also, aside from experiencing this myself, I have a PDF from the CTO at Cognio that also states and shows DECT operating in the 2.4GHz band, found at http://www.bicsi.org/Events/Conferences/Fall/2006/Presentations/Diener.pdf
    If you go to slides 16 and 17 you will see a cordless DECT phone in the 802.11b/g space.

    Thanks, Bruce

  3. Bruce,

    I can vouch for having observed this recurrent DCA behavior, also in a hospital environment (12-24 channel changes per day across 10 floors of APs, as you depict in your example). The architecture is not alerting us to this being the result of interference or noise (no WLC or WCS events of either type), and the RSSI of rogue APs is above the threshold required for triggering DCA (-85 dB). We have been told by Cisco that the 100mW AP neighbor beacons, used to determine the picture of the network, do not get input into DCA. Cisco claims these 100mW beacons are used only for dynamic power control, which we hold static -- do you think this voids the dynamic algorithms? Other docs say the RSSI of neighbor APs is the most important criterion in DCA behavior! In lieu of noise and interference alerts, we can only surmise it's the APs themselves that are the cause of their own DCA ripple effect.

    Do you have any data to support that it's rogue APs that are the cause of the DCA you have seen? Do you concur with the -85 dB or better rogue AP RSSI threshold as the trigger for a DCA channel change? In these cases do the controllers or WCS report interference issues or just rogue AP alerts?

    Thanks,

    Bruce Johnson
    Network Engineer

  4. btw, the bug you reference above (regarding CB21) was fixed back in 2006...
