Wednesday, July 25, 2007

using NetQoS to diagnose network congestion

We use a backend connection from our corporate offices to our hosting provider. On Monday towards the end of the day, I noticed a big slowdown in our RDP (Remote Desktop) connectivity to our Windows servers at the hosting provider. Monday is normally a very busy day for network traffic, both in the office and on our main website, so I wondered where the traffic was coming from. Luckily, our network engineers had recently purchased a suite of network management tools from NetQoS (http://www.netqos.com/).

The engineers set up NetQoS monitors on the backend pipe so that we can see the traffic going through our T1-speed Frame Relay connection. The nice thing about NetQoS is its ability to break down the traffic going back and forth through the pipe by protocol and port. In the graph below, labeled "Stacked Protocol Trend In", for this past Monday, you'll see a green blob after 15:00.

"Trend In" is data inbound to the monitored router. In our case, inbound means into our corporate network. "Trend Out" is data going outbound from the monitored router, which for us means out to our managed hosting provider.

As indicated by the key on the right side of the diagram, this is a spike in SSH (port 22) traffic:


Interesting! The funny thing is that a similar network performance degradation happened the previous week. I changed the date range for the NetQoS reports and saw an even larger, longer-lasting spike in SSH traffic:


OK. So it seems we have a problem, but what is the cause?

We typically use SSH to monitor various Unix servers in the environment. One of the administrators has a bunch of X sessions that display performance statistics from a number of servers at the hosting facility. This X session traffic is tunneled over SSH. He told me that while the high SSH traffic issue was happening, one of his X session graphs was not refreshing properly. Since he is the main administrator of the servers and had about twelve X sessions open, I suspect that a bug in the X session software is making one or more of his monitoring sessions create this spike in traffic.

At this point, he still has his X sessions open, and we will wait until the next spike in SSH traffic to try to determine whether this admin's X sessions are the culprit. At that time, we will probably shut down his sessions to see if that fixes the problem. I will update the blog next week to let you know how it goes.

Anyway, I'm glad that NetQoS helped us troubleshoot the problem. It gives us some good insight into network traffic by TCP/IP application.
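For anyone curious what a "by protocol and port" breakdown boils down to, here is a minimal Python sketch of the idea. It is not how NetQoS works internally, and the flow records, byte counts, and function names are all made up for illustration. It simply sums inbound bytes per hour and per TCP port, then reports which protocol dominated each hour.

```python
from collections import defaultdict

# Hypothetical flow records: (hour of day, TCP port, inbound bytes).
# These numbers are invented for illustration; they are NOT taken
# from the NetQoS reports discussed in this post.
FLOWS = [
    (14, 3389, 40_000_000),
    (14, 22, 5_000_000),
    (15, 3389, 35_000_000),
    (15, 22, 120_000_000),
    (15, 80, 10_000_000),
    (16, 22, 150_000_000),
    (16, 3389, 20_000_000),
]

# Well-known port numbers for the protocols mentioned in the post.
PORT_NAMES = {22: "SSH", 80: "HTTP", 3389: "RDP"}


def bytes_by_hour_and_port(flows):
    """Sum bytes into a nested mapping: hour -> port -> total bytes."""
    totals = defaultdict(lambda: defaultdict(int))
    for hour, port, nbytes in flows:
        totals[hour][port] += nbytes
    return totals


def busiest_protocol_per_hour(totals):
    """Return hour -> name of the protocol that moved the most bytes."""
    result = {}
    for hour, per_port in totals.items():
        port = max(per_port, key=per_port.get)
        result[hour] = PORT_NAMES.get(port, str(port))
    return result


if __name__ == "__main__":
    totals = bytes_by_hour_and_port(FLOWS)
    for hour, name in sorted(busiest_protocol_per_hour(totals).items()):
        print(f"{hour}:00 busiest protocol: {name}")
```

With sample data shaped like this, RDP leads before 15:00 and SSH takes over afterwards, which is the same kind of pattern the stacked graph made visible at a glance.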

UPDATE: another great tool for diagnosing your local computer's network traffic is IPTraf (http://iptraf.seul.org/). I will review this software, which runs under Linux, and give a basic description of how to use it in an upcoming post.

cheers!
