Question : Troubleshooting UDP connectivity with VoIP in a campus network (a.k.a. UDP packets going AWOL)

Hello Experts,

I have a challenge. How can one troubleshoot a connectionless protocol? Let me explain. I am in the process of installing a Nortel VoIP solution into our campus network. Since day one we have had issues with phones rebooting and not maintaining a constant connection with the signaling servers. After weeks of Ethereal traces and lots of back and forth, Nortel told me today that a packet, or a series of packets, is leaving the signaling server and not making it to the phone. The signaling server does not receive the expected ACK and retransmits. When the ACK is not received a second time, the signaling server assumes the phone has taken a vacation and no longer sends "watchdog reset" messages to the phone. When 200 seconds have passed and the phone has not received a watchdog timer reset packet, the phone takes it upon itself to reboot.

Now for the fun part. All of this traffic looks like UDP, and Ethereal rightly classifies it as such. But the phones and the signaling servers know (I'm guessing it is written into the firmware) that after a set of instructions is received, a UDP datagram is sent back saying the packets were received, so this is in fact no ordinary UDP. It's like a connection-oriented protocol without the nasty, expensive overhead. Nortel calls this Reliable UDP, or RUDP. RUDP can only be told apart from plain UDP with a special dissector for Ethereal, which Nortel will not release. Because we have a sniffer trace from both sides of a conversation (one from the phone and one from the signaling server), and because they (Nortel) see the packets in the trace from the signaling server and not in the trace from the phone, they are saying that the network is dropping these packets. Now, I only saw UDP, but when Nortel applied their special x-ray packet vision they saw button messages, time sets, display messages, etc. And we're not talking one packet here or there; we are talking 15-20 packets in a row, over a period of 2-3 minutes, that are not making it from one side of the network to the other. They have placed it on my shoulders to find out what is causing the packets to disappear. Thank you, Nortel.
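The two-sided trace comparison described above can be approximated without the RUDP dissector. What follows is only a rough sketch, assuming each capture has already been exported to a list of per-packet identifiers (for example the IP identification field plus a hash of the UDP payload; that choice, and the name `missing_runs`, are assumptions for illustration). It reports runs of consecutive packets that appear in the sender's trace but not in the receiver's, which is the pattern Nortel described.

```python
def missing_runs(sent_ids, received_ids):
    """Find runs of consecutive sent packets that never show up at the receiver.

    sent_ids:     packet identifiers in the order they left the sender.
    received_ids: packet identifiers seen in the receiver-side trace.
    Returns a list of (start_index, run_length) tuples indexed into sent_ids.
    """
    received = set(received_ids)
    runs = []
    start = None  # index where the current run of missing packets began
    for i, pkt_id in enumerate(sent_ids):
        if pkt_id not in received:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # trace ended while packets were still missing
        runs.append((start, len(sent_ids) - start))
    return runs
```

Long runs that line up in time with a phone reboot would support the "network is dropping packets" theory. Isolated single misses are more likely capture noise.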

Now, before anyone jumps to any conclusions, here are some fast facts. 72% of the phones reboot at some point. Some never seem to reset. More on that later. 40% never make it more than one day without resetting. Of the phones that never reboot, at least one exists in just about every subnet on campus. Every subnet has at least one phone that will reboot. With the exception of a campus-wide power failure, no more than three phones have ever rebooted at almost the same time (within 3 seconds). The longest period without a phone resetting somewhere on campus is 9 minutes. Network counters are clean on the ports: no jabber, giants, collisions, framing errors, CRC errors, alignment errors, discards, drops, overruns, underruns, etc. There are no ACLs or policies that filter any traffic. The network is now 10/100 Mbps switched at every edge port, with 1000 Mbps links between the edge and distribution layers. There are no more than 162 ports serviced by any one 1000 Mbps link. The average number of ports serviced per 1000 Mbps link is 48. Up until this point, we felt pretty good about our network.

That's the short story, believe me it is. Because it is obvious that the great majority of the RUDP packets are getting across the network and getting responses, I can only assume that the breakdown is intermittent. Meaning, I can run an nmap scan of the ports in question and they will be open as expected. I predict that I would only see a failure if I were constantly trying to "ping" the port. Also, to make the test behave more like the phone and the signaling server, any testing I do would have to set the DiffServ code point (DSCP) just as voice and signaling packets are marked. Hey, if a network that rarely sees 3% utilization on any link is ever fully utilized, I want the voice/signaling traffic prioritized.
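Marking test traffic with a DiffServ code point can be done from an ordinary UDP socket. This is a minimal sketch, assuming a platform where the standard `IP_TOS` socket option is honored (some operating systems ignore or restrict it). The helper names are made up for illustration, and the DSCP value to use should be read from the phone's own packets in the trace rather than taken from this example.

```python
import socket

def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper six bits of the IP TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp << 2  # the low two bits are reserved for ECN

def make_marked_udp_socket(dscp: int) -> socket.socket:
    """Create a UDP socket whose outgoing packets request the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

For example, Expedited Forwarding is DSCP 46, which becomes a TOS byte of 184 (0xB8). It is worth confirming in a capture that the mark survives every hop, since a switch that rewrites or ignores it would treat the test traffic differently from the phones.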

So now the challenge. I have searched high and low to find a test tool or a methodology that I could use to troubleshoot my UDP issue. Short of putting a sniffer on every link and watching for one packet to make its way across the network, I have come up empty, and while that sounds like fun, it is not something I am looking forward to. Time-syncing multiple sniffers would, I think, be the biggest problem. I am hoping that someone out there might know of a tool, or possess the knowledge to create one, that would send a UDP packet to a host and log that the packet made it, and keep doing this until I tell it to stop. I have an idea for such a tool, but I really don't want to start trying to learn how to program! If anyone has a methodology to test this, I'm open to ideas as well.
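A tool of the kind described above could look roughly like the following. This is a sketch, not a finished utility: the packet layout (a sequence number plus a send timestamp), the function names `run_echo` and `probe`, and the idea of running a small echo responder at the far end are all assumptions for illustration. It does not speak RUDP; it only shows whether plain, DSCP-marked UDP datagrams make the round trip.

```python
import socket
import struct
import threading
import time

HEADER = struct.Struct("!IQ")  # sequence number, send time in nanoseconds

def run_echo(sock, stop):
    """Echo every datagram back to its sender until the stop event is set."""
    sock.settimeout(0.2)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            continue
        sock.sendto(data, addr)

def probe(host, port, count=None, interval=1.0, timeout=1.0, dscp=0, log=print):
    """Send numbered probes and log each reply or loss.

    count=None keeps going until Ctrl-C. Returns (sent, lost).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)  # DSCP mark
    sock.settimeout(timeout)
    sent = lost = seq = 0
    try:
        while count is None or seq < count:
            sock.sendto(HEADER.pack(seq, time.monotonic_ns()), (host, port))
            sent += 1
            stamp = time.strftime("%H:%M:%S")
            try:
                while True:
                    data, _ = sock.recvfrom(2048)
                    if len(data) < HEADER.size:
                        continue  # not one of our probes
                    rseq, t0 = HEADER.unpack(data[:HEADER.size])
                    if rseq == seq:
                        break  # late replies to earlier probes are skipped
                rtt_ms = (time.monotonic_ns() - t0) / 1e6
                log(f"{stamp} seq={seq} rtt={rtt_ms:.2f}ms")
            except (socket.timeout, ConnectionResetError):
                # No echo within the timeout (or an ICMP error on some OSes).
                lost += 1
                log(f"{stamp} seq={seq} LOST")
            seq += 1
            time.sleep(interval)
    except KeyboardInterrupt:
        pass
    finally:
        sock.close()
    return sent, lost
```

To use it, bind a UDP socket on a machine in the phone's subnet, hand it to `run_echo` together with a `threading.Event()`, and call `probe` from a machine next to the signaling server with that machine's address and port. A logged loss only says the round trip failed, not which direction dropped the packet, so a capture at the echo end is still needed to tell the two apart.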

Due to desperation and difficulty, I am giving maximum points. I only wish I could make it worth more.

Thanks,
Matt

Answer : Troubleshooting UDP connectivity with VoIP in a campus network (a.k.a. UDP packets going AWOL)

PAQed with points refunded (500)

CetusMOD
Community Support Moderator
Random Solutions  
 