It's been a while since I've posted anything and I apologise for this. Life has been busy and fun both professionally and personally! I hope to make up for this now with a short article on a tool that is truly epic.
Back Story
Picture the scene: you're a Network Engineer sitting at your desk happily calculating broadcast addresses for arbitrary subnets for fun when a priority one call comes in!
"The CAD/CAM file transfers from the London file servers to the Bristol office are slow, and we begin a production run of two thousand 1968 Raleigh Chopper replicas tomorrow morning!! Fix it now, please please."
You panic. For one thing you know that there is not a network diagram that expresses that part of the network, but also it's the end of your shift in 20 minutes, and the footy's on. You've gotta solve this, and fast!
You start by sketching out what you believe the topology should be and realise that there are at least seven hops between the file servers and Bristol hosts, and a good few redundant paths. You let out a sigh of such magnitude that even the local DBA on shift two desks down pokes his head up and shoots a pitying look your way. And you're not even sure this topology is accurate.
After procrastinating, and weighing up whether or not you could make it to the pub in time for kick off even if you did hit the building power isolator, you begin the arduous task of using traceroute to map out the actual topology.
Now you're left with a list of IP addresses that make up the devices that the packets cross as they make their journey from the London office to the Bristol office. The list is long. Carefully you open a terminal window for each IP address and ping each one, hoping to see some packet loss from a device that would point to a cause.
As you stare at all 21 windows, each pinging away, you notice that the footy started 15 minutes ago. You take your disappointment out on your mouse.
But then you notice... "request timed out" intermittently from one of the windows! Yes, you've found your device that's playing up! One of the routers in the path is dropping packets intermittently - probably has a knackered interface. No problem, you log onto the router and down the link. OSPF takes care of the rest and the Bristol office can now operate again at full speed! Not only can they download their CAD/CAM documents for tomorrow's manufacturing run, but the users can update their Facebook too! You are a hero, but you're at least two beers behind and it'll be half time by the time you get to the pub. Not good. If only there was a better way...
The Solution
Matt's Trace Route (mtr) is a wonderful utility available for Windows and Unix which combines the network mapping abilities of traceroute and the packet loss and round trip time testing capabilities of 'ping'. It does nothing new as such, but what's clever is that it does two things together and represents the data clearly to the user.
In short, it carries out a traceroute to determine topology, and spews forth ICMP packets to determine packet loss and round trip time.
Let's take a look!
> sudo /usr/sbin/mtr 192.168.104.45 -n
Standard Display Mode
Matt's traceroute [v0.54]
bristol_host.example.com Wed Aug 19 21:15:14 2009
Keys: D - Display mode R - Restart statistics Q - Quit
Packets Pings
Hostname %Loss Rcv Snt Last Best Avg Worst
1. 192.168.100.254 0% 10 10 0 0 0 0
2. 192.168.101.6 0% 10 10 0 0 0 0
3. 192.168.102.45 0% 10 10 0 0 0 0
4. 192.168.103.45 0% 10 10 0 0 0 0
Graphical Display Mode
Matt's traceroute [v0.54]
bristol_host.example.com Wed Aug 19 21:47:42 2009
Keys: D - Display mode R - Restart statistics Q - Quit
Hostname Last 60 pings
1. 192.168.100.254 ........................................
2. 192.168.101.6 ..........................2............
3. 192.168.102.45 ......>...................3............
4. 192.168.103.45 ..........3............................
Scale: .:0 ms 1:0 ms 2:1 ms 3:2 ms a:4 ms b:7 ms c:11 ms
You start the program (as root) with the command 'mtr ipaddress'. It then maps the route to the target host, and pings every hop along the route. It shows best, worst and average round trip time as well as packet loss. (This is the part that gets you to the footy on time.) In this way, you can easily see which router is losing or delaying packets.
There is also a graphical display mode that gives you a rolling time line of 'dots' moving from right to left. Each 'dot' represents an echo request/reply between the hop and the program, with different numbers and letters representing different round trip times. A greater-than sign ('>') indicates a lost packet.
You can see from the above display that the route is pretty good. This truly is a tool of epic usefulness in any environment!
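To make the columns in the standard display a bit more concrete, here is a minimal Python sketch of how the per-hop figures could be derived from a list of probe results. This is not mtr's actual code: the names HopStats and summarise are made up for illustration, and I'm assuming that a probe which got no answer is recorded as None.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class HopStats:
    """Hypothetical container mirroring the columns of the standard display."""
    sent: int
    received: int
    loss_percent: float
    last: Optional[float]
    best: Optional[float]
    avg: Optional[float]
    worst: Optional[float]


def summarise(rtts_ms: Sequence[Optional[float]]) -> HopStats:
    """Summarise the probes sent to one hop.

    Each entry is a round trip time in milliseconds, or None when no
    answer came back for that probe (assumed convention for this sketch).
    """
    answered = [rtt for rtt in rtts_ms if rtt is not None]
    sent = len(rtts_ms)
    received = len(answered)
    loss = 100.0 * (sent - received) / sent if sent else 0.0
    if not answered:
        # Nothing came back, so there are no timings to report.
        return HopStats(sent, received, loss, None, None, None, None)
    return HopStats(
        sent=sent,
        received=received,
        loss_percent=loss,
        last=answered[-1],
        best=min(answered),
        avg=sum(answered) / received,
        worst=max(answered),
    )


if __name__ == "__main__":
    # Four probes to one hop, the second of which went unanswered.
    print(summarise([1.0, None, 3.0, 2.0]))
```

A hop that shows a non-zero loss figure here is the one you would have been hunting for across those 21 terminal windows.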
How It Works
Once started, mtr will enumerate the path to the host by sending a number of ICMP echo requests (with that host's IP address as the destination). The first packet will have a TTL of 1, the second a TTL of 2, and so on. The software then listens for ICMP time exceeded messages, which will be sent back to the host running mtr by each router along the path that decrements the TTL to 0. In this way, each subsequent ICMP echo request packet triggers a time exceeded packet from a router that is an extra hop away.
Packet where TTL = 1 triggers an ICMP time exceeded from the first hop
Packet where TTL = 2 triggers an ICMP time exceeded from the second hop
And so on.....
The program stops sending ICMP echo requests with incrementing TTLs when the target host sends an ICMP echo reply: it now knows that it has the end to end path across the IP network.
At the very same time as mapping out the network, the program is using the time delta between echo request & time exceeded (or echo reply in the case of the target host) to calculate round trip time.
It's also on the look out for time exceeded packets that it expects to receive but which do not arrive. If they don't arrive this is indicative of a loss of the triggering echo request or the reply time exceeded. It's this mechanism that gives us our packet loss %.
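The TTL walk described above can be modelled without touching a real network. The Python sketch below is only a simulation under my own assumptions: PATH is a hypothetical list of hops (borrowed from the display earlier), and probe and discover_path are illustrative names, not anything from mtr itself. Sending real ICMP packets would need raw sockets and root privileges, which is left out here.

```python
from typing import List, Tuple

# Hypothetical path: three routers followed by the target host.
PATH = ["192.168.100.254", "192.168.101.6", "192.168.102.45", "192.168.103.45"]


def probe(path: List[str], ttl: int) -> Tuple[str, str]:
    """Model one echo request sent with the given TTL.

    Every router on the way decrements the TTL. The router that takes it
    to 0 answers with 'time exceeded'. If the packet survives as far as
    the last entry (the target), the target answers with 'echo reply'.
    """
    if not path or ttl < 1:
        raise ValueError("need a non-empty path and a TTL of at least 1")
    target_index = len(path) - 1
    for index, address in enumerate(path):
        if index == target_index:
            return address, "echo reply"
        ttl -= 1
        if ttl == 0:
            return address, "time exceeded"
    raise AssertionError("unreachable: the loop always returns")


def discover_path(path: List[str], max_ttl: int = 30) -> List[str]:
    """Send probes with TTL 1, 2, 3... until the target itself replies."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        address, kind = probe(path, ttl)
        hops.append(address)
        if kind == "echo reply":
            break
    return hops


if __name__ == "__main__":
    for number, hop in enumerate(discover_path(PATH), start=1):
        print(f"{number}. {hop}")
```

Note that any TTL at or above the hop count of the target simply produces an echo reply, which is why the program can stop incrementing at that point.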
We can see this in more detail in a tcpdump output:
22:05:38.373299 IP (ttl 1) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 0
22:05:38.373530 IP (ttl 64) 192.168.101.254 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:38.473280 IP (ttl 2) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 256
22:05:38.473466 IP (ttl 63) 192.168.101.6 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:38.574223 IP (ttl 3) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 512
22:05:38.574416 IP (ttl 62) 192.168.102.45 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:38.675275 IP (ttl 4) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 768
22:05:38.675584 IP (ttl 62) 192.168.103.254 > 192.168.100.51: icmp 44: echo reply seq 768
22:05:38.776228 IP (ttl 5) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 1024
22:05:38.776498 IP (ttl 62) 192.168.103.254 > 192.168.100.51: icmp 44: echo reply seq 1024
22:05:38.977330 IP (ttl 1) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 1280
22:05:38.977567 IP (ttl 64) 192.168.101.254 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:39.178229 IP (ttl 2) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 1536
22:05:39.178437 IP (ttl 63) 192.168.101.6 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:39.380149 IP (ttl 3) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 1792
22:05:39.380358 IP (ttl 62) 192.168.102.45 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:39.581098 IP (ttl 4) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 2048
22:05:39.581360 IP (ttl 62) 192.168.103.254 > 192.168.100.51: icmp 44: echo reply seq 2048
22:05:39.832172 IP (ttl 1) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 2304
22:05:39.832412 IP (ttl 64) 192.168.101.254 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:40.083052 IP (ttl 2) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 2560
22:05:40.083242 IP (ttl 63) 192.168.101.6 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:40.334965 IP (ttl 3) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 2816
22:05:40.335195 IP (ttl 62) 192.168.102.45 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:40.585919 IP (ttl 4) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 3072
22:05:40.586185 IP (ttl 62) 192.168.103.254 > 192.168.100.51: icmp 44: echo reply seq 3072
22:05:40.836984 IP (ttl 1) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 3328
22:05:40.837226 IP (ttl 64) 192.168.101.254 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:41.043594 IP (ttl 2) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 3584
22:05:41.043802 IP (ttl 63) 192.168.101.6 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:41.294772 IP (ttl 3) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 3840
22:05:41.294979 IP (ttl 62) 192.168.102.45 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:41.546790 IP (ttl 4) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 4096
22:05:41.547087 IP (ttl 62) 192.168.103.254 > 192.168.100.51: icmp 44: echo reply seq 4096
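If you want to check the round trip times by hand, a capture like the one above can be paired up with a few lines of Python. This is a rough sketch that assumes the exact line layout shown in this capture, that every request is answered by the very next captured line, and that the capture does not run across midnight. The names LINE, SAMPLE and rtts_by_ttl are my own.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches the line layout of the capture shown above (assumed, not general).
LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) IP \(ttl (?P<ttl>\d+)\) "
    r"(?P<src>\S+) > (?P<dst>[^:]+): icmp \d+: (?P<what>.+)$"
)

# The first two request/response pairs from the capture.
SAMPLE = """\
22:05:38.373299 IP (ttl 1) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 0
22:05:38.373530 IP (ttl 64) 192.168.101.254 > 192.168.100.51: icmp 36: time exceeded in-transit
22:05:38.473280 IP (ttl 2) 192.168.100.51 > 192.168.103.254: icmp 44: echo request seq 256
22:05:38.473466 IP (ttl 63) 192.168.101.6 > 192.168.100.51: icmp 36: time exceeded in-transit
"""


def rtts_by_ttl(lines):
    """Return {request TTL: [(responder address, round trip time in ms), ...]}."""
    results = defaultdict(list)
    pending = None  # (timestamp, ttl) of the most recent unanswered request
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not in the expected layout
        when = datetime.strptime(match["time"], "%H:%M:%S.%f")
        if match["what"].startswith("echo request"):
            pending = (when, int(match["ttl"]))
        elif pending is not None:
            sent_at, ttl = pending
            # Whole microseconds first, to keep the arithmetic exact.
            micros = (when - sent_at) // timedelta(microseconds=1)
            results[ttl].append((match["src"], micros / 1000))
            pending = None
    return dict(results)


if __name__ == "__main__":
    for ttl, answers in sorted(rtts_by_ttl(SAMPLE.splitlines()).items()):
        print(ttl, answers)
```

A request with no following answer simply gets overwritten by the next request in this sketch, which is a crude stand-in for the way a missing reply is counted as loss.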
Issues with Network Mapping using TTL values
It's also worth mentioning (isn't it obvious?) that for this program to function on your network, ACLs/Firewalls/Packet Filters on layer three devices along the path to the target host must not drop ICMP echo requests. Equally, those devices must return ICMP time exceeded when they reduce the TTL to 0 (this should almost always be the case: only the rudest of layer 3 devices refuse to send ICMP time exceeded packets).
Rule of thumb: if you can ping the host, you can use mtr.
Other Points?
Well, I should state that mtr works differently to the Unix traceroute which (by default) uses UDP packets with incrementing TTLs to map the topology. As explained, mtr uses echo requests.
And finally... if your network (or someone else's) blocks ICMP and UDP packets for network mapping, then you should try 'tcptraceroute'. It's traceroute but uses TCP packets, so the rule of thumb becomes 'if you can telnet/ssh/insertanyothertcpservicehere to the host, you can trace to it'. Just watch out for Proxy firewalls :)
Thanks
Thanks to two terrific network engineers who (1) introduced me to these tools and (2) showed me a cool problem that got me thinking. They know who they are. Well maybe they don't. But I'll tell them sooner or later.

