traceroute breaks the time-space continuum!
What’s traceroute? #
traceroute is a UNIX tool that attempts to trace packets… Hmm, the man
page explains it beautifully:
This program attempts to trace the route an IP packet would follow to some internet host by launching probe packets with a small ttl (time to live) then listening for an ICMP “time exceeded” reply from a gateway. We start our probes with a ttl of one and increase by one until we get an ICMP “port unreachable” (or TCP reset), which means we got to the “host” […]
The ICMP Time Exceeded messages it refers to are:
| IP version | ICMP type | ICMP code | RFC |
|---|---|---|---|
| 4 (ICMP) | type=11 Time Exceeded Message | code=0 time to live exceeded in transit | RFC792 |
| 6 (ICMPv6) | type=3 Time Exceeded Message | code=0 Hop limit exceeded in transit | RFC4443 |
By default, traceroute uses UDP packets to probe the network, but it can
optionally use ICMP (-I) or TCP (-T) instead.
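To make the mechanism concrete, here is a minimal Python sketch that simulates the probing loop over a made-up path. Nothing is sent on the wire: `simulate_traceroute`, the router names `r1`/`r2` and the `dst` host are hypothetical, and the real tool also sends three probes per TTL and deals with timeouts.

```python
from typing import List, Tuple


def simulate_traceroute(path: List[str], max_hops: int = 30) -> List[Tuple[int, str, str]]:
    """Simulate TTL-limited probing over a known path.

    `path` lists the routers in order; its last entry is the destination host.
    Returns one (ttl, replier, reply_type) tuple per probe.
    """
    if not path:
        raise ValueError("path must contain at least the destination")

    results = []
    for ttl in range(1, max_hops + 1):
        remaining = ttl
        for i, node in enumerate(path):
            if i == len(path) - 1:
                # The probe reached the destination: nothing listens on the
                # probed UDP port, so the host answers "port unreachable".
                replier, reply = node, "port unreachable"
                break
            remaining -= 1  # every router decrements the TTL
            if remaining == 0:
                # TTL expired here: the router drops the probe and replies.
                replier, reply = node, "time exceeded"
                break
        results.append((ttl, replier, reply))
        if reply == "port unreachable":
            break  # we got to the host, stop probing
    return results


for ttl, replier, reply in simulate_traceroute(["r1", "r2", "dst"]):
    print(ttl, replier, reply)
```

Each TTL value reveals exactly one node: the one where the counter hits zero, or the destination itself.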
Tracing to google.com 🗺️ #
OK, let’s do it:
traceroute -n google.com
Results come shortly after:
traceroute to google.com (142.250.200.206), 30 hops max, 60 byte packets
1 192.168.1.1 3.550 ms 3.424 ms 3.346 ms
2 81.46.38.147 20.595 ms 20.536 ms 20.475 ms
3 81.46.34.125 79.685 ms 79.602 ms 79.540 ms
4 * * *
5 * * *
6 * * *
7 81.173.106.65 26.469 ms 15.869 ms 15.664 ms
8 74.125.245.171 17.053 ms 108.170.225.251 14.295 ms 17.687 ms
9 108.170.255.26 27.739 ms 192.178.110.134 16.176 ms 142.251.76.110 13.844 ms
10 108.170.252.173 19.309 ms 64.233.175.63 19.250 ms 108.170.252.173 15.359 ms
11 172.253.65.53 40.944 ms 42.708 ms 142.251.79.237 42.634 ms
12 72.14.234.190 44.364 ms 142.251.254.74 43.244 ms 192.178.85.180 44.738 ms
13 192.178.105.215 43.094 ms 42.461 ms 44.523 ms
14 209.85.243.145 44.422 ms 42.834 ms 209.85.143.123 44.361 ms
15 142.250.200.206 44.262 ms 42.788 ms 40.920 ms
Wait… what’s going on with the RTTs? 🤔 #
Something’s wrong… How come the Round Trip Time (RTT)
to 81.46.34.125, the 3rd hop, is ~4x higher than the one to 108.170.252.173, the 10th hop?
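To compare hops numerically rather than by eye, a line of `traceroute -n` output can be split into per-probe (address, RTT) pairs. The sketch below is a rough parser, not a complete one: `parse_hop_line` and `best_rtt` are names I made up, and annotations such as `!H` that traceroute can append are not handled.

```python
from typing import List, Optional, Tuple

Probe = Tuple[Optional[str], Optional[float]]  # (replying address, RTT in ms)


def parse_hop_line(line: str) -> Tuple[int, List[Probe]]:
    """Parse one line of `traceroute -n` output into (hop number, probes)."""
    tokens = line.split()
    hop = int(tokens[0])
    probes: List[Probe] = []
    current_ip: Optional[str] = None
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok == "*":
            # No reply for this probe.
            probes.append((None, None))
            i += 1
        elif i + 1 < len(tokens) and tokens[i + 1] == "ms":
            # An RTT value: it belongs to the last address seen on the line.
            probes.append((current_ip, float(tok)))
            i += 2
        else:
            # Anything else is taken to be a new replying address.
            current_ip = tok
            i += 1
    return hop, probes


def best_rtt(probes: List[Probe]) -> Optional[float]:
    """Lowest RTT among the probes that got a reply, or None if none did."""
    rtts = [rtt for _, rtt in probes if rtt is not None]
    return min(rtts) if rtts else None


_, hop3 = parse_hop_line("3 81.46.34.125 79.685 ms 79.602 ms 79.540 ms")
_, hop10 = parse_hop_line(
    "10 108.170.252.173 19.309 ms 64.233.175.63 19.250 ms 108.170.252.173 15.359 ms"
)
print(best_rtt(hop3), best_rtt(hop10))
```

Taking the minimum per hop is a design choice: it is the sample least affected by transient delays.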
Is traceroute not reporting the real RTT? #
According to the man page, it should:
Three probes (by default) are sent at each ttl setting and a line is printed showing the ttl, address of the gateway and round trip time of each probe.
Are different probes following different paths? #
Ha! Maybe… let’s see.
When probing using UDP with the default options, traceroute sends packets
from an ephemeral (randomly chosen) UDP source port to a destination port that increases by one with each probe:
23:31:27.128545 IP 192.168.1.25.55532 > 142.250.200.206.33434: UDP, length 32
23:31:27.128578 IP 192.168.1.25.59254 > 142.250.200.206.33435: UDP, length 32
23:31:27.128596 IP 192.168.1.25.50618 > 142.250.200.206.33436: UDP, length 32
If intermediate routers have multiple next hops for the route matching
142.250.200.206, known as Equal Cost MultiPath (ECMP),
that could certainly be an issue.

With ECMP, a router has to pick one of the N possible next hops from the matched route. It bases its decision
on a hash function applied to certain fields of the packet. The most common
configuration is to use the 5-tuple hash (IP src, IP dst, protocol, L4 src port, L4 dst port).
The router does a modulo N operation over the hash to select the index
of the next hop to forward the packet to. This ensures that packets belonging
to the same flow follow the same path towards a destination (in the absence of reconvergences)
and are not reordered. The TCP state machine is no fan of reorderings…

traceroute actually tells us that there is ECMP in parts of the network.
Some hops show more than one IP address, e.g. at hop 8 we see
two different routers, 74.125.245.171 for probe #1, and 108.170.225.251
for probes #2 and #3:
8 74.125.245.171 17.053 ms 108.170.225.251 14.295 ms 17.687 ms
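The hash-and-modulo selection described above can be sketched in a few lines of Python. This is only an illustration: real routers use vendor-specific hash functions implemented in hardware, so `zlib.crc32`, the `ecmp_next_hop` helper and the choice of N = 4 are assumptions of mine. The flows are the three UDP probes captured earlier (protocol 17 is UDP).

```python
import zlib
from typing import Tuple

# (IP src, IP dst, protocol, L4 src port, L4 dst port)
FiveTuple = Tuple[str, str, int, int, int]


def ecmp_next_hop(flow: FiveTuple, n_next_hops: int) -> int:
    """Pick a next-hop index by hashing the 5-tuple and taking it modulo N.

    CRC32 stands in for whatever hash a real router would use.
    """
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_next_hops


UDP_PROBES = [
    ("192.168.1.25", "142.250.200.206", 17, 55532, 33434),
    ("192.168.1.25", "142.250.200.206", 17, 59254, 33435),
    ("192.168.1.25", "142.250.200.206", 17, 50618, 33436),
]

N = 4  # hypothetical number of equal-cost next hops
for probe in UDP_PROBES:
    # Different ports mean a different hash input, so probes may diverge.
    print(probe[3], probe[4], "->", ecmp_next_hop(probe, N))

# The same 5-tuple always maps to the same index: a flow sticks to one path.
assert ecmp_next_hop(UDP_PROBES[0], N) == ecmp_next_hop(UDP_PROBES[0], N)
```

Because both ports change between probes, each probe is effectively a different flow from the router's point of view.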
Ok, let’s use TCP probing instead, which allows us to fix both the destination and the source port:
sudo traceroute -T --sport=35000 -n 142.250.200.206
23:53:10.410524 IP 192.168.1.25.35000 > 142.250.200.206.80: Flags [S], [...]
23:53:10.415993 IP 192.168.1.25.35000 > 142.250.200.206.80: Flags [S], [...]
23:53:10.418776 IP 192.168.1.25.35000 > 142.250.200.206.80: Flags [S], [...]
The results:
traceroute to 142.250.200.206 (142.250.200.206), 30 hops max, 60 byte packets
1 192.168.1.1 5.419 ms 2.732 ms 2.773 ms
2 81.46.38.147 43.087 ms 9.349 ms 24.899 ms
3 81.46.34.125 15.319 ms 15.933 ms 14.927 ms
4 * * *
5 * * *
6 * * *
7 81.173.106.65 13.715 ms 37.831 ms 14.513 ms
8 142.251.231.147 14.288 ms 13.896 ms 14.388 ms
9 142.250.56.72 16.127 ms 16.108 ms 14.951 ms
10 209.85.242.91 15.183 ms 14.902 ms 14.363 ms
11 172.253.65.53 70.040 ms 43.351 ms 43.109 ms
12 * 142.251.254.74 43.128 ms 40.690 ms
13 192.178.105.161 40.901 ms 40.438 ms 40.415 ms
14 209.85.143.123 40.875 ms 40.392 ms 40.481 ms
15 142.250.200.206 40.234 ms 44.056 ms 41.273 ms
As you can see, the packet follows a different path. This (re)confirms that there is ECMP in the network (or that it has reconverged).
But the RTT to the 8th hop is still lower than, for instance, the RTT to the 3rd hop! Moreover,
the RTT to 81.46.34.125 is now significantly lower than in the previous traceroute.
Asymmetric routing? #
IP routing (in general) is not symmetric. The path from A → B at a given point
in time might not be the same as B → A. Could it be that the path back from
81.46.34.125 specifically was different and way slower than the rest?
It could, but observations don’t seem to point to that. Doing a series of traceroutes
to the same destination using UDP gives us a wide range of RTTs to 81.46.34.125,
from “as low” as 24 ms all the way to… 144 ms! If the problem were a slower path on the way
back, RTTs should be consistently high and reasonably stable.
Are we missing something fundamental here…? #
There is something else that is bothering me here: RTTs don’t seem to correlate with the distance packets travel.
I am based in 📍Barcelona (Catalonia, Spain).
The IP 108.170.252.173, the 10th hop of the very first traceroute, is
announced by AS15169 (Google). I couldn’t reliably
geolocate this IP (3 different websites gave me east, center and
west USA…), but assuming for a moment it is in the center of the USA, in 📍Wichita (Kansas, USA),
one of the locations given, that is some 8,000 km in a straight line from Barcelona.
The propagation time, neglecting (many) other factors, would be:
$$ \mathrm{RTT} = \frac{2 \times 8{,}000\ \mathrm{km}}{299{,}792\ \mathrm{km/s}} \approx 53.4\ \mathrm{ms} $$
(OK, that IP is probably closer to Europe than that 😅)
The 3rd hop, 81.46.34.125, which is reasonably well geolocated to
the beautiful city of 📍Madrid (Spain),
some 600 km away from Barcelona (or ~4 ms), gives almost 1.5x the theoretical RTT to the
middle of the USA.
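The two back-of-the-envelope figures above can be reproduced with a tiny helper. `propagation_rtt_ms` is a hypothetical name. The fiber speed of roughly 200,000 km/s (about two thirds of the speed of light in vacuum) is a commonly quoted approximation that I am adding here, not a figure from the measurements in this post.

```python
C_VACUUM_KM_S = 299_792  # speed of light in vacuum, km/s
C_FIBER_KM_S = 200_000   # rough approximation for optical fiber (assumption)


def propagation_rtt_ms(distance_km: float, speed_km_s: float = C_VACUUM_KM_S) -> float:
    """Round-trip propagation time in ms over a straight line, ignoring everything else."""
    return 2 * distance_km / speed_km_s * 1000


for place, km in [("Wichita", 8_000), ("Madrid", 600)]:
    vacuum = propagation_rtt_ms(km)
    fiber = propagation_rtt_ms(km, C_FIBER_KM_S)
    print(f"{place}: {vacuum:.1f} ms in vacuum, {fiber:.1f} ms in fiber")
```

Either way, the lower bound for Madrid stays in the single-digit milliseconds, far below the ~80 ms reported for the 3rd hop.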
We seem to be measuring something more than purely the RTT, something that
sets traceroute RTT measurements way way off.
And the answer is… #
We are also measuring the time the router takes to generate and send the ICMP packet back (processing time). And, due to the hardware architecture of many routers, this is usually NOT done in the Forwarding Plane, but rather in the Control Plane:
Routers: Control Plane protections and their implications
TTL=0 packets trigger an exception
in the Forwarding Plane, and the offending packet is sent to the Control Plane.
Packets destined to the Control Plane are heavily rate-limited and QoS enforced,
and can even be lost. In fact, this is likely what happened to the first probe
of the 12th hop using TCP:
12 * 142.251.254.74 43.128 ms 40.690 ms
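A common way to model this kind of rate limiting is a token bucket. The sketch below is a toy model under stated assumptions: the `TokenBucket` class, the rate of 1 packet per second and the burst of 2 are invented for illustration and do not describe any particular router, where this is typically enforced in hardware.

```python
class TokenBucket:
    """Toy rate limiter for packets punted to the Control Plane."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate = rate_per_s      # tokens added per second
        self.burst = burst          # maximum number of stored tokens
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous packet, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a packet arriving at time `now` gets through."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # dropped: traceroute would print '*'


bucket = TokenBucket(rate_per_s=1, burst=2)
# Three expiring probes arrive 1 ms apart: the third one finds the bucket empty.
print([bucket.allow(t) for t in (0.000, 0.001, 0.002)])  # [True, True, False]
```

A dropped probe never produces an ICMP reply, which is exactly what a lone `*` among otherwise healthy probes looks like.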
At the same time, the Control Plane might postpone the low-priority task of
generating the ICMP time exceeded packet if it has more urgent things to do.
So what we are seeing here is that the router 81.46.34.125 is much slower
at generating ICMP messages than the routers at hops 7 to 10 (hops 4 to 6 never replied).
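One rough way to put a number on that slowness: a probe that reaches a later hop has usually crossed the earlier ones too, so the lowest RTT seen at any later hop is a plausible ceiling for the network RTT to an earlier hop. The sketch below applies that idea to the best RTT per hop from the first (UDP) traceroute. It assumes the probes share the early part of the path and that return paths are comparable, which ECMP and asymmetric routing can break, and `excess_over_later_hops` is a hypothetical helper.

```python
from typing import Dict

# Lowest RTT (ms) per responding hop, taken from the first traceroute above.
MIN_RTT_MS: Dict[int, float] = {
    1: 3.346, 2: 20.475, 3: 79.540, 7: 15.664, 8: 14.295, 9: 13.844,
    10: 15.359, 11: 40.944, 12: 43.244, 13: 42.461, 14: 42.834, 15: 40.920,
}


def excess_over_later_hops(min_rtt_ms: Dict[int, float]) -> Dict[int, float]:
    """For each hop, how much its RTT exceeds the lowest RTT at that hop or beyond."""
    excess: Dict[int, float] = {}
    running_min = float("inf")
    for hop in sorted(min_rtt_ms, reverse=True):  # walk from the last hop back
        running_min = min(running_min, min_rtt_ms[hop])
        excess[hop] = min_rtt_ms[hop] - running_min
    return excess


for hop, extra in sorted(excess_over_later_hops(MIN_RTT_MS).items()):
    if extra > 0:
        print(f"hop {hop}: ~{extra:.1f} ms not explained by the network path")
```

Under those assumptions, the bulk of the RTT reported for hop 3 is time spent inside the router rather than on the wire.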
Case closed! 🥳
Bonus: is it even possible to measure RTTs without involving the Control Plane? #
Yes, sometimes.
A good example is Bidirectional Forwarding Detection (BFD). Under a set of very specific conditions, BFD can reliably estimate the RTT at the dataplane level with an L3-adjacent neighbour by using the Echo mode.
In this mode, the router injects BFD packets into the interfaces where it expects BFD peers, with the source and destination IPs set to its own IP address. This forces the directly connected peer router to hairpin the traffic back to the ingress interface. The packet is sent back directly from the Forwarding Plane, which proves that the link is working. In addition, a timestamp can be added when sending the packet, so that the RTT can be reliably calculated when the packet is echoed, without the need to keep clocks in tight synchronization between devices.
The echo mode only works for directly connected peers, though (one IP hop away).
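The timestamp trick is easy to sketch: since the sender is also the receiver, both readings come from the same local clock. The code below is not the BFD wire format. The 8-byte payload layout and the `build_echo_payload` / `rtt_from_echo` helpers are made up for illustration.

```python
import struct
import time

_FMT = "!Q"  # one unsigned 64-bit send timestamp (ns), network byte order


def build_echo_payload(send_ns: int) -> bytes:
    """Embed the local send time in the packet that the peer will hairpin back."""
    return struct.pack(_FMT, send_ns)


def rtt_from_echo(payload: bytes, recv_ns: int) -> int:
    """RTT in ns: local receive time minus the send time carried in the payload."""
    (send_ns,) = struct.unpack(_FMT, payload[:8])
    return recv_ns - send_ns


# Both timestamps come from the same monotonic clock on the sending router,
# so no clock synchronization with the peer is needed.
payload = build_echo_payload(time.monotonic_ns())
echoed = payload  # stand-in for the packet coming back from the peer
print(rtt_from_echo(echoed, time.monotonic_ns()), "ns")
```

The peer never has to parse or understand the payload; it only forwards the packet back.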
Next #
That… was fun! I hope you liked it too. 😀