2023-09-22

measure downtime durations via icmp

Construction site at the Sendlinger Tor in Munich.

How long does it take for the failover mechanism to fail-over? How long does it take for my server to be up again?

I recently had to measure downtime durations. ping lets me see when packet loss starts and stops, but it does not aggregate the results. mtr, on the other hand, aggregates the replies, but not into the metric I was looking for.

So I wrote my own tool, called downtime. Besides the plain measurement logic, I tried a few new things, such as external lib interfaces and mocks, which are briefly covered at the end.

How it works
Example
Implementation design

How it works

Ping the host via icmp using a customized timeout and a small interval to pinpoint the start and end timestamp of the downtime. This can then be used to calculate the duration.

The timeout should be as low as possible, while still tolerating random RTT spikes so that they don't falsify the measurement. The RTT to my service never exceeded 25ms, so a 50ms timeout should be good to go. The interval should also be small, because it defines the error of the measured downtime: the start and end of the outage are only noticed at the next probe.

The icmp logic is based on my colleague's Go rewrite of mtr, which provides the most high-level icmp interface I could find. It lets me use raw icmp packets without caring about low-level details, like encoding the PID into the request body to fish the corresponding replies out of the ocean of icmp traffic.
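As a rough illustration of the two-phase measurement described above, here is a minimal Go sketch. It is not the code of downtime itself: Probe, waitFor and measureDowntime are hypothetical names, and Probe stands in for whatever sends one icmp echo request and reports whether a reply arrived within the timeout.

```go
package main

import "time"

// Probe is a hypothetical stand-in for one icmp echo request.
// It returns true if a reply arrived within the configured timeout.
type Probe func() bool

// waitFor calls probe once per interval until it returns want,
// and returns the time at which that result was observed.
func waitFor(probe Probe, want bool, interval time.Duration) time.Time {
	for {
		if probe() == want {
			return time.Now()
		}
		time.Sleep(interval)
	}
}

// measureDowntime blocks until the first probe fails (start of the
// downtime), then until the first probe succeeds again (end of the
// downtime), and returns both timestamps and their difference.
func measureDowntime(probe Probe, interval time.Duration) (start, end time.Time, d time.Duration) {
	start = waitFor(probe, false, interval)
	end = waitFor(probe, true, interval)
	return start, end, end.Sub(start)
}
```

Because both edges are only detected when a probe runs, a smaller interval gives a tighter measurement, at the cost of more icmp traffic.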

Example

To answer the second of the two initial questions, “How long does it take for my server to be up again?”:

start downtime with the server IP
restart the server
wait until the server is up again and downtime has terminated

For my cloud server with no further setup this took 17s:

$ downtime --target 23.88.32.x
start: 2023-09-22 12:27:31.751644064 +0200 CEST m=+11.151409405
end: 2023-09-22 12:27:49.278929522 +0200 CEST m=+28.678694794
duration: 17.527285389s

Implementation design

Aside from the measurement logic, I tried a few things I’ve learned over the years to improve extensibility and maintainability.

External lib interface

Define an interface for an external lib that contains only the functions you need, e.g. icmp.go for my dependency on tonobo’s SendICMP. While being a bit verbose at first, it has two advantages:

The dependency on the external lib is documented in one place. Usually you’d need to scan the whole code base for the external lib to check which functions are in use.
Testing code that relies on complex external functions gets a lot easier. E.g. thanks to mocks, my unit tests don’t need to do real icmp stuff.

Mocks

Interfaces can easily be mocked using a lib like mockery, e.g. the icmp mock for my icmp interface. As mentioned previously, the mock can be used to test the lib without calling the external library. The goal of testing should be to test your own code, not the external code.
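To make the interface and mock idea concrete, here is a small Go sketch. Everything in it is an assumption for illustration: Pinger, its Ping method and its signature are hypothetical and do not reflect the real SendICMP API, and fakePinger is a hand-written test double standing in for a mock that a tool like mockery would generate.

```go
package main

import (
	"errors"
	"time"
)

// Pinger is a hypothetical narrow interface that lists only what this
// code needs from the external icmp library. The method name and
// signature are assumptions, not the library's real API.
type Pinger interface {
	Ping(target string, timeout time.Duration) (rtt time.Duration, err error)
}

// errTimeout is returned by the fake when no reply is configured.
var errTimeout = errors.New("timeout")

// fakePinger is a hand-written test double. It answers one entry of
// replies per call and times out once the entries are used up.
type fakePinger struct {
	replies []bool
	calls   int
}

func (f *fakePinger) Ping(target string, timeout time.Duration) (time.Duration, error) {
	i := f.calls
	f.calls++
	if i < len(f.replies) && f.replies[i] {
		return time.Millisecond, nil
	}
	return 0, errTimeout
}

// isUp is the code under test. It depends only on the interface,
// so a unit test can pass in the fake instead of sending real icmp.
func isUp(p Pinger, target string, timeout time.Duration) bool {
	_, err := p.Ping(target, timeout)
	return err == nil
}
```

In production, a thin adapter around the external lib would satisfy Pinger, and that adapter becomes the single place where the dependency is visible.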

Extensibility

I’ve decided to provide the package as both a lib and a binary. The only quirk is that the install URL ends with downtime/cmd/downtime, but since everyone would copy the go install command from my readme anyway, it doesn’t really matter.

In the end, the binary is just a specialized implementation of the watcher and reply checker. The watcher gets called twice: the first call blocks until the first reply times out, and the second call blocks until the first reply succeeds again. Users of the lib can bend this by writing their own checker.
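The following Go sketch shows how a watcher with a pluggable checker could look. It is only an assumption about the shape of such an API: Reply, Checker, watch, untilDown, untilUp and slowerThan are hypothetical names and not the actual identifiers of the downtime package.

```go
package main

import "time"

// Reply is a hypothetical result of a single probe.
type Reply struct {
	OK  bool          // true if a reply arrived before the timeout
	RTT time.Duration // round-trip time, only meaningful if OK is true
}

// Checker decides whether the watcher should stop at this reply.
type Checker func(Reply) bool

// watch calls probe once per interval and returns the time at which
// check accepted a reply for the first time.
func watch(probe func() Reply, check Checker, interval time.Duration) time.Time {
	for {
		if check(probe()) {
			return time.Now()
		}
		time.Sleep(interval)
	}
}

// untilDown and untilUp mirror the two calls described for the binary:
// stop at the first timed-out reply, then at the first successful one.
func untilDown(r Reply) bool { return !r.OK }
func untilUp(r Reply) bool   { return r.OK }

// slowerThan is an example of a custom checker: it also treats replies
// slower than limit as the start of a degradation.
func slowerThan(limit time.Duration) Checker {
	return func(r Reply) bool { return !r.OK || r.RTT > limit }
}
```

Calling watch first with untilDown and then with untilUp reproduces the downtime measurement, while swapping in slowerThan would measure how long a service stays slow rather than how long it is unreachable.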