measure downtime durations via icmp
How long does it take for the failover mechanism to fail-over? How long does it take for my server to be up again?
I recently had the challenge of measuring downtime durations. ping, on the one hand, lets me see when packet loss starts and stops, but it does not aggregate anything. mtr, on the other hand, aggregates the replies, but not into the metric I was looking for.
So I wrote my own tool, namely downtime. Aside from the plain measurement logic, I tried a few new things like interfaces for external libs and mocks, which are briefly addressed at the end.
Ping the host via icmp using a customized timeout and a small interval to pinpoint the start and end timestamp of the downtime. This can then be used to calculate the duration.
The timeout should be as low as possible. The RTT of my service was never higher than 25ms, so a 50ms timeout should be good to go. It should still tolerate the spikes that appear randomly, so that they don’t falsify the measurement. The interval should be pretty low as well, since it defines the error of the measured downtime: the outage can begin or end anywhere between two consecutive probes.
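To make the error introduced by the interval concrete, here is a minimal sketch. It assumes probes are sent at a fixed interval and ignores RTT and jitter; `outageBounds` is a hypothetical helper for illustration, not part of downtime. If n consecutive probes go unanswered, the real outage lasted at least (n-1) intervals and at most (n+1) intervals.

```go
package main

import (
	"fmt"
	"time"
)

// outageBounds returns the shortest and longest outage that is consistent
// with `lost` consecutive unanswered probes sent every `interval`.
// Hypothetical helper for illustration; RTT and jitter are ignored.
func outageBounds(interval time.Duration, lost int) (lo, hi time.Duration) {
	if lost <= 0 {
		// No lost probe observed, so no outage was measured.
		return 0, 0
	}
	// The outage covers at least the span from the first to the last lost probe ...
	lo = time.Duration(lost-1) * interval
	// ... and at most the span from the last answered probe before the outage
	// to the first answered probe after it.
	hi = time.Duration(lost+1) * interval
	return lo, hi
}

func main() {
	lo, hi := outageBounds(100*time.Millisecond, 175)
	fmt.Printf("downtime between %v and %v\n", lo, hi)
}
```

The gap between the two bounds is always two intervals, which is why a small interval matters more than anything else for precision.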
The icmp logic is based on my colleague’s Go rewrite of mtr, which provides the most high-level icmp interface I could find. This lets me use raw icmp packets without caring about low-level stuff like encoding the PID into the request body to fish the corresponding replies out of the ocean of icmp traffic.
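For a feeling of what that low-level stuff looks like, here is a rough stdlib-only sketch of building an ICMPv4 echo request and recognizing the matching reply. It assumes the common convention of putting the lower 16 bits of the PID into the identifier field of the echo header; whether the library does exactly this is an assumption, and `echoRequest` and `isOurReply` are hypothetical names. Nothing is sent over the network here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// checksum computes the Internet checksum (RFC 1071) over b.
func checksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(b[i:]))
	}
	if len(b)%2 == 1 {
		// Pad a trailing odd byte with a zero byte.
		sum += uint32(b[len(b)-1]) << 8
	}
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

// echoRequest builds an ICMPv4 echo request (type 8, code 0).
func echoRequest(id, seq uint16, payload []byte) []byte {
	msg := make([]byte, 8+len(payload))
	msg[0] = 8 // type: echo request
	msg[1] = 0 // code
	binary.BigEndian.PutUint16(msg[4:], id)
	binary.BigEndian.PutUint16(msg[6:], seq)
	copy(msg[8:], payload)
	// The checksum is computed while its own field is still zero.
	binary.BigEndian.PutUint16(msg[2:], checksum(msg))
	return msg
}

// isOurReply reports whether msg is an echo reply (type 0, code 0)
// that carries our identifier and sequence number.
func isOurReply(msg []byte, id, seq uint16) bool {
	return len(msg) >= 8 && msg[0] == 0 && msg[1] == 0 &&
		binary.BigEndian.Uint16(msg[4:]) == id &&
		binary.BigEndian.Uint16(msg[6:]) == seq
}

func main() {
	id := uint16(os.Getpid() & 0xffff)
	req := echoRequest(id, 1, []byte("downtime"))
	fmt.Printf("request: % x\n", req)
}
```

A high-level interface hides all of this, including the raw socket handling that is left out above.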
To answer the second of the two initial questions, “How long does it take for my server to be up again?”:
- start downtime with server ip
- restart server
- wait until server is up again and downtime terminated
For my cloud server with no further setup this took about 17.5s:
$ downtime --target 23.88.32.x
start: 2023-09-22 12:27:31.751644064 +0200 CEST m=+11.151409405
end: 2023-09-22 12:27:49.278929522 +0200 CEST m=+28.678694794
duration: 17.527285389s
Aside from the measurement logic, I tried a few things I’ve learned over the years to improve extensibility and maintainability.
- The dependency on the external lib is documented in one place, the interface. Usually you’d need to scan the whole code base for the external lib to check which functions are in use.
- Testing code that depends on complex external functions gets a lot easier. E.g. thanks to mocks, my unit tests don’t need to do real icmp stuff.
Interfaces can easily be mocked using a lib like mockery, e.g. the icmp mock for my icmp interface. As mentioned previously, the mock can be used to test the lib without calling the external library. The goal of testing should be to test your own code, not the external code.
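The idea can be sketched with the standard library alone. The example uses a hand-written fake as a stand-in for a mockery-generated mock, and the `Pinger` interface, `fakePinger` and `countLost` are hypothetical names that are not taken from downtime.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("no reply within timeout")

// Pinger is a hypothetical interface. It is the only place that says
// which functionality of the external icmp lib is actually needed.
type Pinger interface {
	Ping(target string, timeout time.Duration) error
}

// fakePinger replays scripted results instead of sending icmp packets.
type fakePinger struct {
	results []error
	calls   int
}

func (f *fakePinger) Ping(target string, timeout time.Duration) error {
	err := f.results[f.calls%len(f.results)]
	f.calls++
	return err
}

// countLost is the "own" code under test: it only knows the interface.
func countLost(p Pinger, target string, n int) int {
	lost := 0
	for i := 0; i < n; i++ {
		if err := p.Ping(target, 50*time.Millisecond); err != nil {
			lost++
		}
	}
	return lost
}

func main() {
	fake := &fakePinger{results: []error{nil, errTimeout, errTimeout, nil}}
	fmt.Println("lost:", countLost(fake, "192.0.2.1", 4)) // prints "lost: 2"
}
```

A generated mock additionally records expectations and call arguments, but the principle stays the same: the test decides what the "network" answers.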
I’ve decided to provide the package as both lib and binary. The only quirk is that the install url ends with downtime/cmd/downtime, but since everyone would copy the go install command from my readme anyway, it doesn’t really matter.
In the end, the binary is just a specialized implementation of the watcher and reply checker. The watcher gets called twice: the first call blocks until the first reply times out, and the second call blocks until the first reply succeeds again. Users of the lib can bend this by writing their own checker.
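The two-call pattern could look roughly like this. It is a simplified sketch, not the actual API of the lib: `Probe`, `Checker`, `watch` and `measure` are hypothetical names, and the probe is simulated instead of sending icmp packets.

```go
package main

import (
	"fmt"
	"time"
)

// Probe sends one echo request and reports whether a reply arrived in time.
type Probe func() bool

// Checker decides whether a probe result ends the current wait.
type Checker func(replied bool) bool

// watch probes every interval and blocks until check accepts a result.
// It returns the time at which that result was observed.
func watch(probe Probe, check Checker, interval time.Duration) time.Time {
	for {
		if check(probe()) {
			return time.Now()
		}
		time.Sleep(interval)
	}
}

// measure calls watch twice: first until a reply times out,
// then until a reply succeeds again.
func measure(probe Probe, interval time.Duration) (start, end time.Time) {
	start = watch(probe, func(replied bool) bool { return !replied }, interval)
	end = watch(probe, func(replied bool) bool { return replied }, interval)
	return start, end
}

func main() {
	// Simulated host: two replies, three timeouts, then it is back.
	results := []bool{true, true, false, false, false, true}
	i := 0
	probe := func() bool { ok := results[i]; i++; return ok }

	start, end := measure(probe, 10*time.Millisecond)
	fmt.Println("duration:", end.Sub(start))
}
```

Swapping in a different checker changes what counts as the start or end of the downtime, for example requiring several consecutive timeouts before declaring the host down.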