“We recently had an incident where we had a Redis instance blocked on writing to disk, hung inside the kernel.
Through a variety of other circumstances this meant the only up to date copy of business critical data was in memory on a single machine, multiple independent replicas and backups were unavailable, and that machine could crash at any moment.
The problem was eventually solved without data loss of any kind due to our Techops team being utterly brilliant, but during the “incident” a number of ideas that would otherwise be dismissed as absolutely crazy were discussed.
One of the ideas was based on the simple idea that redis was an in-memory database. We had access to the machine. And the repeated question came up of just copying all the memory off of the machine, or at least out of the process, and wrest our data from it’s grasp.
Although mostly not serious, at least to start with, as options lessened the idea of taking a core-dump and picking it apart looked more and more like a straw worth clutching at.
It ended up not being needed, which is a good thing, but I figured out that it was at least possible, if you are really really really desperate, and have some time on your hands. On the off-chance anyone is really that desperate in the future, I thought I’d at least list out what direction you would need to take, and wish you the best of luck.
(Past this point I’m going to assume an exceptional knowledge of C, a good knowledge of GDB, ELF loading, symbol lookup, the amd64 ABI used by Linux, memory layout and generally how the dynamic linker works. If interested check out the osdev.org wiki, especially on linking, ELF loaders etc and http://www.x86-64.org/documentation/abi.pdf – Or just fake it and read on)
Given you’re at point of desperation, you’re probably going to be limited in either time or what you can do. At the very least you are going to want to get a core dump of the redis process, taking into account that it’s going to have to write to somewhere and if you can’t just do a redis SAVE, then there is a chance that you can’t write to disk.
If you have to write to /tmp using tmpfs, realise that you might run out of memory at some point when writing it, and although the OOM killer probably won’t kick in, be very very sure first. You might be able to use the remote gdb stub/server to dump out the core over the network rather than locally in a pinch…”