Now this is what project documentation should look like! It explains what the project does, gives some short examples of the most common features, then goes into the details - while being easy to understand for the target audience. Kudos to whoever wrote this! (and rr sounds like a nice tool too ;) )
I fully agree. I wish more intro-level documentation had this kind of easy-to-follow, progressive level of detail.
All too many things like this either dive straight into the deep end, inundating you with superfluous details when all you want is a primer, or provide so little information as to be nearly useless.
Historically, rr has not worked on AMD processors, which is a bummer. However, I have been able to make good use of it on my 5950X now with the workaround script and newer versions of rr. This is good news.
I've not read their extended technical report, but I am kind of curious exactly what performance counters AMD is implementing poorly and how that impacts rr.
vchuravy's link gives the details, but basically, there's a microarchitecture optimization in Zen that breaks determinism of the performance counters. Fortunately, there's a chicken bit that turns it off, which is what the script does. I've been trying to convince AMD to officially document the bit such that the kernel can set it automatically, but no luck so far.
There is still one remaining annoyance, which is that AMD's NMI latency is super high, which directly tanks rr's reverse execution latency. There are probably some improvements that could be made to the replayer to be more aggressive about making optimistic assumptions about NMI latency and retrying if those assumptions turn out to be off, but it'd be a fair bit of work. I don't really understand why AMD decided to use this kind of architecture. It also makes profiles much less accurate.
Thanks for the info. I was wondering what was going on in that script. It’s unfortunate that their architectural decisions had to impact rr, but I guess these days, every last bit of benchmark score really matters.
For those who have used it, how useful is it for debugging multithreading heisenbugs? Can I let a process run under rr for days, wait until it crashes due to a heisenbug, and replay the trace without rr having to go through days of recording? i.e. is it possible to fast-forward the trace somehow?
(I nerd sniped myself a bit here, wondering how fast forwarding could be implemented. I think it might be achievable with periodic process memory snapshots and incremental traces.)
You probably could record a process running for days but it would also take days to replay to the end, which would not be much fun. We don't create checkpoints during the recording.
I've found that the scheduling quanta with chaos mode are too high to hit concurrency issues in a reasonable amount of time. And IIUC --num-cpu-ticks is not randomized. So if something happens below that tick quantum it's hard to hit.
I wonder if a) rr could randomize the cpu ticks as well, at least in chaos mode, b) profiled code could somehow hint to rr that a certain instruction would be an "interesting" scheduling point.
Chaos mode varies the scale of the tick quantum to try to catch stuff like that. It doesn't always work, especially if the window of vulnerability to the bug is incredibly small (e.g. a few instructions).
I could not reproduce the bug in less than an hour of run time, which meant that analysing the bug in gdb required an hour for it to run forward to the crash point, after which it was possible to skip back and forth.
rr numbers each 'event' it records, and you can pass an event number to the gdb 'run' command to tell it to start from that event. Recent rr also supports the -e option to replay, meaning 'start the debug session pointing at the last recorded event, whatever that was'. Details in the usage page: https://github.com/rr-debugger/rr/wiki/Usage
AIUI you get 'start at an event' basically for free, because 'step backward' is implemented as 'start at the preceding event and then step forward by N', so events are frequent in the trace and the machinery to get to that point without running all the way from the start of the debug session exists anyway. There's some stuff on the website about how this is all implemented, I think.
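A rough sketch of how those two options fit together in practice (program name and event numbers are made up, and the exact output may differ; see the Usage page above for the real details):

    $ rr record ./myprog            # record once; the trace is saved for later replays
    $ rr replay -e                  # open a gdb session positioned at the last recorded event
    (rr) when                       # rr's gdb extension that prints the current event number
    (rr) run 31200                  # restart the replay from (hypothetical) event 31200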
I haven't used it, but it might be quite useful. It does force code to run on a single core, so you won't get truly concurrent execution which I guess might hide some multithreading bugs. On the other hand it does come with "chaos mode" which is basically thread schedule fuzzing.
You can "fast forward" the trace as you imagine. rr works by recording all non-deterministic input and output to the program so it can start from the core dump and step backwards.
As I understand it anyway; I've never actually used it - the one time I really wanted something like rr was on a Mac.
> You can "fast forward" the trace as you imagine. rr works by recording all non-deterministic input and output to the program so it can start from the core dump and step backwards.
Not exactly. rr can't magically inflate a core dump into all the open file descriptors and other state accumulated during a process's execution. It needs to run from the beginning.
So starting from the beginning, you can let it run to any arbitrary point. (And there are ways of knowing useful points, eg if you record with -M it will print out event counts with anything written to stdout/stderr, so you can quickly run with -g to start debugging at the point that message was emitted.) But it does still need to run from the beginning. And since you're recording a whole process tree, you need to start from the initial process and let it run forward to your requested point in time.
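A sketch of that -M/-g workflow (the program, its output, and the event numbers are made up, and the exact annotation format may not be quite what I show here):

    $ rr record -M ./server
    [rr 12345 67890] listening on port 8080       # -M tags output lines with pid and event number
    [rr 12345 70211] request failed: bad state
    $ rr replay -g 70211                          # start the debug session at that event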
In practice, I usually use it by starting a replay, continuing forward to a crash (or a breakpoint at some line if it didn't crash), and only then starting to pay attention to what's going on. It's a simple, muscle-memory process to get to that point, and if it was a long recording you kind of start it up and wait until it's ready. (Which will take roughly as long as the initial run took to get to the same point. A little slower because of the overhead, a little faster because it doesn't actually have to wait for I/O, averaging out to a mostly unnoticeable amount slower.)
I always have to mention: one of my favorite things about rr is something that doesn't even require all the sophisticated machinery. I often want to debug a single process within a whole process tree, and with most things there aren't --debugger flags (or they're broken). With rr, I can just record the whole tree, then pick out the process I care about after the fact. It's a small thing, but it saves me from my usual hairball of wrapper scripts.
Random example: when debugging a gcc plugin, I record a call to gcc, but the actual compile I care about is done by a forked cc1plus process.
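In concrete terms it looks something like this; the pids and output are made up, and I'm writing the replay flag from memory, so double-check rr replay --help for the exact option that attaches to a particular recorded process:

    $ rr record gcc -fplugin=./myplugin.so test.cpp
    $ rr ps                                   # list every process in the recorded tree
    PID     PPID    EXIT    CMD
    12345   --      0       gcc -fplugin=./myplugin.so test.cpp
    12346   12345   0       cc1plus ...
    $ rr replay -p cc1plus                    # debug the forked cc1plus rather than the gcc driver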
But rr does squash all your threads into a single virtual cpu core. It context-switches between them, but ultimately only one of them is running at a time. This makes it hard to capture some kinds of bugs. To compensate it also has a chaos mode that randomly stops switching between the threads fairly (starving some and giving others more than their fair share) in the hopes of triggering those same bugs.
For most uses rr is a major win, but for race conditions it sometimes doesn’t help.
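When I do need to chase one of those races, the usual trick is just to re-record under chaos mode until the failure shows up; the recording that finally fails is then fully deterministic to replay. A minimal sketch, assuming the test signals failure via its exit status:

    # keep re-recording under chaos mode until the test fails
    # (assumes rr record forwards the tracee's exit status, which I believe it does)
    while rr record --chaos ./mytest; do :; done
    rr replay        # replay the most recent recording, i.e. the failing one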
And a lot of multi-threaded software that previously worked fine on single-core machines required urgent fixes as multi-core CPUs became more common.
rr is a life changing experience for debugging things. One underrated thing is being able to save and share rr traces. rr + CI makes finding and potentially fixing heisenbugs a lot easier.
This kind of replayable debugging can be wonderful - especially for hard to debug issues like heap corruption and such.
Windows has something similar called Time Travel Debugging[1] but in my experience the dump files it creates can be enormous and be a pain to analyze as a result. (It also relies on WinDbg which while being extremely powerful and capable, has a huge learning and usability cliff. I’ve been using it for over a decade and I still need a cheat sheet from time to time. The revamped WinDbg Preview[2] improves the UI a lot, but ultimately it’s still WinDbg.)
As jwilk correctly says, reposts are fine after a year or so. Pointing to previous links with comments is just to satisfy users who might be curious for more. You did good!
No, that has been done but is much slower. It records all communication with the external world. The full answer is well described in https://arxiv.org/pdf/1705.05937.pdf which I'll quote here:
> We identify a boundary around state and computation, record all sources of nondeterminism within the boundary and all inputs crossing into the boundary, and reexecute the computation within the boundary by replaying the nondeterminism and inputs. If all inputs and nondeterminism have truly been captured, the state and computation within the boundary during replay will match that during recording.
So for any chunk of time spent entirely in user space doing computation, the replay starts out in the same situation and executes in exactly the same way the original process did, with zero overhead. That's what enables rr to be so low overhead overall; most programs spend the bulk of their time computing stuff and reading/writing memory. The replayed process has no way of knowing that its file descriptors aren't actually open, since anything it does with them will be provided by the recording. Quoting again:
> In particular, user-space memory and register values are preserved exactly, with a few exceptions noted later in the paper. This implies CPU-level control flow is identical between recording and replay, as is memory layout.
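To make that concrete with a toy example of my own (not from the paper): in something like the C program below, only the marked inputs cross the boundary and need to be saved in the trace; the hashing loop is replayed simply by running it again, and given the same inputs it necessarily does the same thing.

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        char buf[64];

        /* Crosses the boundary: the bytes that read() returns are an input,
           so rr records them and feeds the same bytes back during replay. */
        ssize_t n = read(0, buf, sizeof buf);

        /* Crosses the boundary: time() is nondeterministic, so its result
           is recorded and replayed verbatim. */
        time_t t = time(NULL);

        /* Pure user-space computation: nothing is recorded here. On replay
           these instructions just execute again and, with identical inputs
           and memory, produce identical results at essentially no overhead. */
        unsigned long h = 0;
        for (ssize_t i = 0; i < n; i++)
            h = h * 31u + (unsigned char)buf[i];

        printf("%lu %ld\n", h, (long)t);
        return 0;
    }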
If you use https://pernos.co/ then you don't need any of this, but I have a set of only slightly buggy gdbinit scripts that extend the rr debugging experience at <https://github.com/hotsphink/sfink-tools>. The main things it adds are:
1. a `log` command that just records whatever you give it into a plaintext file, together with its "point in time" according to rr. This is useful because when using rr, you tend to move forward and backward in time a lot, and it's hard to keep track of the actual sequence of events and where you are within them. It also creates a checkpoint so you can return to any one of your log points. It also has some niceties like replacing any expression enclosed in curly brackets with the results of executing the gdb expression given, so you can do things like
log starting execution of Init() with v={v}. About to crash.
2. a `label` command that lets you assign names to random hex values. Then in the output of `p expr` or the above `log` with no arguments, which displays the full set of log messages you've recorded, it will replace known hex values with their labels. This is so much nicer than memorizing numeric values and matching them up.
(rr) p obj
$1 = (JSObject*) 0x7f606892a200
(rr) label OUTER_OBJ=obj
(rr) p $OUTER_OBJ
(JSObject*) $OUTER_OBJ
(rr) log
701/31299795 [c4] starting processing with obj=(JSObject*) $OUTER_OBJ
983/31299 [c2] starting processing with obj=(JSObject*) 0x7f6068a2a200
2081/7382911 [c3] traversing to (JSObject*) 0x7f6069c2a7e8
3316/199 [c1] crashing while accessing field of object (JSObject*) $OUTER_OBJ
The [c2] markers are the automatically-created checkpoints, numbered in the order that you made those log entries in the debugger. It reorders the log messages to show them in execution order rather than debugging order. Pernosco has a very similar feature called the Notebook (where you only have to click on a log entry to view the state at that point in time).
The scripts are also intended for sharing log files and labels between multiple concurrent replays of the same execution, which I find useful for keeping separate windows that each maintain a different context (point in time, and the portion of the execution I'm examining). That tends to be the buggier part of the scripts, though. ;-)
If you're working with C or C++ (or Rust? haven't tried it), rr really is a superpower. I rarely bother using straight gdb anymore. It feels crippled.
It works with optimized builds, and it works better with them than gdb does.
When you debug an optimized build with debug info in gdb by stepping line by line, it is easy to accidentally step "too far" and completely lose your place. In rr, you can always step back and recover.
You might be relegated to stepping through disassembled machine code. I was able to use rr with a homemade JIT compiler, stepping through JIT’d instructions. So I see no reason why you can’t at least get that experience with a production binary.
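For reference, the instruction-level experience is just the ordinary gdb commands plus their reverse counterparts, e.g.:

    (rr) x/5i $pc            # disassemble a few instructions at the current position
    (rr) stepi               # forward one machine instruction
    (rr) reverse-stepi       # back one machine instruction
    (rr) reverse-continue    # run backwards to the previous breakpoint or watchpoint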
For those reading the above comments who haven't clicked the article link yet (like me):
>rr aspires to be your primary C/C++ debugging tool for Linux, replacing — well, enhancing — gdb. You record a failure once, then debug the recording, deterministically, as many times as you want. The same execution is replayed every time.
>rr also provides efficient reverse execution under gdb. Set breakpoints and data watchpoints and quickly reverse-execute to where they were hit.
* gdb: >1000x (note: I never tested this one myself; just heard about this overhead in an HN comment a long time ago)
* Microsoft WinDbg TimeTravel Debugger: >40x
* rr: 1.5x
rr is the only one fast enough to be used on a regular basis -- the others are slow enough that they only make sense on particularly nasty bugs (usually memory corruption)
I would be surprised if the overhead of TTD is typically 40x, given that it records multithreaded processes in parallel. Which, to my knowledge, rr does not.
It also supports selective recording so, if this is configured (e.g. selecting certain functions), only a subset of the process execution is actually committed to the trace file, further reducing the overhead.
I don't know about multithreaded processes -- the program I use it on is single-threaded.
My main use case is not making the crashes reproducible (they usually already are), but understanding where a bogus value is coming from (memory breakpoint + run backwards; a sketch follows below). The culprit often ends up being a certain third-party library that makes liberal use of C unions, sometimes accessing the wrong variant...
I'll have to look into selective recording, but I'm not sure how helpful it'll be in my use case (I don't know said library well enough to predict which functions might be causing the bogus values)
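For anyone who hasn't tried that memory-breakpoint-plus-reverse workflow, it's roughly this (the expression is a placeholder):

    (rr) continue                    # run forward to the crash
    (rr) print obj->kind             # hypothetical field holding the bogus value
    (rr) watch -l obj->kind          # hardware watchpoint on that memory location
    (rr) reverse-continue            # run backwards to the write that clobbered it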
There is also a commercial alternative, https://undo.io/solutions/products/udb/ though I have not used it myself and I don’t know what its overhead is. (I know some of the people who work on it.)
Using GDB’s reverse execution requires you to already be pretty sure where the bug you are looking for is, and then to record only a very short portion of the program, preferably just 10k instructions or so. Recording for a whole second could easily take 15 minutes. But it does work well, within those limitations.
gdb claims it. I have not once ever gotten it to work, however. For anything seemingly larger than a trivial program, the reverse-execution state grows too big and needs to be pruned. I also don't think it supports such fancy things as "floating point".
Yea, I wouldn’t call it a replacement. It acts as a GDB debugging target; basically you connect a GDB process to rr and GDB controls rr for you. (To confuse things further, the “rr replay” command starts both rr and GDB for you, so it can be difficult to see the seams.)