What rr does (rr-project.org)
880 points by mmarq on June 4, 2022 | 81 comments


Now this is what project documentation should look like! It explains what the project does, gives some short examples of the most common features, then goes into the details, while staying easy to understand for the target audience. Kudos to whoever wrote this! (and rr sounds like a nice tool too ;) )


I fully agree. I wish more intro-level documentation had this kind of easy-to-follow, progressive level of detail.

All too many docs either dive straight into the deep end, inundating you with superfluous details when all you want is a primer, or provide so little information as to be nearly useless.

The writers on this did a great job.


Historically, rr has not worked on AMD processors, which is a bummer. However, I have been able to make good use of it on my 5950X now with the workaround script and newer versions of rr. This is good news.

I've not read their extended technical report, but I am kind of curious exactly what performance counters AMD is implementing poorly and how that impacts rr.


vchuravy's link gives the details, but basically, there's a microarchitecture optimization in Zen that breaks determinism of the performance counters. Fortunately, there's a chicken bit that turns it off, which is what the script does. I've been trying to convince AMD to officially document the bit such that the kernel can set it automatically, but no luck so far.
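(For the curious: a workaround script like that is usually just a few lines of shell around msr-tools. The sketch below is purely hypothetical; the MSR address and bit are placeholders, not the real values, which are documented on the rr Zen wiki.)

    # hypothetical sketch only -- MSR and bit are placeholders, see the rr Zen wiki
    MSR=0xc0010000    # placeholder MSR address
    BIT=0             # placeholder bit position
    modprobe msr
    for dev in /dev/cpu/[0-9]*; do
        cpu=${dev##*/}
        cur=$(rdmsr -p "$cpu" "$MSR")                       # read the current value
        wrmsr -p "$cpu" "$MSR" $(( 0x$cur | (1 << BIT) ))   # set the chicken bit
    done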

There is still one remaining annoyance, which is that AMD's NMI latency is very high, which directly hurts rr's reverse-execution performance. There are probably some improvements that could be made to the replayer to be more aggressive about optimistic assumptions on NMI latency and to retry when those assumptions turn out to be wrong, but it'd be a fair bit of work. I don't really understand why AMD decided to use this kind of architecture; it also makes profiles much less accurate.


Thanks for the info. I was wondering what was going on in that script. It’s unfortunate that their architectural decisions had to impact rr, but I guess these days, every last bit of benchmark score really matters.



This is incredible!

For those that have used it, how useful is it for debugging multithreading heisenbugs? Can I let a process run under rr for days, wait until it crashes due to a heisenbug, and replay the trace without rr having to go through days of recording? i.e. is it possible to fast-forward the trace somehow?

(I nerd sniped myself a bit here, wondering how fast forwarding could be implemented. I think it might be achievable with periodic process memory snapshots and incremental traces.)


You probably could record a process running for days but it would also take days to replay to the end, which would not be much fun. We don't create checkpoints during the recording.

You'd be better off restarting the recording periodically. Also, rr has a "chaos mode" which randomizes scheduling and often makes threading bugs easier to reproduce. https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mo...
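For anyone who hasn't tried it, chaos mode is just an extra flag on record (the test binary name here is made up):

    rr record --chaos ./mytest    # record with randomized scheduling decisions
    rr replay                     # replay the most recent recording under gdb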


Chaos mode works quite well in my experience, definitely worth trying if you don't want to wait for days.

I had a heisenbug which would appear once a week, and that I couldn't trigger on my workstation. Chaos mode did the trick.


I've found that the scheduling quanta with chaos mode are too high to hit concurrency issues in a reasonable amount of time. And IIUC --num-cpu-ticks is not randomized. So if something happens below that tick quantum it's hard to hit.

I wonder if a) rr could randomize the cpu ticks as well, at least in chaos mode, b) profiled code could somehow hint to rr that a certain instruction would be an "interesting" scheduling point.


Chaos mode varies the scale of the tick quantum to try to catch stuff like that. It doesn't always work, especially if the window of vulnerability to the bug is incredibly small (e.g. a few instructions).


Hm. Is it possible that that works better with multi-threaded than multi-process programs?


That shouldn't make any difference.


Hmm, when I asked for 'replay -e' I thought it would be faster than 'type run and wait' -- is it not?


No.


I have not used rr heavily, but I did use it to help find a multithreading heisenbug in BIND https://kb.isc.org/docs/aa-01606

I could not reproduce the bug in less than an hour of run time, which meant that analysing the bug in gdb required an hour for it to run forward to the crash point, after which it was possible to skip back and forth.


rr numbers each 'event' it records, and you can pass an event number to the gdb 'run' command to tell it to start from that event. Recent 'rr' now also supports the -e option to replay meaning 'start the debug session pointing at the last recorded event, whatever that was'. Details in the usage page: https://github.com/rr-debugger/rr/wiki/Usage

AIUI you get 'start at an event' basically for free, because 'step backward' is implemented as 'start at the preceding event and then step forward by N', so events are frequent in the trace and the machinery to get to that point without running all the way from the start of the debug session exists anyway. There's some stuff on the website about how this is all implemented, I think.
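Concretely, that looks something like this (the event number is made up):

    rr replay -e         # start the debug session at the last recorded event
    (rr) run 12345       # or, from inside a session, restart the replay at event 12345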


I haven't used it, but it might be quite useful. It does force code to run on a single core, so you won't get truly concurrent execution which I guess might hide some multithreading bugs. On the other hand it does come with "chaos mode" which is basically thread schedule fuzzing.

You can "fast forward" the trace as you imagine. rr works by recording all non-deterministic input and output to the program so it can start from the core dump and step backwards.

As I understand it anyway; I've never actually used it - the one time I really wanted something like rr was on a Mac.


> You can "fast forward" the trace as you imagine. rr works by recording all non-deterministic input and output to the program so it can start from the core dump and step backwards.

Not exactly. rr can't magically inflate a core dump into all the open file descriptors and other state accumulated during a process's execution. It needs to run from the beginning.

So starting from the beginning, you can let it run to any arbitrary point. (And there are ways of finding useful points, e.g. if you record with -M it will print out event counts with anything written to stdout/stderr, so you can quickly run with -g to start debugging at the point that message was emitted.) But it does still need to run from the beginning. And since you're recording a whole process tree, you need to start from the initial process and let it run forward to your requested point in time.
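In concrete terms (the program and event number are illustrative):

    rr record -M ./server    # stdout/stderr writes get event-number annotations
    rr replay -g 48235       # start the debug session at event 48235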

In practice, I usually use it by starting a replay, continuing forward to a crash (or a breakpoint at some line if it didn't crash), and only then starting to pay attention to what's going on. It's a simple, muscle-memory process to get to that point, and if it was a long recording you kind of start it up and wait until it's ready. (Which will take roughly as long as the initial run took to get to the same point. A little slower because of the overhead, a little faster because it doesn't actually have to wait for I/O, averaging out to a mostly unnoticeable amount slower.)

I always have to mention: one of my favorite things about rr is something that doesn't even require all the sophisticated machinery. I often want to debug a single process within a whole process tree, and with most things there aren't --debugger flags (or they're broken). With rr, I can just record the whole tree, then pick out the process I care about after the fact. It's a small thing, but it saves me from my usual hairball of wrapper scripts.

Random example: when debugging a gcc plugin, I record a call to gcc, but the actual compile I care about is done by a forked cc1plus process.
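The flow for that is roughly the following (the plugin name is made up; see rr's usage docs for the flags):

    rr record gcc -fplugin=./myplugin.so test.cpp   # record the whole process tree
    rr ps                                           # list the recorded processes and their pids
    rr replay -p cc1plus                            # debug the forked cc1plus, not the gcc driver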


> For those that have used it, how useful it is for debugging multithreading heisenbugs?

Not that useful, because as it says on the page (under "Limitations"):

> emulates a single-core machine.


Multithreading does not require multiple processors; it existed long before SMP was a thing.


But rr does squash all your threads onto a single virtual cpu core. It context-switches between them, but ultimately only one of them is running at a time. This makes it hard to capture some kinds of bugs. To compensate, it also has a chaos mode that randomly stops scheduling the threads fairly (starving some and giving others more than their fair share) in the hope of triggering those same bugs.

For most uses rr is a major win, but for race conditions it sometimes doesn’t help.


And a lot of multi-threaded software that previously worked fine on single-core machines required urgent fixes as multi-core CPUs became more common.


Only somewhat useful, as it runs everything on a single core, last I checked. That will prevent some forms of heisenbugs from happening.


Next question: can I do the same with a multi-node system?


rr is a life changing experience for debugging things. One underrated thing is being able to save and share rr traces. rr + CI makes finding and potentially fixing heisenbugs a lot easier.
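For anyone who hasn't shared a trace before, the rough flow is (paths are the defaults as I understand them):

    rr record ./flaky-test
    rr pack                       # bundle required files into the trace so it's portable
    # copy the trace directory (by default under ~/.local/share/rr) to the
    # other machine, then there:
    rr replay /path/to/copied-trace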


This kind of replayable debugging can be wonderful - especially for hard to debug issues like heap corruption and such.

Windows has something similar called Time Travel Debugging[1] but in my experience the dump files it creates can be enormous and be a pain to analyze as a result. (It also relies on WinDbg which while being extremely powerful and capable, has a huge learning and usability cliff. I’ve been using it for over a decade and I still need a cheat sheet from time to time. The revamped WinDbg Preview[2] improves the UI a lot, but ultimately it’s still WinDbg.)

[1] https://docs.microsoft.com/en-us/windows-hardware/drivers/de...

[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/de...



Thanks! Macroexpanded:

Instant replay: Debugging C and C++ programs with rr - https://news.ycombinator.com/item?id=27034588 - May 2021 (66 comments)

Using time travel to remotely debug faulty DRAM - https://news.ycombinator.com/item?id=24589597 - Sept 2020 (62 comments)

Time Traveling Linux Bug Reporting: Coming in Julia 1.5 - https://news.ycombinator.com/item?id=23069372 - May 2020 (21 comments)

rr: lightweight recording and deterministic debugging - https://news.ycombinator.com/item?id=18388879 - Nov 2018 (52 comments)

Rr 5.0 Released - https://news.ycombinator.com/item?id=15191445 - Sept 2017 (3 comments)

Debugging Leaks with rr - https://news.ycombinator.com/item?id=10573308 - Nov 2015 (4 comments)

Back to the Futu-Rr-e: Deterministic Debugging with Rr - https://news.ycombinator.com/item?id=10492664 - Nov 2015 (9 comments)

Rr 4.0 Debugger Released with Reverse Execution - https://news.ycombinator.com/item?id=10441618 - Oct 2015 (11 comments)

Rr records nondeterministic executions and debugs them deterministically - https://news.ycombinator.com/item?id=8817954 - Dec 2014 (9 comments)

Rr 3.0 Released with x86-64 Support - https://news.ycombinator.com/item?id=8734502 - Dec 2014 (6 comments)

Porting rr to x86-64 - https://news.ycombinator.com/item?id=8543624 - Nov 2014 (9 comments)


Sorry, I didn’t know that. I thought the site would recognise duplicated links.


From the FAQ <https://news.ycombinator.com/newsfaq.html>:

> Are reposts ok?

> If a story has not had significant attention in the last year or so, a small number of reposts is ok.


As jwilk correctly says, reposts are fine after a year or so. Pointing to previous links with comments is just to satisfy users who might be curious for more. You did good!


It's quite common for people to refer to old discussions (here 2+ years ago) for popular projects like rr.


Yes, because there may be something to learn from the old comments and the new. It's good. Different people comment in different eras.


Reposting is OK. I once got an email from HN staff inviting me to repost a link.


Breakpoint plus reverse watch is incredibly powerful. It makes it trivial to find the code that last modified a variable before a breakpoint.
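e.g. something like this (the function and variable names are made up):

    (rr) break handle_request      # stop somewhere the value is already wrong
    (rr) continue
    (rr) watch -l obj->state       # hardware watchpoint on that memory location
    (rr) reverse-continue          # lands on the last write to it before the breakpoint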


Makes every day seem like Talk Like a Pirate Day.


Is there a counterpart of this in Python?



There was an attempt for Python 2, but it didn't catch on.


This sounds like a debugger I might actually enjoy using (unlike all the others).


rr is a superpower, but pernosco is several superpowers (https://pernos.co/; it’s built on rr). I recommend both!


I second this. Pernosco is just unbelievable.


How does this work, exactly? Is it recording every state change in the program?


No, that has been done but is much slower. It records all communication with the external world. The full answer is well described in https://arxiv.org/pdf/1705.05937.pdf which I'll quote here:

> We identify a boundary around state and computation, record all sources of nondeterminism within the boundary and all inputs crossing into the boundary, and reexecute the computation within the boundary by replaying the nondeterminism and inputs. If all inputs and nondeterminism have truly been captured, the state and computation within the boundary during replay will match that during recording.

So for any chunk of time spent entirely in user space doing computation, the replay starts out in the same situation and executes in exactly the same way the original process did, with zero overhead. That's what enables rr to be so low overhead overall; most programs spend the bulk of their time computing stuff and reading/writing memory. The replayed process has no way of knowing that its file descriptors aren't actually open, since anything it does with them will be provided by the recording. Quoting again:

> In particular, user-space memory and register values are preserved exactly, with a few exceptions noted later in the paper. This implies CPU-level control flow is identical between recording and replay, as is memory layout.


That is very cool, and very clever! :)


If you use https://pernos.co/ then you don't need any of this, but I have a set of only slightly buggy gdbinit scripts that extend the rr debugging experience at <https://github.com/hotsphink/sfink-tools>. The main things it adds are:

1. a `log` command that just records whatever you give it into a plaintext file, together with its "point in time" according to rr. This is useful because when using rr, you tend to move forward and backward in time a lot, and it's hard to keep track of the actual sequence of events and where you are within them. It also creates a checkpoint so you can return to any one of your log points. It also has some niceties like replacing any expression enclosed in curly brackets with the results of executing the gdb expression given, so you can do things like

    log starting execution of Init() with v={v}. About to crash.
2. a `label` command that lets you assign names to random hex values. Then in the output of `p expr` or the above `log` with no arguments, which displays the full set of log messages you've recorded, it will replace known hex values with their labels. This is so much nicer than memorizing numeric values and matching them up.

    (rr) p obj
    $1 = (JSObject*) 0x7f606892a200
    (rr) label OUTER_OBJ=obj
    (rr) p $OUTER_OBJ
    (JSObject*) $OUTER_OBJ
    (rr) log
    701/31299795 [c4] starting processing with obj=(JSObject*) $OUTER_OBJ
    983/31299 [c2] starting processing with obj=(JSObject*) 0x7f6068a2a200
    2081/7382911 [c3] traversing to (JSObject*) 0x7f6069c2a7e8
    3316/199 [c1] crashing while accessing field of object (JSObject*) $OUTER_OBJ
The [c2] markers are the automatically-created checkpoints, numbered in order that you made those log entries in the debugger. It reorders the log messages to show them in execution order rather than debugging order. Pernosco has a very similar feature called the Notebook (where you only have to click on a log entry to view the state at that point in time.)

The scripts are also intended for sharing log files and labels between multiple concurrent replays of the same execution, which I find useful to have separate windows each maintaining a different context (point in time, and portion of the execution that I'm examining.) That tends to be the buggier part of the scripts, though. ;-)

If you're working with C or C++ (or Rust? haven't tried it), rr really is a superpower. I rarely bother using straight gdb anymore. It feels crippled.


Does rr require debug builds? Like if I took a random executable on Linux and used rr record, would rr replay work?


It works with optimized builds, and it works better with them than gdb does.

When you debug an optimized build with debug info in gdb by stepping line by line, it is easy to accidentally step "too far" and completely lose your place. In rr, you can always step back and recover.


You might be relegated to stepping through disassembled machine code. I was able to use rr with a home made JIT compiler, stepping through JIT’d instructions. So I see no reason why you can’t at least get that experience with a production binary.
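i.e. even with no debug info you can do roughly this (binary name is a placeholder):

    rr record ./release-binary
    rr replay
    (rr) display/i $pc      # show the current instruction at each stop
    (rr) stepi              # step one machine instruction forward
    (rr) reverse-stepi      # ...and one back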


What I would give to have something like this that worked on Mac and Windows too.


Eager for the day someone integrates this into VS Code.



Wow, thanks. I will give it a try!!


FWIW, rr integrates into gdb, so it should be possible to use anything that integrates with gdb.

https://github.com/rr-debugger/rr/wiki/Using-rr-in-an-IDE
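IIRC you can also have rr serve the replay on a port and point any gdb front end at it, something like this (the port number is arbitrary):

    rr replay -s 50505                               # serve the replay for an external debugger
    gdb -ex 'target remote 127.0.0.1:50505' ./mybinary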


Man. I wish I had this for Typescript.

Well done!


Could you next time provide some small but meaningful description in the title? "Rr" is a little bit short in my opinion.


For those reading the above comment, but hadn't clicked the article link yet (like me):

>rr aspires to be your primary C/C++ debugging tool for Linux, replacing — well, enhancing — gdb. You record a failure once, then debug the recording, deterministically, as many times as you want. The same execution is replayed every time.

>rr also provides efficient reverse execution under gdb. Set breakpoints and data watchpoints and quickly reverse-execute to where they were hit.


It also becomes a bit of a cheat in terms of HN points, as mobile users miss the link and hit the up arrow next to it.

https://news.ycombinator.com/item?id=30906989


The RR debugger


maybe "rr, a gdb replacement"


That feels like it's underselling it a bit, since gdb does not have reverse execution, which is a pretty major contribution by rr.


AIUI, gdb does claim reverse execution for certain targets. So there are differences, but I don't understand them.


gdb's reverse execution is incredibly slow.

Performance overhead of reverse debuggers:

* gdb: >1000x (note: I never tested this one myself; just heard about this overhead in a HN comment a long time ago)

* Microsoft WinDbg TimeTravel Debugger: >40x

* rr: 1.5x

rr is the only one fast enough to be used on a regular basis -- the others are slow enough that they only make sense on particularly nasty bugs (usually memory corruption)


I would be surprised if the overhead of TTD is typically 40x, given that it records multithreaded processes in parallel. Which, to my knowledge, rr does not.

It also supports selective recording so, if this is configured (e.g. selecting certain functions), only a subset of the process execution is actually committed to the trace file, further reducing the overhead.


I don't know about multithreaded processes -- the program I use it is single-threaded. My main use case is not to make the crashes reproducible (they usually already are), but to understand where a bogus value is coming from. (memory breakpoint + run backwards) Which often ends up being a certain third-party library that makes liberal use of C unions, sometimes accessing the wrong variant...

I'll have to look into selective recording, but I'm not sure how helpful it'll be in my use case (I don't know said library well enough to predict which functions might be causing the bogus values)


Would we recognize the blamed library?


There is also a commercial alternative, https://undo.io/solutions/products/udb/ though I have not used it myself and I don’t know what its overhead is. (I know some of the people who work on it.)


Using GDB’s reverse execution requires you to already be pretty sure where the bug you are looking for is, and then recording a very short portion of the program, preferably just 10k instructions or so. Recording for a whole second could easily take 15 minutes. But it does work well, within those limitations.


gdb claims it. I have not once ever gotten it to work, however. For anything seemingly larger than a trivial program, the reverse-execution state grows too big and needs to be pruned. I also don't think it supports such fancy things as "floating point".


> gdb does not have reverse execution

GDB does have reverse execution:

https://sourceware.org/gdb/current/onlinedocs/gdb/Reverse-Ex...
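For reference, with gdb's built-in recorder it looks roughly like this (function name made up):

    (gdb) start                  # run to a temporary breakpoint at main
    (gdb) record full            # begin gdb's own instruction-level recording
    (gdb) break some_function
    (gdb) continue
    (gdb) reverse-continue       # run backwards within the recorded region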


It doesn't look like it's a replacement; it's more of a companion tool for gdb to deterministically record, replay and debug a process after the fact.


Yea, I wouldn’t call it a replacement. It acts as a GDB debugging target; basically you connect a GDB process to rr and GDB controls rr for you. (To confuse things further, the “rr replay” command starts both rr and GDB for you, so it can be difficult to see the seams.)


Can I edit it?


You could but really the title is fine, just like reposting after sufficient time is fine. It's ok for a title to require a click.


Normally yes, but maybe not so late after submission. Moderators can update it.


https://github.com/rr-debugger/rr#system-requirements :

"rr currently requires either:

    - An Intel CPU with Nehalem (2010) or later microarchitecture.
    - Certain AMD Zen or later processors (see https://github.com/rr-debugger/rr/wiki/Zen)"


We already have https://r-project.org. Now we have https://rr-project.org. So, https://rrr-project.org is next?


It is actually a geometric progression, so https://rrrr-project.org would be the next one.


Common misconception. Actually it's a Fibonacci sequence, so the next one really is https://rrr-project.org and then it's https://rrrrr-project.org.

This does also mean that there's https://-project.org, and that https://r-project.org secretly disambiguates into two different projects.


Looks like a cool debugging tool, but I clicked the link because I thought maybe it was related to R. Maybe modify the title to make it clearer?



