Hacker News | BigRedEye's comments

At Perforator, we also started from Google's beautiful pprof, but then eliminated all nested repeated fields, converging on https://github.com/yandex/perforator/blob/main/perforator/pr.... Repeated fields in protobufs are really memory- and CPU-hungry.

This layout allows us to quickly merge hundreds of millions of samples into a single profile. The only practical limit is protobuf's 2GB message size cap.
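A toy sketch of why a flat layout merges cheaply: if each profile maps an interned stack key to a sample count (instead of nesting repeated sample/location messages), merging millions of samples is just a dictionary sum. This is an illustration of the idea, not Perforator's actual schema.

```python
# Toy model: a "flat" profile maps an interned stack key (tuple of
# frame ids) to a sample count, so merging is a plain counter sum
# with no nested repeated fields to walk. Illustrative only.
from collections import Counter

MAX_MESSAGE_BYTES = 2**31 - 1  # protobuf's hard 2 GB message cap

def merge_profiles(profiles):
    """Sum sample counts across profiles keyed by interned stack."""
    merged = Counter()
    for profile in profiles:
        merged.update(profile)
    return merged

# Three tiny profiles, e.g. from different hosts.
p1 = {("main", "foo"): 10, ("main", "bar"): 5}
p2 = {("main", "foo"): 7}
p3 = {("main", "bar"): 1, ("main", "baz"): 2}

merged = merge_profiles([p1, p2, p3])
print(merged[("main", "foo")])  # 17
```

The merged result stays a single message, which is why the 2GB protobuf cap becomes the practical ceiling.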


We had reached out to y'all last year to explore taking ideas from your format. It definitely looks interesting! But IIRC nobody from your team ended up making it to one of our SIG meetings?

https://github.com/yandex/perforator/issues/13


I believe this is a case of convergent invention – the idea of pushing DWARF/.eh_frame unwinding into eBPF seems to have occurred to several people around the same time. For example, there's a working implementation discussed as early as March 2021: https://github.com/iovisor/bcc/issues/1234#issuecomment-7875...

At Yandex we have a similar profiler that supports native languages seamlessly, in addition to Python and Java: https://github.com/yandex/perforator. It's exciting to see new profilers from big players!


Great question! Perforator indeed looks similar to Pyroscope. However, we think that the closest existing solutions are https://parca.dev, closed-source Google Wide Profiling, and, speaking of the agent, the beautiful OpenTelemetry eBPF profiler. The main technical differences with Pyroscope we see are:

- Pyroscope's Java support is superior as of now because Pyroscope offloads it to the amazing async-profiler.

- Pyroscope expects native binaries to be compiled with frame pointers: https://grafana.com/docs/pyroscope/latest/configure-client/g.... This is often not the case, and that's the problem we've tried to solve with Perforator. Perforator uses .eh_frame, which is nearly universal and does not impose additional requirements on compiled binaries.

- Pyroscope symbolizes using symtab: https://grafana.com/docs/pyroscope/latest/configure-client/g.... We use DWARF/GSYM to get as correct and verbose stacks as possible (we benchmark our stacks against stacks from gdb).

- Pyroscope symbolizes profiles on the agent, while Perforator symbolizes profiles offline, greatly reducing symbolization costs and the agent's overhead. It seems Pyroscope is heading toward the same architecture we use: https://github.com/grafana/pyroscope/pull/3799.

- Perforator can be (and should be!) run as a standalone replacement for perf record.

- Perforator supports sPGO profiles.
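The .eh_frame point above can be sketched as a toy model. Real .eh_frame stores CFI rules per PC range; here each range simply maps to a fixed frame size, which is enough to recover caller return addresses without frame pointers. Purely illustrative, not how Perforator's eBPF unwinder is actually coded.

```python
# Toy table-driven unwinder: (pc_lo, pc_hi, frame_size_in_words).
# Real .eh_frame encodes richer CFI rules; this is a simplification.
UNWIND_TABLE = [
    (0x1000, 0x1100, 2),  # main: 2-word frame
    (0x1100, 0x1200, 3),  # foo:  3-word frame
    (0x1200, 0x1300, 1),  # bar:  1-word frame
]

def frame_size(pc):
    for lo, hi, size in UNWIND_TABLE:
        if lo <= pc < hi:
            return size
    return None

def unwind(pc, sp, stack):
    """Walk a fake stack: each frame's last word is the return address."""
    trace = [pc]
    while (size := frame_size(pc)) is not None:
        ret_slot = sp + size - 1
        if ret_slot >= len(stack):
            break
        pc = stack[ret_slot]  # saved return address
        sp = ret_slot + 1     # caller's stack pointer
        trace.append(pc)
    return trace

# bar (pc=0x1210) called by foo (ret 0x1150) called by main (ret 0x1050).
stack = [0x1150, 0xdead, 0xbeef, 0x1050]
print([hex(p) for p in unwind(0x1210, 0, stack)])
```

The point is that the unwind table travels with the binary, so no recompilation with frame pointers is needed.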

In summary, we try to implement native profiling almost perfectly. It's worth noting that Pyroscope is a mature, well-established product that integrates excellently with the Grafana ecosystem. We have just focused on different things: our focus has been on optimizing native code profiling and making it as accurate and low-overhead as possible.
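The agent/offline split mentioned above can be sketched like this: the agent only records raw instruction addresses plus the binary's build ID, and an offline symbolizer later resolves them against debug info fetched by build ID. All names here are hypothetical, not Perforator's real API.

```python
# Sketch of agent-side vs. offline symbolization. The agent ships
# cheap raw samples; name resolution happens later, off-host.
# Names and layout are hypothetical, for illustration only.

# What the agent ships: addresses + build ID, no debug info needed.
raw_samples = [
    {"build_id": "abc123", "addrs": [0x401130, 0x401000], "count": 42},
]

# Offline side: per-build-id symbol ranges (e.g. from DWARF/GSYM).
symbol_tables = {
    "abc123": [(0x401000, 0x401100, "main"), (0x401100, 0x401200, "foo")],
}

def symbolize(build_id, addr):
    for lo, hi, name in symbol_tables.get(build_id, []):
        if lo <= addr < hi:
            return name
    return f"0x{addr:x}"  # unresolved frames stay as raw addresses

def symbolize_samples(samples):
    return [
        ([symbolize(s["build_id"], a) for a in s["addrs"]], s["count"])
        for s in samples
    ]

print(symbolize_samples(raw_samples))  # [(['foo', 'main'], 42)]
```

Moving this work off the host is what keeps the agent's CPU footprint small.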


A short discussion can be found here: https://news.ycombinator.com/item?id=42888185


Yes. Although we are studying CSSPGO, which uses a mixed (LBR + software-sampled stacks) approach.


I'm familiar with the paper, but it doesn't improve the situation in terms of LBR availability on cloud providers, does it?


Yes, existing limitations apply. Without hardware LBR support, we cannot provide sPGO profiles. However, the basic profiling should work fine.


Blog is packed with information, thanks!

Isn't it the case that from stack traces alone it is essentially impossible to tell that function foo() is burning CPU cycles because it is memory-bound? The real cause could lie somewhere else entirely, not in that particular function — e.g. multiple other threads creating contention on the memory bus.

If so, doesn't this make the profile somewhat an invalid candidate for PGO?


It depends on the event that was sampled to generate the profiles. For example, if you sample instructions by collecting a stack trace every N instructions, you won't actually see foo() burning the CPU. However, if you look at CPU cycles, foo() will be very noticeable. Internally, we use sPGO profiles from sampling CPU cycles, not instructions.
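A toy illustration of why the sampled event matters: two functions retire the same number of instructions, but foo() is memory-bound (high cycles per instruction), so it dominates a cycles profile while looking identical to bar() in an instructions profile. The numbers are made up for illustration.

```python
# name: (instructions retired, cycles per instruction)
funcs = {
    "foo": (1_000, 10.0),  # memory-bound: stalls inflate CPI
    "bar": (1_000, 1.0),   # compute-bound: ~1 instruction per cycle
}

def profile(event):
    """Attribute samples proportionally to the chosen event count."""
    weights = {
        name: insns * (cpi if event == "cycles" else 1)
        for name, (insns, cpi) in funcs.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

print(profile("instructions"))  # foo and bar each get 50%
print(profile("cycles"))        # foo gets ~91%, bar ~9%
```

Sampling cycles therefore surfaces where wall-clock time goes, even when the root cause (memory stalls) is not visible from the stack itself.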


Right, perhaps I was a little too vague. What I was trying to say is that by merely sampling CPU cycles we cannot infer that foo() was burning CPU because it was memory-bound — something that is not an artifact of foo()'s implementation, but rather of application-wide threads that happen to saturate the memory bus.

Or is my doubt incorrect?


22 minutes for a medium-sized repo is probably slow enough to optimize.


However, for these large repositories I'm not sure you'd fit within the effective context window. I know there is an option to limit the tokens, but then that would be your realistic limit.

