
Does the profiler read any of the GPU's performance counters? Would be super cool to have an open source tool that can capture the same data nsight compute does.


This profiler is focused on kernel execution, but we do scrape high-level metrics (https://www.polarsignals.com/blog/posts/2025/06/04/latest-in... which is based on https://github.com/polarsignals/gpu-metrics-agent). Which performance counters in particular were you interested in?


Cache hit rate is probably the most immediately useful. Although given that this is for always-on profiling maybe this project isn't as geared towards optimizing kernels as I originally thought? In theory reading the counters should be low overhead though.
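
To make the "low overhead" idea concrete, here is a minimal Python sketch of how a hit rate could be derived in an always-on setting: snapshot two free-running counters periodically and divide the deltas. The `CounterSnapshot` type and its `hits`/`misses` fields are hypothetical and do not correspond to any real driver or profiler API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CounterSnapshot:
    # Hypothetical monotonically increasing counters; not a real driver API.
    hits: int
    misses: int


def hit_rate(prev: CounterSnapshot, curr: CounterSnapshot) -> Optional[float]:
    """Cache hit rate over the interval between two snapshots."""
    d_hits = curr.hits - prev.hits
    d_misses = curr.misses - prev.misses
    total = d_hits + d_misses
    # A negative delta means a counter was reset or wrapped; zero total
    # means there were no accesses. Either way there is no meaningful rate.
    if d_hits < 0 or d_misses < 0 or total == 0:
        return None
    return d_hits / total
```

The profiler only has to read two integers per interval, so the cost is in how the counters are collected, not in the arithmetic.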


It depends on which counter.

[ All from my experience on home GPUs, and in a lab with two nodes with two 80GB H100s each. Not extensively benchmarked. ]

Events like kernel launches, which this profiler reads right now, add very little overhead (1-2%). Kernel-level metrics like DRAM utilisation, cache hit rate, SM occupancy, etc. usually cost 5-10%. If you want to plot a flame graph at the instruction level (mostly useful for learning purposes), then you go off the rails: I have even seen 25% overhead. And finally, full traces add tons of overhead, but that's pretty much expected, since they produce GBs of profiling data anyway.


Occupancy and RAM utilization are available from static analysis. A sampling profiler would also obviously not be suitable for this always-on profiler case. But reading the counters [0] from the GSP should be cheap.

[0] https://en.wikipedia.org/wiki/Hardware_performance_counter
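
As a rough illustration of the static-analysis point, theoretical occupancy can be estimated from a kernel's launch configuration and resource usage alone. This is a simplified Python sketch: the `SmLimits` defaults are illustrative assumptions (check the limits for your GPU's compute capability), and it ignores register and shared-memory allocation granularity, so it gives an upper bound and says nothing about achieved occupancy at runtime.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass(frozen=True)
class SmLimits:
    # Illustrative per-SM limits (assumed); verify against your GPU's docs.
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 228 * 1024


def theoretical_occupancy(threads_per_block: int,
                          regs_per_thread: int,
                          smem_per_block: int,
                          limits: SmLimits = SmLimits()) -> float:
    """Fraction of the SM's warp slots that resident blocks can fill."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division

    # How many blocks fit on one SM under each resource limit.
    by_warps = limits.max_warps // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
    by_regs = limits.registers // regs_per_block if regs_per_block else limits.max_blocks
    by_smem = limits.shared_mem_bytes // smem_per_block if smem_per_block else limits.max_blocks

    # The tightest limit decides how many blocks are resident.
    blocks = min(limits.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / limits.max_warps
```

Under these assumed limits, a 256-thread block using 32 registers per thread fills every warp slot, while doubling the register count to 64 halves the result.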



