Our current thinking is to focus heavily on automating triage across syslogs and bag/mcap files, since that’s where the hours really get burned, even for experienced folks. For interpretation, we see it more as an assistive layer (e.g., surfacing “likely causes” or linking to past incidents), rather than trying to replace domain expertise.
Do you think there are specific triage workflows where even a small automation (say, correlating error timestamps across syslog and bag files) would save meaningful time?
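To make that concrete, here's the sort of thing we have in mind — a rough sketch that flags mcap messages logged within a couple of seconds of syslog error lines. It assumes the syslog is exported with ISO timestamps (e.g. `journalctl -o short-iso-precise`) and uses the `mcap` Python package; the window size and helper names are just illustrative.

```python
# Sketch: flag mcap messages logged within WINDOW_S seconds of a syslog ERROR line.
# Assumes ISO-timestamped syslog output and the `mcap` Python package.
import re
import sys
from datetime import datetime

from mcap.reader import make_reader

WINDOW_S = 2.0  # how close (seconds) a bag message must be to a syslog error

ISO_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?[+-]\d{2}:?\d{2})")

def syslog_error_times(path):
    """Return UNIX timestamps of syslog lines containing 'error' (case-insensitive)."""
    times = []
    with open(path, errors="replace") as f:
        for line in f:
            if "error" not in line.lower():
                continue
            m = ISO_TS.match(line)
            if m:
                ts = m.group(1)
                if ts[-3] != ":":                     # "+0200" -> "+02:00" for fromisoformat
                    ts = ts[:-2] + ":" + ts[-2:]
                times.append(datetime.fromisoformat(ts).timestamp())
    return times

def correlate(mcap_path, error_times):
    """Print mcap messages whose log time falls near any syslog error."""
    with open(mcap_path, "rb") as f:
        reader = make_reader(f)
        for schema, channel, message in reader.iter_messages():
            t = message.log_time / 1e9  # log_time is nanoseconds since epoch
            if any(abs(t - e) <= WINDOW_S for e in error_times):
                print(f"{t:.3f}  {channel.topic}")

if __name__ == "__main__":
    correlate(sys.argv[2], syslog_error_times(sys.argv[1]))
```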
One thing that comes to mind is checking timestamp consistency across sensors and other topics. Two cases stand out (I've sketched a rough check after the list):
* I was setting up an Ouster lidar to use GPS time. I don't remember the details now, but it was reporting the time ~32 seconds in the past (probably some leap-second setting?)
* I had a ROS node misbehaving in some weird ways. It turned out there was a service call inserting something into a DB, and for some reason the DB started taking 5+ minutes to complete, which wasn't really appropriate for a blocking call
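For what it's worth, both of those could probably have been surfaced by a fairly dumb per-topic check over the bag: a constant offset between header stamps and receive times (the Ouster/GPS case), and long silences between consecutive messages (the stalled blocking call). A minimal sketch, assuming the `rosbags` Python package and ROS 2 stamp fields (`sec`/`nanosec`); the thresholds are placeholders:

```python
# Rough per-topic consistency check over a bag:
#  - constant offset between header stamps and receive times (e.g. a GPS/UTC or
#    leap-second mixup showing up as a fixed ~32 s skew on one sensor)
#  - long gaps between consecutive messages on a topic (e.g. a stalled node)
# Assumes the `rosbags` Python package and ROS 2 stamp fields; thresholds are illustrative.
import sys
from pathlib import Path
from statistics import median

from rosbags.highlevel import AnyReader

OFFSET_WARN_S = 0.5   # tolerated header-stamp vs receive-time offset
GAP_WARN_S = 5.0      # tolerated silence between consecutive messages on a topic

def check(bag_path):
    offsets, last_seen = {}, {}
    with AnyReader([Path(bag_path)]) as reader:
        for connection, t_recv_ns, raw in reader.messages():
            t_recv = t_recv_ns / 1e9
            topic = connection.topic
            gap = t_recv - last_seen.get(topic, t_recv)
            if gap > GAP_WARN_S:
                print(f"[gap]    {topic}: {gap:.1f} s of silence before t={t_recv:.3f}")
            last_seen[topic] = t_recv
            msg = reader.deserialize(raw, connection.msgtype)
            header = getattr(msg, "header", None)
            if header is None:
                continue
            t_hdr = header.stamp.sec + header.stamp.nanosec * 1e-9
            offsets.setdefault(topic, []).append(t_recv - t_hdr)
    for topic, offs in offsets.items():
        off = median(offs)
        if abs(off) > OFFSET_WARN_S:
            print(f"[offset] {topic}: header stamps trail receive time by {off:+.1f} s")

if __name__ == "__main__":
    check(sys.argv[1])
```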
I think timing is one thing that needs to be consistently done right on every platform. The other issues I came across were very application-specific.