How effective are LLMs at triaging issues? Has anyone found success using them to find the root cause? I've only been able to triage effectively for toy examples.
Wild Moose just made a blog post[0] about this. They found that feeding everything into foundation models wasn't cutting it; you need small fine-tuned models along with deterministic processes to use AI for RCA.
The LogClaw algorithm is the moat here: it flags logs first, and only the flagged ones (usually less than 10% of the logs) are analyzed by the LLM. The LLM is great at finding the root cause if the logs are clear and detailed, so it depends heavily on the quality of your logs. If your logs are rich with info, it will have much better insight into what happened.
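The two-stage shape described above can be sketched roughly like this. All names here (`flag_logs`, the keyword filter) are hypothetical illustrations; the actual LogClaw rules aren't published, so this just shows the deterministic-filter-then-LLM pattern:

```python
# Hypothetical sketch: a cheap deterministic filter flags suspicious
# log lines, and only the flagged subset (typically <10% of lines)
# is packed into a prompt for the LLM to do root-cause analysis.

def flag_logs(lines):
    """Deterministic first pass: keep lines that look anomalous."""
    keywords = ("ERROR", "FATAL", "Traceback", "timeout", "refused")
    return [l for l in lines if any(k in l for k in keywords)]

def build_rca_prompt(flagged):
    """Bundle only the flagged lines into an RCA prompt for the LLM."""
    return (
        "Given these flagged log lines, identify the most likely "
        "root cause:\n" + "\n".join(flagged)
    )

logs = [
    "INFO request served in 12ms",
    "ERROR upstream connection refused: db:5432",
    "INFO request served in 9ms",
]
flagged = flag_logs(logs)       # 1 of 3 lines survives the filter
prompt = build_rca_prompt(flagged)
```

The point of the deterministic stage is that the LLM never sees the noisy 90%+, which keeps the context small and the analysis focused.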
It seems like the load_score serves as a proxy for how much work needs to be done. Is there a real value that could be used instead? The solution requires syncing with all of the GPU nodes anyway.
Location: New York City, NY, USA (NYC)
Remote: Yes
Willing to relocate: No
Résumé: https://www.linkedin.com/in/henry-zhu-347233121/
Email: find email on my blog OR reach out on linkedin
LinkedIn: https://www.linkedin.com/in/henry-zhu-347233121/
Github: https://github.com/maknee
- Analyzed DeepSeek's distributed filesystem from the ground up in a multipart series
- Built a 100k+ (1TB+) RL dataset for a popular online MOBA game by reverse engineering data format (~50k -> ~100k downloads monthly on huggingface)
- Optimizing finetuning/RL on multi-node GPUs
- Built a framework that intercepts OpenGL pipelines and re‑renders them with NVIDIA ray tracing
I'm looking to work on optimization and performance analysis of systems. This includes: building profiling tools, building and running large-scale distributed systems (training, inference, vectordb, db, compilers, etc...), doing napkin math to identify bottlenecks, and writing about the work.
interesting results. why does reload/cross-tile have worse results? would be nice to see some examples of failed results (how close did it get to solving?)
We have an example of a failed cross-tile result in the article - the models seem to be much better at detecting whether something is in an image vs. identifying the boundaries of those items. This probably has to do with how they're trained - if you train on description/image pairs, I'm not sure how well that teaches boundaries.
Reloads are challenging because of how the agent-action loop works. But the models were pretty good at identifying when a tile contained an item.
I'm also curious what the success rates are for humans. Personally I find those two the most bothersome as well. Cross-tile because it's not always clear which parts of the object count and reload because it's so damn slow.
Great to see another distributed file system open sourced! It has some interesting design decisions.
Have a couple of questions:
- How do you go about benchmarking throughput / latency of such a system? Curious if it's different compared to how other distributed filesystems benchmark their systems.
- Is network or storage the bottleneck for nodes (at least for throughput)?
  - From my observations of RDMA-based distributed filesystems, the network seems to be the bottleneck.
- How does the system handle random vs. sequential reads and writes? A lot of systems struggle to scale writes. Does this matter for the workload TernFS is designed for?
- Very interesting to go down the path of writing a kernel module instead of using FUSE or writing a native client in userspace (referring to 3FS [1])
- Any crashes in production? And how do you go about tracking them down?
- What's the difference in performance between using the kernel module versus using FUSE?