How effective are LLMs at triaging issues? Has anyone found success using them to find the root cause? I've only been able to triage effectively for toy examples.
Wild Moose just made a blog post[0] about this. They found that feeding everything into foundation models wasn't cutting it; you need small fine-tuned models along with deterministic processes to use AI for RCA.
The LogClaw algorithm is the moat here: it flags logs first, and only the flagged ones (usually less than 10% of the logs) are analyzed by the LLM. The LLM is great at finding the root cause if the logs are clear and detailed, so it depends heavily on the quality of your logs. If your logs are rich with info, it will have much better insight into what happened.
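The two-stage shape described above can be sketched roughly like this. All names here (`flag_logs`, the keyword filter) are hypothetical illustrations; the actual LogClaw rules aren't published, so this just shows the deterministic-filter-then-LLM pattern:

```python
# Hypothetical sketch: a cheap deterministic filter flags suspicious
# log lines, and only the flagged subset (typically <10% of lines)
# is packed into a prompt for the LLM to do root-cause analysis.

def flag_logs(lines):
    """Deterministic first pass: keep lines that look anomalous."""
    keywords = ("ERROR", "FATAL", "Traceback", "timeout", "refused")
    return [l for l in lines if any(k in l for k in keywords)]

def build_rca_prompt(flagged):
    """Bundle only the flagged lines into an RCA prompt for the LLM."""
    return (
        "Given these flagged log lines, identify the most likely "
        "root cause:\n" + "\n".join(flagged)
    )

logs = [
    "INFO request served in 12ms",
    "ERROR upstream connection refused: db:5432",
    "INFO request served in 9ms",
]
flagged = flag_logs(logs)       # 1 of 3 lines survives the filter
prompt = build_rca_prompt(flagged)
```

The point of the deterministic stage is that the LLM never sees the noisy 90%+, which keeps the context small and the analysis focused.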
It seems like the load_score serves as a proxy for how much work needs to be done. Is there a real value that could be used instead? The solution requires syncing with all of the GPU nodes anyway.
Location: New York City, NY, USA (NYC)
Remote: Yes
Willing to relocate: No
Résumé: https://www.linkedin.com/in/henry-zhu-347233121/
Email: find email on my blog OR reach out on linkedin
LinkedIn: https://www.linkedin.com/in/henry-zhu-347233121/
Github: https://github.com/maknee
- Analyzed DeepSeek's distributed filesystem from the ground up in a multipart series
- Built a 100k+ (1TB+) RL dataset for a popular online MOBA game by reverse engineering data format (~50k -> ~100k downloads monthly on huggingface)
- Optimizing finetuning/RL on multi-node GPUs
- Built a framework that intercepts OpenGL pipelines and re‑renders them with NVIDIA ray tracing
I'm looking to work on optimization and performance analysis of systems. This includes: building profiling tools, building and running large-scale distributed systems (training, inference, vectordb, db, compilers, etc...), doing napkin math to identify bottlenecks, and writing about the work.
interesting results. why does reload/cross-tile have worse results? would be nice to see some examples of failed results (how close did it get to solving?)
We have an example of a failed cross-tile result in the article - the models seem to be much better at detecting whether something is in an image vs. identifying the boundaries of those items. This probably has to do with how they're trained - if you train on description/image pairs, I'm not sure how well that teaches boundaries.
Reloads are challenging because of how the agent-action loop works. But the models were pretty good at identifying when a tile contained an item.
I'm also curious what the success rates are for humans. Personally I find those two the most bothersome as well. Cross-tile because it's not always clear which parts of the object count and reload because it's so damn slow.
Great to see another distributed file system open sourced! It has some interesting design decisions.
Have a couple of questions:
- How do you go about benchmarking throughput / latency of such a system? Curious if it's different compared to how other distributed filesystems benchmark their systems.
- Is network or storage the bottleneck for nodes (at least for throughput)?
  - From my observations of RDMA-based distributed filesystems, the network seems to be the bottleneck.
- How does the system handle random vs. sequential reads and writes? A lot of systems struggle to scale writes. Does this matter for the workload TernFS is designed for?
- Very interesting to go down the path of writing a kernel module instead of using FUSE or writing a native client in userspace (referring to 3FS [1])
- Any crashes in production? And how do you go about tracking them down?
- What's the difference in performance between using the kernel module versus using FUSE?