Great question! Yes, we're actively building a golden test set of debugging scenarios with known root causes and failure patterns, so we can benchmark and track agent performance with every release. It's taking some work to get right, but we're on it, and contributions are very welcome!
In the meantime, we lean on explainability: every agent output is grounded in the original logs, traces, and metadata, with inline references. So if the output is off, users can easily verify and debug it, and either trust or challenge the agent's reasoning by reviewing the linked evidence.
Good question! Your setup already covers a lot — but TraceRoot tries to go a bit further in a few areas:
In TraceRoot, we organize all logs, metrics, etc. around traces and build an execution tree. This structured view makes it much easier for our agent to reason through large amounts of telemetry data using context-aware optimizations. (We plan to support Slack and Notion integrations as well.)
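To make that concrete, here's a minimal sketch of the idea in Python. The field names (span_id, parent_span_id, etc.) are illustrative OTel-style keys, not our actual schema:

```python
from dataclasses import dataclass, field

# Sketch only: field names are illustrative, not TraceRoot's actual schema.
@dataclass
class SpanNode:
    span_id: str
    name: str
    logs: list = field(default_factory=list)      # log records attached to this span
    children: list = field(default_factory=list)  # child spans (callees)

def build_execution_tree(spans, logs):
    """Group flat OTel-style span/log dicts into a tree via parent_span_id."""
    nodes = {s["span_id"]: SpanNode(s["span_id"], s["name"]) for s in spans}
    roots = []
    for s in spans:
        parent = s.get("parent_span_id")
        target = nodes[parent].children if parent in nodes else roots
        target.append(nodes[s["span_id"]])
    for log in logs:
        if log.get("span_id") in nodes:  # attach each log to the span that emitted it
            nodes[log["span_id"]].logs.append(log)
    return roots  # one root per top-level request
```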
It’s not a one-off tool. TraceRoot is a live monitoring platform. It continuously watches what’s happening in prod. So when something breaks, the agent already has full team-visible context, not just a scratchpad session spun up in the moment.
Down the line, we're aiming for automatic bug detection and remediation - not just smarter copiloting, but proactive debugging workflows. The system also retains team-level memory of past bugs, fixes, and infra quirks, so the agent gets smarter over time.
We’ve open sourced a lot of the core. Would love to jam on this if you're up for it. Always down to trade ideas or even hack on something together!
When we say we "organize all logs, metrics, and traces", we mean more than just linking them together (which otel already supports). What we’re doing is:
- Context engineering optimization: We leverage the structure among logs, spans, and metadata to filter and group relevant context before passing it to the LLM. In real production issues, it's common to see 10k+ logs, traces, etc. related to a single incident, but most of it is noise. Throwing all of that at agents usually leads to poor performance due to context bloat (see https://arxiv.org/pdf/2307.03172). We're addressing that with structured filtering and summarization; a minimal sketch follows this list, and for more details see https://bit.ly/45Bai1q.
- Human-in-the-Loop UI: For cases where developers want to manually inspect or guide the agent, we provide a UI that makes it easy to zoom in on relevant subtrees, trace paths, or log clusters and directly select spans to be included in the reasoning of agents.
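To give a flavor of the first bullet, here's a toy sketch of what structured filtering before the LLM call can look like, reusing the SpanNode sketch from earlier in the thread. The heuristics and budget here are made up; the real pipeline is in the linked README:

```python
# Toy sketch of the filtering/summarization step; heuristics are placeholders.
def select_context(roots, query, budget_chars=8000):
    picked, summaries = [], []

    def walk(node):
        has_error = any(l.get("level") == "ERROR" for l in node.logs)
        if has_error or query.lower() in node.name.lower():
            picked.append(node)  # keep this span's logs verbatim
        else:
            summaries.append(f"{node.name}: {len(node.logs)} logs, no errors")
        for child in node.children:
            walk(child)

    for root in roots:
        walk(root)

    full = "\n".join(
        f"[{n.name}]\n" + "\n".join(l.get("message", "") for l in n.logs)
        for n in picked
    )
    return (full + "\n--- summarized spans ---\n" + "\n".join(summaries))[:budget_chars]
```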
The goal isn't just unification; it's scalable reasoning over noisy telemetry data, both automated and interactive.
Hope that clears things up a bit! Happy to dive deeper if useful.
It's interesting to wonder whether 80% of the question answering could be achieved with a prompts/otel.md over MCPs connected to Claude Code, letting agentic reasoning do the rest.
Ex:
* When investigating errors, only query for error-level logs
* When investigating performance, only query spans (skip logs unless required) and keep only name and time. Linearize as ... (a tiny sketch of this follows the list).
* When querying both logs & traces, inline logs near relevant trace as part of an llm-friendly stored artifact jobs/abc123/context.txt
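For instance, the performance rule above could bottom out in something as small as this (the dict keys are made-up OTel-ish fields, not a real exporter's output):

```python
# Hypothetical sketch of the "spans only, keep name + time" rule above.
def linearize_spans(spans):
    lines = []
    for s in sorted(spans, key=lambda s: s["start_ns"]):  # order by start time
        dur_ms = (s["end_ns"] - s["start_ns"]) / 1e6
        lines.append(f"{'  ' * s.get('depth', 0)}{s['name']} {dur_ms:.1f}ms")
    return "\n".join(lines)  # compact, LLM-friendly artifact
```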
Are there aspects of the question answering (not ui widgets) you think are too hard there?
Yes, you can connect, for example, Claude Code with MCPs. But this may not work well if, say, a user wants to check the latency of the previous 10 days' error logs on function A. Via MCP, the agent first needs to fetch 10 days of error logs, then somehow extract latencies, correlate them, and apply filters for function A. IMO it will hallucinate a lot when there are too many tools, logs, and traces. On the TraceRoot platform, by contrast, we "mix" all the necessary data up front and apply filters on structured data based on the user's query, which is more accurate, straightforward, and efficient. Here is the README of the general design: https://github.com/traceroot-ai/traceroot/tree/main/rest/age...
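To make the contrast concrete, here's a toy sketch (column names are hypothetical, not our actual schema) of what that question looks like once logs and spans are pre-joined into one structured table: a single deterministic filter rather than a chain of agent tool calls.

```python
import pandas as pd

# Toy illustration: "latency of errors on function A over 10 days" as one filter.
# Column names are hypothetical, not TraceRoot's actual schema.
def error_latency(df: pd.DataFrame, function: str, days: int = 10) -> pd.Series:
    cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=days)
    hits = df[
        (df["function"] == function)
        & (df["log_level"] == "ERROR")
        & (df["timestamp"] >= cutoff)
    ]
    return hits["latency_ms"].describe()  # count, mean, quartiles, max

# e.g. error_latency(telemetry_df, "function_A")
```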
- We don’t just stream raw logs/traces into an LLM; we build execution trees and correlate data across services and threads. That gives our agent causal context, not just pattern matching.
- It’s designed to debug real issues in production, where things are messy, not just dev or staging.
- We are aiming for automatic bug detection and remediation soon, not just copiloting, but a debugging agent that can spot regressions and trigger fixes proactively.
- We are working on persisting historical incidents, fixes, and infra quirks, so the agent improves with each investigation and doesn’t start from scratch every time.
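On that last point, here's a toy sketch of the shape of that memory. The fingerprinting and storage layout are placeholders, not our actual design:

```python
import hashlib, json, pathlib

# Very rough sketch of the team-memory idea; layout and names are placeholders.
MEMORY = pathlib.Path("incident_memory.jsonl")

def fingerprint(error_message: str) -> str:
    # Cheap stand-in for real incident clustering/deduplication.
    return hashlib.sha256(error_message.encode()).hexdigest()[:12]

def remember(error_message: str, root_cause: str, fix: str) -> None:
    rec = {"fp": fingerprint(error_message), "root_cause": root_cause, "fix": fix}
    with MEMORY.open("a") as f:
        f.write(json.dumps(rec) + "\n")

def recall(error_message: str) -> list:
    """Past incidents with the same fingerprint, to seed the agent's context."""
    fp = fingerprint(error_message)
    if not MEMORY.exists():
        return []
    with MEMORY.open() as f:
        return [r for r in map(json.loads, f) if r["fp"] == fp]
```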
Happy to dive deeper! Let me know if you have more questions.
We provide an easy-to-use alternative to Sentry (which is quite complex to use) by connecting code context to the corresponding logs and traces. Also, directly wiring MCP into LLMs can lead to hallucination when there are too many tool candidates or too many logs (which is very common), so we need optimizations to improve efficiency and reduce the context fed into the LLMs. An example is shown in this README: https://github.com/traceroot-ai/traceroot/tree/main/rest/age... TraceRoot also has a Cursor-like UI that better involves a human in the loop, which is crucial for minimizing context length and which other platforms such as Sentry don't have.
Yep, you're spot on - and we're hearing this loud and clear across the thread. Model abstraction is on the roadmap, and we're already working on making BYOM smoother.
Thanks for the feedback! Totally hear you on the tight OpenAI coupling - we're aware and already working to make BYOM easier. Just to echo what Zecheng said earlier: broader model flexibility is definitely on the roadmap.
Appreciate you calling it out — helps us stay honest about the gaps.
I'm interested in trying AgentMail for one of our internal agents, but I noticed you're still on a waitlist and the "contact us" and "contact sales" buttons don’t seem to be clickable.
Just wanted to flag in case it's a bug. Excited to try it once access opens up.
Hi xinweihe, thanks for catching that. We are onboarding the waitlist with a very fast turnaround. In the meantime, feel free to email me at haakam[at]agentmail[dot]cc
I’m curious why you’d make your email hard to copy-paste when, in this day and age, it won’t matter. Any basic AI agent can add your email to a spam list. Does it really help anymore?
Super cool project — love the direction you're taking with simplifying MCP integration and tooling! The search layer between agent and servers is a clever abstraction to reduce cognitive + compute load. Also appreciate the IKEA curtain flex
Would love to see how this evolves toward more dynamic infra setups. Keep it up!