LLM Agent Trajectory Analysis

Unofficial paper reading:
A Survey for LLM Agent Trajectory Analysis: From Failure Attribution to Enhancement

codgician

2026-04-27

From stack traces to trajectories

Much traditional debugging is code/config/runtime-location oriented; agent debugging is often outcome-to-trajectory oriented.

Traditional software

  • Evaluation signal: a test failure, exception, crash, or violated specification.

  • Debugging object: source locations, stack traces, breakpoints, and conventional logs.

  • Repair target: code paths, data structures, configuration, or dependency behavior.

LLM agent systems

  • Evaluation signal: a task mismatch, unsafe action, inefficient run, or unrecoverable state.

  • Debugging object: trajectories with prompts, messages, tool calls, observations, handoffs, and state.

  • Repair target: policy, prompt/context, tool interface, workflow graph, verifier, memory, or supervisor.

Final-output scoring is not enough for agents; rich traces become the main evidence for attribution, repair, and regression testing.

Why ordinary evaluation is not enough

Final-output evaluation compresses a run to one bit and hides where it went wrong:

  • did planning decompose the task?
  • was the prompt + context constructed right?
  • was the right tool called with the right args?
  • was the observation interpreted correctly?
  • did routing / verification / supervision fire?

Trajectory analysis keeps that evidence and asks a sharper question:

  • which step changed the future of the run?
  • was the wrong answer caused there or downstream?
  • what evidence supports that step as the cause?
  • which control surface, if edited, would flip the outcome?
  • and would the same edit hold across other runs?

Engineering synthesis: instrument the execution path before tuning prompts. Otherwise every failure looks like “the model gave a bad answer.”

Formal execution model

An agent system $M = \{a_1, \ldots, a_N\}$ executes a task $\tau$ in discrete steps. At step $t$: select an agent $a(t) = g(h_{t-1})$ → form the input $x_t = \phi(h_{t-1})$ → run the agent $y_t = \pi_{a(t)}(x_t)$ → update the context $h_t = u(h_{t-1}, a(t), x_t, y_t)$ → emit the observable $o_t$.

Four control surfaces appear in this loop: selection $g$, input formation $\phi$, agent policy $\pi_a$, and context update $u$. Repair edits one of these; the sketch below makes the loop concrete.
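A minimal Python sketch of that step cycle, assuming hypothetical runtime hooks (select_agent, form_input, policies, update_context, emit, done) that are not defined by the survey; each hook corresponds to one of the control surfaces above.

```python
# Minimal sketch of the formal execution loop. All hook names are illustrative.
def run_episode(h0, select_agent, form_input, policies, update_context, emit,
                max_steps=50, done=lambda h: False):
    """One trajectory: repeatedly select, form input, act, update context, emit."""
    h = h0
    for t in range(max_steps):
        agent_id = select_agent(h)                     # a(t) = g(h_{t-1})
        x_t = form_input(h)                            # x_t = phi(h_{t-1})
        y_t = policies[agent_id](x_t)                  # y_t = pi_{a(t)}(x_t)
        h = update_context(h, agent_id, x_t, y_t)      # h_t = u(h_{t-1}, a(t), x_t, y_t)
        emit({"step": t, "agent": agent_id, "input": x_t, "output": y_t})  # observable o_t
        if done(h):
            break
    return h
```

Repair, in these terms, means replacing one of the callables, not only rewording the prompt text that one of them happens to build.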

Observed trace vs omitted state

The survey contrasts partial-observability traces (outputs only) with full-observability traces that include inputs, prompts, and environment state.

Diagnosis can only use what was preserved at run time. Whatever the trace omits becomes a guess that taxonomy, attribution, and repair will inherit.

Trace quality is an engineering choice that bounds every later layer: define what evidence you need for taxonomy and attribution before you decide what the trace must capture.

Engineering reading of the five dimensions

Failure Taxonomy

  • What it is: A shared map of what kind of failure a trajectory exhibits.
  • Why it matters: The category chooses the diagnostic search space (plan, memory, handoff, verifier, environment, runtime).
  • What it enables next: Targeted attribution and repair instead of one-size-fits-all prompt tweaks.

What a taxonomy must make explicit

For a failed trajectory $T$, a taxonomy should assign $\tau(T, e) \rightarrow (\text{view}, \text{failure type}, \text{evidence span}, \text{repair hint})$, where $e$ is the trace evidence that supports the class.

  • View: Which lens the failure is read through (phase, capability module, system/interaction, domain).
  • Failure type: A short, repair-oriented label that distinguishes this failure from other categories.
  • Evidence span: The trace segment that justifies the label (plan step, memory update, handoff, verifier event, environment effect).
  • Repair hint: What a developer should look at next: prompt, context update, workflow graph, verifier, tool, or supervisor.

The important choice is not the label name. It is what evidence the label asks a developer to inspect and what repair it makes plausible.
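A minimal data sketch of that tuple in Python; the field names mirror the list above, while the class name and example values are purely illustrative.

```python
# Hypothetical record for one taxonomy assignment tau(T, e).
from dataclasses import dataclass

@dataclass
class FailureLabel:
    view: str                        # phase, capability module, system/interaction, or domain
    failure_type: str                # short, repair-oriented label
    evidence_span: tuple[int, int]   # trace step indices that justify the label
    repair_hint: str                 # prompt, context update, workflow graph, verifier, tool, supervisor

label = FailureLabel(
    view="system/interaction",
    failure_type="handoff_dropped_constraint",
    evidence_span=(12, 14),
    repair_hint="add a verifier at the planner-to-executor handoff",
)
```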

Four ways to choose the diagnostic search space

Four complementary perspectives the survey identifies for organizing failure taxonomies.

Highlight papers

Taxonomy becomes a repair decision.

01 Lu: phase boundaries

Cut the trace into planning / execution / response and label faults per phase.

06 MAST: MAS organization

Sort failures by specification, inter-agent alignment, or verification gaps.

04 AgentRx: unrecoverable labels

Tie labels to violated constraints at the earliest unrecoverable step.

07 Aegis-Song: environment-shaped failures

Classify exploration, exploitation, and resource-exhaustion failures of agent–environment interaction.

01 Lu — phase boundaries

  • Claim: Align labels with the execution phase where evidence first appears: planning, execution, or response generation.
  • Trace lens: The unit of analysis is a phase-delimited run log, not a single final answer.
  • Developer takeaway: Add phase boundaries to traces before building dashboards or evaluators.

06 MAST — multi-agent organization

  • Claim: Shift taxonomy from a single-agent timeline to the organizational structure of multi-agent systems.
  • Trace lens: Evidence is not only what one agent did, but whether role specification, inter-agent alignment, or task verification failed.
  • Developer takeaway: If the trace has handoffs and role contracts, classify coordination before changing prompts; the repair may be topology or verifier design.

04 AgentRx — unrecoverable labels

  • Claim: Make taxonomy repair-oriented by tying labels to evidence-backed constraints and the first unrecoverable critical failure.
  • Trace lens: The trace is checked step by step against policies, tool schemas, prefix constraints, and task-specific requirements.
  • Developer takeaway: Convert informal “agent rules” into checkable constraints; labels become useful when they explain why no later step could recover.

AgentRx — Stage 1: constraint synthesis

Turn vague “agent rules” into checkable constraints anchored at each step; a minimal sketch follows the list.

  • Global constraints built once from the tool schema and domain policy — schema-valid invocation, declared policy compliance.
  • Dynamic constraints built per step from the task instruction and observed prefix — consistency with the latest tool output, prefix-implied obligations.
  • Guarded evaluation — each constraint runs only when its precondition fires; checks are programmatic when possible, semantic (LLM-judged) otherwise.
  • Output: a step-keyed map from constraint to satisfied / violated / not-applicable.
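A sketch of guarded evaluation under the assumptions above; Constraint, Verdict, evaluate_step, and the example check are hypothetical names, and the semantic (LLM-judged) checks are reduced here to plain predicates.

```python
# Hypothetical sketch of guarded, step-keyed constraint evaluation.
from dataclasses import dataclass
from typing import Callable, Literal

Verdict = Literal["satisfied", "violated", "not_applicable"]

@dataclass
class Constraint:
    name: str
    precondition: Callable[[dict], bool]   # guard: does this check apply at this step?
    check: Callable[[dict], bool]          # programmatic or LLM-judged predicate

def evaluate_step(step: dict, constraints: list[Constraint]) -> dict[str, Verdict]:
    """Return a constraint -> verdict map for one trajectory step."""
    verdicts: dict[str, Verdict] = {}
    for c in constraints:
        if not c.precondition(step):       # guard did not fire
            verdicts[c.name] = "not_applicable"
        elif c.check(step):
            verdicts[c.name] = "satisfied"
        else:
            verdicts[c.name] = "violated"
    return verdicts

# Example global constraint: tool-call arguments must stay within the declared schema.
schema_valid = Constraint(
    name="schema_valid_invocation",
    precondition=lambda s: s.get("action_type") == "tool_call",
    check=lambda s: set(s["args"]).issubset(set(s["tool_schema"]["params"])),
)
```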

AgentRx — Stage 2: validation log

Compress violations so the judge reads evidence, not the whole trace; a short sketch follows the list.

  • Record only violations, not the whole trace — keeps the log compact.
  • Attach supporting evidence to every violation — the tool output, prefix span, or policy clause that triggered it.
  • Step-keyed and auditable — the judge can trace a final failure backward through dependent violations.
  • Output: a validation log of step, broken constraint, and supporting evidence per violation.
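A short sketch of that compression; build_validation_log and the evidence lookup are assumptions about how steps are stored, and evaluate_step is any guarded evaluator such as the Stage-1 sketch above.

```python
# Hypothetical sketch: keep only violations, each keyed by step and paired with evidence.
def build_validation_log(trajectory, constraints, evaluate_step):
    """trajectory: list of step dicts; returns a compact, step-keyed, auditable log."""
    log = []
    for i, step in enumerate(trajectory):
        for name, verdict in evaluate_step(step, constraints).items():
            if verdict == "violated":
                log.append({
                    "step": i,
                    "constraint": name,
                    "evidence": step.get("observation") or step.get("result"),
                })
    return log
```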

AgentRx — Stage 3: root-cause judge

Pick the first step from which the agent does not recover, and label its cause; see the sketch after this list.

  • Find the first unrecoverable step — not the first error; the first violation that explains terminal failure.
  • Plan-axis labels (plan: did the agent pursue the right intent?) — instruction adherence, intent–plan mismatch, under-specified intent, unsupported intent.
  • Grounding-axis labels (grounding: did evidence stay faithful?) — invented facts, tool-output misread, handoff failure.
  • Execution-axis labels (execution: did action complete?) — invalid invocation, guardrails triggered, system failure.
  • Output: critical failure step, failure category, and a short rationale.
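One way to sketch the judging step is as prompt construction over the validation log; CATEGORIES mirrors the axis labels above, while build_judge_prompt and its wording are assumptions, not AgentRx's actual interface.

```python
# Hypothetical sketch: the judge reads the compact validation log and names the
# first unrecoverable step, a category, and a short rationale.
CATEGORIES = [
    # plan axis
    "instruction_adherence", "intent_plan_mismatch", "underspecified_intent", "unsupported_intent",
    # grounding axis
    "invented_facts", "tool_output_misread", "handoff_failure",
    # execution axis
    "invalid_invocation", "guardrails_triggered", "system_failure",
]

def build_judge_prompt(task: str, validation_log: list[dict]) -> str:
    lines = [f"Task: {task}", "Violations (step, constraint, evidence):"]
    for v in validation_log:
        lines.append(f"- step {v['step']}: {v['constraint']} | evidence: {v['evidence']}")
    lines.append(
        "Identify the FIRST step from which no later step could recover, "
        f"choose one category from {CATEGORIES}, and give a one-sentence rationale."
    )
    return "\n".join(lines)
```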

07 Aegis-Song — environment-shaped failures

  • Claim: Classify failures by agent–environment interaction, not by chronology or capability module.
  • Trace lens: Group failures into exploration (incomplete information gathering), exploitation (mis-processing of gathered information), and resource exhaustion (turn or token budget).
  • Developer takeaway: When the symptom is “the agent ran out of budget” or “explored too narrowly,” the right repair surface is the environment / tool interface, not the prompt.

Failure Attribution

  • What it is: Identify who caused the failure, when it became unrecoverable, and why the symptom appeared there.
  • Why it matters: Repair without attribution is guesswork; attribution narrows the component to change.
  • What it enables next: A concrete repair target: policy, prompt, context update, handoff, verifier, tool, or runtime controller.

When failure becomes inevitable

  • Let $\Omega(T) \in \{0,1\}$ be the task outcome of trajectory $T$.
  • A step $t$ is the failure boundary if every feasible continuation from the prefix $T_{\leq t}$ fails (the survey’s “step at which failure becomes inevitable”).
  • Target: Attribution looks for $t^\ast$, the earliest such step, typically far before the visible wrong answer. In practice, methods approximate $t^\ast$ with labels; a replay-based approximation is sketched below.
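A rough empirical approximation of $t^\ast$, assuming replay hooks (rollout_from, outcome) that your runtime would have to provide; sampled continuations can only estimate inevitability, not prove it.

```python
# Hypothetical sketch: estimate the failure boundary by replaying a few
# continuations from each prefix and finding the earliest prefix that never recovers.
def estimate_failure_boundary(trajectory, rollout_from, outcome, samples=3):
    """Return the earliest step index whose prefix fails in every sampled replay."""
    for t in range(len(trajectory)):
        prefix = trajectory[: t + 1]
        # If no sampled continuation from this prefix succeeds, failure is
        # (empirically) inevitable from step t onward.
        if all(outcome(rollout_from(prefix)) == 0 for _ in range(samples)):
            return t
    return None  # no boundary found within the sampled replays
```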

Four ways to justify blame

Four attribution paradigms the survey identifies, ordered by analytical depth.

Highlight papers

Attribution targets and evidence.

13 Who&When: label target

Defines culprit-agent and decisive-step labels for benchmarkable attribution.

18 CHIEF: causal structure

Converts flat logs into hierarchical causal graphs and back-traces dependencies.

22 DoVer: intervention evidence

Edits the orchestrator message or plan and replays to validate the hypothesis.

13 Who&When — attribution view

  • Claim: Turn failure attribution into a labeled target: identify the responsible agent and the earliest decisive error step.
  • Trace lens: The required object is an indexed multi-agent log with agent names, step numbers, task context, and outcome evidence.
  • Developer takeaway: Store stable agent and step identifiers; without them, “who failed” and “when it became unrecoverable” cannot be benchmarked or audited.

18 CHIEF — causal structure

  • Claim: Deepen attribution by converting flat logs into hierarchical causal graphs, then backtracking through dependencies and counterfactual screens.
  • Trace lens: The useful record contains subtasks, agents, structured step records, handoffs, data references, loops, and tool inputs/outputs.
  • Developer takeaway: Preserve edges, not just events. Propagation paths explain why the visible bad step may be a downstream symptom.

CHIEF — Stage 1: causal graph construction

Make the flat trace structurally readable; a minimal graph sketch follows the list.

  • Per step → OTAR. Extend the prior Thought / Action / Result with Observation — a slot for what each step received.
  • Trace → subtasks. RAG decomposition + trajectory-aligned reflection.
  • Three typed edges: subtask order, agent collaboration, and step-level data flow (upstream Result → downstream Observation).
  • Output: a hierarchical causal graph with subtask and agent nodes.
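A minimal sketch of such a graph in Python; OTARStep, CausalGraph, and the edge-kind strings are illustrative names chosen to mirror the bullets above, not CHIEF's released code.

```python
# Hypothetical sketch: OTAR step records plus three typed edge kinds.
from dataclasses import dataclass, field

@dataclass
class OTARStep:
    step_id: int
    agent: str
    subtask: str
    observation: str   # what the step received (added to Thought / Action / Result)
    thought: str
    action: str
    result: str

@dataclass
class CausalGraph:
    steps: dict[int, OTARStep] = field(default_factory=dict)
    edges: list[tuple[int, int, str]] = field(default_factory=list)  # (src, dst, kind)

    def add_step(self, step: OTARStep) -> None:
        self.steps[step.step_id] = step

    def link(self, src: int, dst: int, kind: str) -> None:
        # kind is one of "subtask_order", "agent_collaboration", "data_flow"
        self.edges.append((src, dst, kind))

    def upstream(self, step_id: int, kind: str = "data_flow") -> list[int]:
        """Walk data-flow edges backwards, e.g. to find where a corrupted value entered."""
        return [s for (s, d, k) in self.edges if d == step_id and k == kind]
```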

CHIEF — Stage 2: oracle-guided backtracking

Build a per-subtask checklist; walk the graph through it coarse to fine (sketched after the list).

  • Per-subtask oracle. A 4-line LLM-written checklist: Goal (what this phase should achieve), Pre (what must hold before it starts), Evidence (facts/tool returns to verify), Pass/Fail (falsifiable post-hoc).
  • Top-down walk (reverse topological): subtask fails Pass/Fail → agent OTAR violates Pre/Evidence → step breaks the checklist.
  • Prune subgraphs that pass their oracle; drill into the rest.
  • Output: Failure Candidates — narrowed, not yet proven causes.
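A sketch of the coarse-to-fine walk under these assumptions; Oracle's fields mirror the checklist, the lookup hooks (steps_of, subtask_state, step_record) are hypothetical, and the LLM-written checks are reduced to plain predicates.

```python
# Hypothetical sketch of oracle-guided backtracking over the causal graph.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Oracle:
    goal: str
    pre: Callable[[dict], bool]        # must hold before the subtask starts
    evidence: Callable[[dict], bool]   # facts / tool returns to verify
    passed: Callable[[dict], bool]     # falsifiable post-hoc Pass/Fail

def backtrack(subtasks_reverse_topo, oracles, steps_of, subtask_state, step_record):
    """Return failure candidates: (subtask, step_id) pairs whose records break the checklist."""
    candidates = []
    for sub in subtasks_reverse_topo:
        oracle = oracles[sub]
        if oracle.passed(subtask_state(sub)):
            continue                      # prune: this subtask's subgraph passed its oracle
        for step_id in steps_of(sub):     # drill down coarse -> fine
            rec = step_record(step_id)    # the step's OTAR record
            if not oracle.pre(rec) or not oracle.evidence(rec):
                candidates.append((sub, step_id))
    return candidates                     # narrowed, not yet proven causes
```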

CHIEF — Stage 3: counterfactual attribution

Filter candidates along three axes: scope, propagation, persistence.

  • Local (scope: did it start here?) — no upstream cause explains the bad output → origin is the step itself.
  • Planning-control (propagation: control loop?) — planner repeats the same plan after error signals, or executor keeps violating valid replans.
  • Data-flow (propagation: corrupted value?) — walk step-edges back to the earliest step where valid inputs first became wrong.
  • Deviation-aware (persistence: did it stick?) — drop the candidate if a later step re-satisfies the oracle.
  • Output: one tuple (Agent, Step, Reason).

22 DoVer — intervention evidence

  • Claim: Treat attribution as an experimental question: hypothesize a failure point, edit the orchestrator message or plan, and replay from that checkpoint.
  • Trace lens: The trace must be segmentable into trials with preserved context, checkpoints, and milestone evaluations.
  • Developer takeaway: Build replay hooks early; causal confidence improves when suspected failures can be validated, refuted, or marked inconclusive.

DoVer — Stage 1: trial segmentation

Cut the session so each re-plan gets its own attribution unit; a segmentation sketch follows the list.

  • Cut at re-plan steps — a planning/re-planning event marks a new trial boundary.
  • One trial = one plan — the contiguous span from a planning step through everything executed under that plan, until the next re-plan.
  • Prompt-based, not framework-specific — generalises to systems where re-plan markers aren’t explicit.
  • Output: trial segments, each treated as its own attribution candidate.
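A minimal segmentation sketch; in the paper the re-plan boundary is detected with a prompt, so the keyword-style is_replan predicate here is only a stand-in.

```python
# Hypothetical sketch: cut the message stream at planning / re-planning events.
def segment_trials(messages, is_replan=lambda m: m.get("type") == "plan"):
    trials, current = [], []
    for msg in messages:
        if is_replan(msg) and current:
            trials.append(current)   # close the previous trial at the new plan
            current = []
        current.append(msg)
    if current:
        trials.append(current)
    return trials  # each trial: one plan plus everything executed under it
```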

DoVer — Stage 2: hypothesise + intervene

Turn each trial into a testable edit, not a final verdict.

  • Per-trial hypothesis — log-based attribution names a suspected faulty agent, step, and rationale.
  • Treat hypothesis as testable, not authoritative — correctness is deferred to the replay.
  • Concrete edit at the orchestrator level — Modified Instructions to Sub-Agents or Plan Updates (never tool internals).
  • Output: a targeted intervention at the suspected fault point.

DoVer — Stage 3: replay + verify

Replay the edit and label outcomes along success and faithfulness; the verdict rule is sketched after this list.

  • Replay from the checkpoint (setup: same past, edited next step) — preserve all earlier state, then re-run the edited trial.
  • Validated (success: changed; faithfulness: followed edit) — at least 2 of 3 replays now succeed.
  • Partial / refuted (success: partial or unchanged; faithfulness: followed edit) — milestone progress improves, or the failure persists.
  • Inconclusive (faithfulness: edit not carried out) — replay cannot test the hypothesis.
  • Output: one validation label per trial.
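A sketch of that verdict rule, assuming each replay is summarized as a small record (followed_edit, succeeded, milestones_improved); the 2-of-3 threshold follows the bullet above, everything else is illustrative.

```python
# Hypothetical sketch: label the hypothesis from a few replays of the edited trial.
def verify_intervention(replays):
    """replays: list of dicts with 'followed_edit', 'succeeded', 'milestones_improved'."""
    faithful = [r for r in replays if r["followed_edit"]]
    if not faithful:
        return "inconclusive"          # the edit was never actually carried out
    successes = sum(r["succeeded"] for r in faithful)
    if successes >= 2:                 # e.g. at least 2 of 3 faithful replays succeed
        return "validated"
    if any(r["milestones_improved"] for r in faithful):
        return "partial"
    return "refuted"                   # edit was followed but the failure persists
```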

Trajectory Monitoring & Analysis Tools

  • What it is: The evidence and control layer that decides what a developer can see, replay, compare, and intervene on.
  • Why it matters: Attribution and repair can only use what the trace preserves; output-only logs force diagnosis to guess.
  • What it enables next: Capture exactly the trace fields that the taxonomy and attribution objectives demand.

What observability must preserve

From passive monitoring to active debugging

Passive system-level monitoring

  • captures logs, events, metrics, side effects
  • summarizes long traces and surfaces patterns
  • flags anomalies and suspicious trajectories
  • does not change the run while observing

Active interactive debugging

  • inspects and annotates trajectory steps
  • resets, edits, and replays from a checkpoint
  • forks runs to test counterfactual edits
  • steers behavior with operator interventions

Highlight papers

Evidence capture and control.

32 AgentSight: system effects

Correlates LLM intent with kernel-level subprocess, file, and network events.

36 AGDebugger: control primitives

Exposes pause, checkpoint, reset, edit, fork, and compare across the trajectory.

32 AgentSight — system effects

  • Claim: Expand observability below the model log by correlating LLM intent signals with subprocesses, files, network, and kernel-level effects.
  • Trace lens: The trace has two streams — high-level intent and low-level system actions — joined by lineage, timing, and argument matching.
  • Developer takeaway: Monitor what the agent actually did to the system, not only what it said it intended to do.

36 AGDebugger — control primitives

  • Claim: Make trajectory analysis interactive by exposing pause, checkpoint, reset, edit, fork, and comparison operations over multi-agent sessions.
  • Trace lens: The trace is a rewindable state machine: message history plus checkpoints that can be modified and replayed.
  • Developer takeaway: Debugging tools should make counterfactual inspection cheap; a developer should test whether changing one message changes downstream behavior.

AGDebugger — Stage 1: inspect + steer

Expose the live message stream so the operator can steer before failure hardens.

  • Live message viewer — agent-to-agent traffic visible as it happens.
  • Pause / play / step — drive the message queue at any granularity.
  • Send new messages mid-run — broadcast to all agents, or targeted to one.
  • Output: an inspectable, controllable message stream.

AGDebugger — Stage 2: reset + edit

Restore state before testing a counterfactual edit; a checkpoint-and-fork sketch follows the list.

  • Checkpoint per message — agent state saved before each new message via save_state.
  • Edit historical messages inline, then reset to that timestamp — restores the corresponding checkpoint via load_state.
  • Fork the session — the original branch is preserved; the edited path runs as a new session.
  • Output: a forked session with an edit candidate, replayable from the fork point.
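A toy checkpoint-and-fork sketch; save_state / load_state echo the bullets, but SessionDebugger and its methods are illustrative, not AGDebugger's API.

```python
# Hypothetical sketch: checkpoint per message, then fork a branch with one edited message.
import copy

class SessionDebugger:
    def __init__(self):
        self.messages = []          # the live message history
        self.checkpoints = {}       # message index -> state saved before that message

    def record(self, message, agent_state):
        self.checkpoints[len(self.messages)] = copy.deepcopy(agent_state)  # save_state
        self.messages.append(message)

    def fork_with_edit(self, index, edited_message):
        """Return a new branch: same history before `index`, one edited message appended."""
        branch = SessionDebugger()
        branch.messages = copy.deepcopy(self.messages[:index])
        branch.checkpoints = {i: copy.deepcopy(s)
                              for i, s in self.checkpoints.items() if i <= index}
        restored = branch.checkpoints[index]                                # load_state
        branch.messages.append(edited_message)
        return branch, restored     # the original branch (self) is preserved
```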

AGDebugger — Stage 3: overview compare

Compare branches so edits become visible regression tests.

  • Vertical timeline — every message a rectangle, every fork a new column.
  • Forks marked with a horizontal dash; pre-fork shared history is shown at lower opacity.
  • Color toggle — encode message type, sender, or recipient depending on what you’re hunting.
  • Output: a branched comparison view — original run vs each counterfactual run, aligned by step.

System Enhancement & Optimization

  • What it is: Trace-guided editing of the agent system so future runs become more capable, reliable, efficient, or robust.
  • Why it matters: Diagnosis only earns its keep when it leads to a tested system change rather than another postmortem label.
  • What it enables next: Choose the right control surface to edit: policy, selection, context update, input formation, workflow, or supervisor.

What can be optimized

The formal target is not “make the prompt better.” It is choosing the system component whose change should improve the measured objective.

Three places a trace can change the system

Three enhancement families the survey identifies, by where the edit lands.

Highlight papers

Where the repair happens.

24 Maestro: graph-plus-config repair

Jointly searches workflow-graph and configuration edits from evaluator feedback.

31 SupervisorAgent: runtime control

Approves, guides, or corrects at risky interaction boundaries during execution.

07 Aegis-Song: environment optimization

Fixes failures by editing the environment and tool interface, not the agent prompt.

24 Maestro — graph-plus-config repair

  • Claim: Frame enhancement as joint optimization over workflow graph structure and configuration, guided by trace feedback.
  • Trace lens: Failures are evidence about missing computation, routing, validation, state, or tool operations in a typed agent graph.
  • Developer takeaway: Do not tune prompts when the graph lacks the operation needed to recover; add the missing node, edge, validator, or state variable.

31 SupervisorAgent — runtime control

  • Claim: Add a lightweight meta-agent that watches high-risk interactions and intervenes while the trajectory is still recoverable.
  • Trace lens: The supervisor observes agent-agent, agent-tool, and agent-memory events with local and global context summaries.
  • Developer takeaway: Put supervision at interaction boundaries: approve, guide, correct observations, or run verification before errors become irreversible.

07 Aegis-Song — environment optimization

  • Claim: Repair agent failures by editing the environment, not the prompt: enhance observability, offload deterministic computation, and speculate bundled actions.
  • Trace lens: When traces show exploration/exploitation/resource-exhaustion failures, the implicated control surface is the environment + tool interface, not the agent policy $\pi_a$.
  • Developer takeaway: Before tuning prompts, ask: would the agent succeed if the environment exposed more state, did the deterministic work itself, or accepted batched actions?

Datasets & Benchmarks

  • What they are: The field’s measurement and training substrate: what counts as progress, and what diagnostic models learn from.
  • Why they matter: Benchmark labels become incentives; if they reward only agent/step accuracy, repair utility is invisible.
  • What they enable next: Compare attribution and enhancement methods, and run regression suites on real or injected failures.

What benchmark design optimizes for

Open critique: Agent-level and step-level accuracy are useful, but they do not fully measure whether attribution helps produce a safe, durable repair. The survey reports limited step-level attribution on established comparisons and points to benchmark diversity and observability as bottlenecks.

Two ways to build trajectory evaluation data

Real-world failure collections

  • preserve naturally occurring failures from real or realistic systems
  • reflect messy production conditions and repair needs
  • expensive to collect and annotate at scale
  • biased toward what was actually deployed

Synthetic error-injection datasets

  • start from successful trajectories and inject controlled faults (see the sketch after this list)
  • scale to thousands of labelled examples cheaply
  • support training data-hungry attribution models
  • must be checked for realism vs production failures
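A toy fault-injection sketch under these assumptions; the fault set, step format, and label schema are illustrative, and real pipelines (e.g. MAST-style injection) corrupt prompts or responses rather than editing step dictionaries.

```python
# Hypothetical sketch: corrupt one step of a successful trajectory and keep the
# injected step/type as free ground-truth labels for attribution training.
import random

FAULTS = {
    "wrong_tool_args": lambda step: {**step, "args": {}},
    "drop_observation": lambda step: {**step, "observation": ""},
    "hallucinated_result": lambda step: {**step, "result": "[fabricated value]"},
}

def inject_fault(trajectory, rng=random):
    idx = rng.randrange(len(trajectory))
    fault_name, corrupt = rng.choice(list(FAULTS.items()))
    corrupted = list(trajectory)
    corrupted[idx] = corrupt(trajectory[idx])
    # Ground-truth labels come for free: the injected step and fault type.
    return corrupted, {"decisive_step": idx, "fault": fault_name}
```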

Highlight papers

Evaluation roles.

13 Who&When

Canonical agent + decisive-step labels for benchmarkable attribution.

23 TraceElephant

Full traces and reproducible environments for replay-ready evaluation.

04 AgentRx

Repair-utility labels: critical step, category, rationale, evidence.

13 Who&When — benchmark view

  • Claim: As a benchmark, operationalize attribution as two measurable labels: culprit agent and decisive error step.
  • Trace lens: The dataset turns trace reading into comparable evaluation across alternative LLM-judge strategies.
  • Developer takeaway: Use this benchmark lens to test whether your traces expose enough indexing and context for reproducible blame assignment.

23 TraceElephant — benchmark view

  • Claim: Test attribution under full observability, including complete traces and reproducible environments.
  • Trace lens: The benchmark contrasts output-only attribution with full-trace static and dynamic replay/counterfactual probing.
  • Developer takeaway: Evaluate attribution under the observability conditions developers actually have; partial traces can understate achievable diagnosis.

04 AgentRx — benchmark view

  • Claim: As a benchmark, AgentRx packages 115 failed trajectories with three repair-relevant labels: critical step, root-cause category, and supporting evidence.
  • Trace lens: Each failed run is annotated by category, decisive step, and a step-indexed validation log of violated constraints.
  • Developer takeaway: Use this benchmark when the question is “does my attribution method enable a repair?” — not just “does it match the agent/step label?”

Takeaways

  • One closed loop: collect → classify → attribute → repair → evaluate → operate.
  • Still research-stage: SOTA exact step-level attribution ≈ 30% (CHIEF on Who&When); full-observability ceiling ≈ 33% (TraceElephant).
  • For builders: no turnkey tool yet — pick the right axis, expect partial coverage, build the loop incrementally.

Closed-loop agent improvement

Engineering synthesis over the survey: collect enough evidence, constrain diagnosis, approximate the causal boundary, edit the right component, and evaluate whether the edit improves future systems.

Current limitations and open problems

  • Attribution accuracy is not enough — agent-level and step-level metrics do not necessarily measure repair utility.
  • Benchmark observability can be mismatched — partial traces understate what developers actually see in internal debugging.
  • Tools remain fragmented — logging, visualization, replay, anomaly detection, RCA, and repair live in separate systems.
  • Domain structure is underused — coding, GUI, web, and embodied agents expose different trace shapes and failure modes.

Closing thesis

LLM agent trajectory analysis is becoming the engineering layer that connects evaluation, debugging, system optimization, observability, and operations.

The trajectory is the evidence. Taxonomy gives the diagnostic lens. Attribution identifies the causal boundary. Enhancement edits the system. Monitoring and benchmarks make the loop repeatable.

For agent developers, the practical question is no longer “did this run fail?” It is: what evidence proves why it failed, what component should change, and how do we know the repair generalizes?

Appendix

  • Each appendix slide summarizes one referenced paper.
  • The template is: background problem, fundamental idea, developer/evaluation takeaway, and survey areas.
  • Numbering note: The two-digit prefixes used in the deck (e.g. 01 Lu, 13 Who&When) follow our local analysis-note IDs, not the survey’s Table 1 IDs.

01. Exploring Autonomous Agents

Failure Taxonomy

  • Background problem: Success rate hides which agent role failed when the planner, code generator, executor, and final-answer stages hand off work.
  • Fundamental idea: Inspect full run logs from 204 executions, label failures into 19 causes under planning, execution, and response-generation phases.
  • Takeaway: Store per-iteration prompts, code, outputs, and errors, then route repeated failures to replanning, local repair, or early stop.
  • Paper: https://arxiv.org/abs/2508.13143
  • Code: https://github.com/lurf21/AgentEvaluationFramework

02. Where LLM Agents Fail

Failure Taxonomy · Failure Attribution · Datasets & Benchmarks

  • Background problem: Early agent mistakes cascade, but prior failure studies label errors without tracing the root cause or enabling fixes.
  • Fundamental idea: AgentDebug labels each step/module, uses counterfactual tests to find the earliest failure-causing step, then re-rolls out with targeted feedback.
  • Takeaway: Debug failed agents from the first causal step, not every visible mistake; use AgentErrorBench-style annotations to test localization and recovery.
  • Paper: https://arxiv.org/abs/2509.25370
  • Code: https://github.com/ulab-uiuc/AgentDebug

03. TRAIL

Failure Taxonomy · Failure Attribution · Datasets & Benchmarks

  • Background problem: Agent evals can score final answers, but developers need span-level root causes inside huge structured traces.
  • Fundamental idea: Annotate OpenTelemetry traces from GAIA/SWE-Bench with error category, span location, evidence, impact, and quality scores.
  • Takeaway: Use TRAIL to test whether an evaluator debugs real agent runs, not just final answers or synthetic planning cases.
  • Paper: https://arxiv.org/abs/2505.08638
  • Code: https://github.com/patronus-ai/trail-benchmark

04. AgentRx

Failure Taxonomy · Failure Attribution · Datasets & Benchmarks

  • Background problem: Terminal success hides the first unrecovered mistake; debugging needs step-level evidence, not just outcome labels.
  • Fundamental idea: Turn tool schemas, policies, and trajectory prefixes into guarded checks; log violations with evidence for each step.
  • Takeaway: Instrument agents so judges can trace “what constraint broke when” before deciding the unrecoverable root cause.
  • Paper: https://arxiv.org/abs/2602.02475
  • Code: https://github.com/microsoft/AgentRx

05. Lifecycle of Failures

Failure Taxonomy · Failure Attribution · Datasets & Benchmarks

  • Background problem: In platform agent workflows, visible errors often surface far from the causal node after prompts, tools, and control logic interact.
  • Fundamental idea: AgentFail labels 307 Dify/Coze failures by root location, cause level/category, propagation distance, and repair strategy.
  • Takeaway: Debug by proving the earliest decisive node, then apply cause-matched fixes; taxonomy plus location made repairs safer.
  • Paper: https://arxiv.org/abs/2509.23735
  • Code: https://github.com/Jenna-Ma/JaWs-AgentFail

06. MAST

Failure Taxonomy

  • Background problem: MAS benchmark failures hide whether the root cause is system design, inter-agent misalignment, or task verification.
  • Fundamental idea: Derive MAST bottom-up from 150+ failed traces, yielding 14 failure modes across those three axes.
  • Takeaway: Label failed traces first; then fix workflow design, agent information flow, or verification instead of blindly swapping models.
  • Paper: https://arxiv.org/abs/2503.13657
  • Code: https://github.com/multi-agent-systems-failure-taxonomy/MAST

07. Aegis

Failure Taxonomy · System Enhancement & Optimization

  • Background problem: Agents fail differently across DB, filesystem, CRM, and medical environments: missing state, losing state, miscomputing outputs, violating rules, exhausting turns.
  • Fundamental idea: Treat tools as reliability infrastructure: expose lookahead/state, offload sorting/calculation/rule checks, and speculate common follow-up calls.
  • Takeaway: Don’t only tune the agent; redesign tool responses so correct behavior becomes retrieval, validation, or bundled execution.
  • Paper: https://arxiv.org/abs/2508.19504
  • Code: Not released

08. LLMs in Agentic Scenarios

Failure Taxonomy

  • Background problem: Aggregate agent scores hide why tool-using LLMs fail in enterprise-like workflows.
  • Fundamental idea: Manually code 900 KAMI traces across filesystem, text, CSV, and SQL tasks.
  • Takeaway: Require grounding, value verification, distractor control, and missing-entity discipline before trusting autonomous tool outputs.
  • Paper: https://arxiv.org/abs/2512.07497
  • Code: Not found

09. FAMAS

Failure Attribution

  • Background problem: A failed MAS log rarely reveals whether an agent action caused failure or only appears downstream.
  • Fundamental idea: Replay the task, cluster logs into agent-action-state triples, then rank triples with λ-decayed Kulczynski2 plus α/β/γ factors.
  • Takeaway: Use pass/fail replay spectra when failures recur; FAMAS is statistical attribution, not single-log LLM judging.
  • Paper: https://arxiv.org/abs/2509.13782
  • Code: Not found

10. Traceability and Accountability

Failure Attribution · Trajectory Monitoring & Analysis Tools

  • Background problem: Sequential agent pipelines hide where failures begin; final-output scoring cannot separate planner mistakes from executor or critic harm.
  • Fundamental idea: Record P/E/C answers, final answer, repair flags, harm flags, and earliest unrepaired error origin.
  • Takeaway: Design handoffs around computable accountability: each stage should expose whether it repaired, preserved, or damaged the prior state.
  • Paper: https://arxiv.org/abs/2510.07614
  • Code: Not found

11. CORRECT

Failure Attribution · Datasets & Benchmarks

  • Background problem: Multi-agent failures cascade through long logs; developers need the first bad agent-step, not just a failed-run label.
  • Fundamental idea: Distill past annotated failures into cached schemas: signatures, triggering context, propagation patterns, and detection heuristics.
  • Takeaway: Use embedding retrieval over schemas, not raw traces or fine-tuning, to guide an LLM’s step-level failure attribution.
  • Paper: https://arxiv.org/abs/2509.24088
  • Code: Not found

12. SDBL

Failure Attribution

  • Background problem: Exact failure-step attribution in long multi-agent logs overwhelms general LLMs, especially when failures appear late or require MAS-specific expertise.
  • Fundamental idea: First shrink the log to a small suspect scope, then localize; scopes come from stepwise expansion or Overstep/Loop expert heuristics.
  • Takeaway: On Who&When, SDBL raises step accuracy by up to 24.27 percentage points, showing scoped diagnosis beats one-shot attribution.
  • Paper: https://ojs.aaai.org/index.php/AAAI/article/view/40594
  • Code: https://github.com/Wen-qiangLi/SDBL

13. Who&When

Failure Attribution · Datasets & Benchmarks

  • Background problem: Multi-agent failure attribution lacks ground truth: who is the responsible agent and exactly which step is the decisive error?
  • Fundamental idea: Curate 184 annotated tasks from 127 MAS runs over GAIA/AssistantBench; label decisive agent-step pairs with rationale.
  • Takeaway: Test full-context vs incremental judging — agent attribution differs from step attribution; one-pass prompting is not enough.
  • Paper: https://arxiv.org/abs/2505.00212
  • Code: https://github.com/ag2ai/Agents_Failure_Attribution

14. ECHO

Failure Attribution

  • Background problem: Flat logs hide long-range error propagation; local windows miss causally relevant earlier turns.
  • Fundamental idea: Build four positional context layers, run specialist analyst personas, and aggregate agent/step votes with confidence weighting and disagreement checks.
  • Takeaway: Use compressed context plus voting when trace length matters; unlike CHIEF graphs or RAFFLES loops, ECHO is positional, not causal.
  • Paper: https://arxiv.org/abs/2510.04886
  • Code: Not found

15. RAFFLES

Failure Attribution

  • Background problem: Single-pass judges miss decisive faults in long agent traces where early mistakes, symptoms, and recoverable errors blur together.
  • Fundamental idea: A Judge proposes a fault step; specialized Evaluators score fault, primacy, decisiveness, and log consistency, feeding memory into retries.
  • Takeaway: For trace debugging, use confidence-gated verifier loops, not one-shot attribution; stop when evaluators agree or max iterations expire.
  • Paper: https://arxiv.org/abs/2509.06822
  • Code: Not found

16. CDC-MAS

Failure Attribution

  • Background problem: MAS failures often look downstream; log/data-flow analysis blames the final symptom, not the upstream bad decision.
  • Fundamental idea: Reverse data-flow into a performance-causality graph, then use Shapley, ACE, and counterfactual repairs to rank agents and steps.
  • Takeaway: Read for causal MAS debugging: strongest idea is performance causal inversion; evidence improves attribution but remains imperfect on hard traces.
  • Paper: https://arxiv.org/abs/2509.08682
  • Code: Not found

17. A2P

Failure Attribution

  • Background problem: Long agent logs hide the decisive error; “spot the bad step” judging confuses correlation with cause.
  • Fundamental idea: Prompt one judge to abductively infer a cause, propose a minimal fix, then simulate 3-5 counterfactual turns.
  • Takeaway: For trace debugging, test whether a step’s corrected action would change failure into success; step numbers matter.
  • Paper: https://arxiv.org/abs/2509.10401
  • Code: https://github.com/ResearAI/A2P

18. CHIEF

Failure Attribution

  • Background problem: Flat agent logs hide whether failure began in planning, execution, data flow, or later symptom propagation.
  • Fundamental idea: Build subtask/agent/step graphs, synthesize checkable oracles, then backtrack to the first irreversible corrupted state.
  • Takeaway: Debuggable MAS need intermediate invariants, not just final-answer judges or after-the-fact trace summaries.
  • Paper: https://arxiv.org/abs/2602.23701
  • Code: https://anonymous.4open.science/r/CHIEF-86B8

19. AgenTracer

Failure Attribution

  • Background problem: Prompting strong LLMs to blame failed agent traces gives poor step-level attribution, often below 10% on prior benchmarks.
  • Fundamental idea: Build TracerTraj with counterfactual replay and fault injection, then RL-train Qwen3-8B to output agentID | stepID.
  • Takeaway: Reliable agent debugging needs replay-derived root-cause labels; a small trained tracer can beat generic LLM judges.
  • Paper: https://arxiv.org/abs/2509.03312
  • Code: https://github.com/bingreeky/AgenTracer

20. GraphTracer

Failure Attribution

  • Background problem: Temporal trace attribution confuses late failure symptoms with earlier corrupted information sources in multi-agent deep search.
  • Fundamental idea: Build an Information Dependency Graph: nodes are produced information, edges are explicit citation/reliance links across agent turns.
  • Takeaway: Debug agents by tracing provenance paths, not timelines; target high-impact source nodes and dependency conflicts for realistic failure tests.
  • Paper: https://arxiv.org/abs/2510.10581
  • Code: Not found

21. AegisKong

Failure Attribution · Datasets & Benchmarks

  • Background problem: Manual MAS failure labels are tiny, so attribution models lack enough agent/error-mode pairs to learn root-cause patterns.
  • Fundamental idea: Start from successful trajectories, inject MAST-style faults via prompt injection or response corruption, then keep failed runs as labeled counterfactuals.
  • Takeaway: Aegis turns debugging data into controlled perturbation: 9,533 failures train Qwen/DCL/GRPO models and benchmark agent-error attribution.
  • Paper: https://arxiv.org/abs/2509.14295
  • Code: https://kfq20.github.io/AEGIS-Website/

22. DoVer

Failure Attribution · System Enhancement & Optimization

  • Background problem: Single blame steps break down when failed agent runs contain multiple re-plans, branches, and independently repairable mistakes.
  • Fundamental idea: Segment by re-plan trials, edit the suspected message or plan, then replay from that point with prior context preserved.
  • Takeaway: Treat attribution as executable evidence: success validates it, faithful failure refutes it, blocked execution exposes missing agent capabilities.
  • Paper: https://arxiv.org/abs/2512.06749
  • Code: https://aka.ms/DoVer

23. TraceElephant

Failure Attribution · Datasets & Benchmarks

  • Background problem: Output-only attribution hides prompts, task context, tool logs, and environment state, making the decisive failure step ambiguous.
  • Fundamental idea: TraceElephant packages 220 annotated failures as JSON step records plus runnable MAS environments for replay and counterfactual checks.
  • Takeaway: Log agent inputs and tool interactions, not just outputs; full traces lift step attribution to 30.3%, dynamic replay to 33.3%.
  • Paper: https://openreview.net/forum?id=kLLYJ6Bm7n
  • Code: https://github.com/TraceElephant/TraceElephant

24. Maestro

System Enhancement & Optimization

  • Background problem: Prompt tuning cannot fix agents missing tools, state, validators, or routing.
  • Fundamental idea: Alternate config tuning with local graph edits, guided by scores and reflective trace feedback.
  • Takeaway: Treat agent failures as architecture signals: add the missing node, then retune its settings.
  • Paper: https://arxiv.org/abs/2509.04642
  • Code: https://github.com/relai-ai/relai-sdk

25. CE-Graph

System Enhancement & Optimization

  • Background problem: Scalar workflow scores collapse rich failure traces, so global search like MaAS/AFlow misses recurring structural error modes.
  • Fundamental idea: CE-Graph clusters counterexamples by failing node and error semantics, then verifies RevisePrompt, InsertNode, or DeleteNode graph edits.
  • Takeaway: Optimize agent workflows by reducing dense failure modes, not by blindly searching for higher aggregate benchmark scores.
  • Paper: https://arxiv.org/abs/2510.10035
  • Code: Not found

26. ILWS

System Enhancement & Optimization

  • Background problem: RAG is transient and fine-tuning is heavy; agents need durable domain learning without changing model weights.
  • Fundamental idea: After sessions, reflect on traces and ratings to edit instructions, preferences, and tools under gated rollback.
  • Takeaway: Treat system prompts as versioned pseudo-weights: persist only feedback-proven rules, not every retrieved fact.
  • Paper: https://arxiv.org/abs/2509.00251
  • Code: Not found

27. SCOPE

System Enhancement & Optimization

  • Background problem: Agents often see the right context, but static prompts do not teach them how to react to it.
  • Fundamental idea: Synthesize trace-based guidelines, then route them into tactical or persistent system-prompt memory.
  • Takeaway: For agents, evolve per-role prompts online; do not just append failures to chat history.
  • Paper: https://arxiv.org/abs/2512.15374
  • Code: https://github.com/JarvisPei/SCOPE

28. AgentDevel

System Enhancement & Optimization

  • Background problem: Self-improving agents can raise averages while hiding regressions across versions.
  • Fundamental idea: Run traces, label symptoms blindly, script diagnoses, then promote one RC using pass→fail / fail→pass gates.
  • Takeaway: Build agent improvement as CI: auditable diffs, single version line, regression-first release decisions.
  • Paper: https://arxiv.org/abs/2601.04620
  • Code: Not found

29. ReCreate

System Enhancement & Optimization

  • Background problem: Creating domain agents needs evidence richer than final pass/fail scores.
  • Fundamental idea: Inspect trajectories, verifier logs, artifacts, and environments to edit scaffold components.
  • Takeaway: ReCreate outputs generalized prompts, workflows, tools, and memory—not tuned model weights.
  • Paper: https://arxiv.org/abs/2601.11100
  • Code: https://github.com/zz-haooo/ReCreate

30. AgentDiet

System Enhancement & Optimization

  • Background problem: Agent tool traces accumulate useless, repeated, and expired tokens that get resent every later step.
  • Fundamental idea: Use a cheap reflection LLM to rewrite one delayed long step within a small sliding context window.
  • Takeaway: Reduce cost empirically, not by guarantee: protect success with delay, thresholds, structure preservation, and pass-rate/step-count checks.
  • Paper: https://arxiv.org/abs/2509.23586
  • Code: Not found

31. SupervisorAgent

System Enhancement & Optimization

  • Background problem: Multi-agent runs fail or waste tokens through errors, loops, and bloated tool observations during execution.
  • Fundamental idea: Intercept each ActionStep with cheap heuristics, then replace observations, append guidance, or run verification.
  • Takeaway: Runtime supervision works best as gated process control, not always-on monitoring or post-hoc debugging.
  • Paper: https://arxiv.org/abs/2510.26585
  • Code: https://github.com/LINs-lab/SupervisorAgent

32. AgentSight

Trajectory Monitoring & Analysis Tools

  • Background problem: Framework logs miss shell escapes; syscall monitors miss the LLM intent behind file, process, and network effects.
  • Fundamental idea: Attach eBPF uprobes to TLS reads/writes and kernel probes to syscalls, then correlate by lineage, timing, and argument matches.
  • Takeaway: For coding agents, observability must follow descendant processes, not just tool calls, to catch prompt injection and exfiltration.
  • Paper: https://arxiv.org/abs/2508.02736
  • Code: https://github.com/eunomia-bpf/agentsight

33. AgentOps Automation

Trajectory Monitoring & Analysis Tools · System Enhancement & Optimization

  • Background problem: Deployed agents are nondeterministic, stateful systems; traditional logs and dashboards miss planning, memory, tool, and coordination failures.
  • Fundamental idea: Define AgentOps as observe behavior, collect metrics, detect issues, identify root causes, recommend fixes, and automate runtime operations.
  • Takeaway: Treat agent reliability as production operations: instrument trajectories, compare healthy/failing traces, and close safe feedback loops with automated remediation.
  • Paper: https://arxiv.org/abs/2507.11277
  • Code: Not found

34. AgentDiagnose

Trajectory Monitoring & Analysis Tools · Failure Attribution

  • Background problem: Trajectory replays show what happened, but rarely score how agents explore, plan, read observations, or verify progress.
  • Fundamental idea: Score trajectories on five agentic competencies, then expose patterns through CLI output, dashboard plots, word clouds, and navigation timelines.
  • Takeaway: Debugging tools should turn raw trajectories into selectable behavioral signals that support diagnosis and training-data filtering.
  • Paper: https://aclanthology.org/2025.emnlp-demos.15/
  • Code: https://github.com/oootttyyy/AgentDiagnose

35. Agent Trajectory Explorer

Trajectory Monitoring & Analysis Tools

  • Background problem: Raw agent trajectories mix prompts, reasoning, tool calls, and observations, making human oversight difficult.
  • Fundamental idea: Convert JSON traces via formatter plugins into linear TAO turns, with optional raw-context view.
  • Takeaway: Start debugging UIs with TAO step inspection and per-thought/action positive-negative feedback.
  • Paper: https://ojs.aaai.org/index.php/AAAI/article/view/35350
  • Code: Not found

36. AGDebugger

Trajectory Monitoring & Analysis Tools · System Enhancement & Optimization

  • Background problem: Multi-agent failures live inside stateful message queues; log viewers cannot test “what if this message changed?”
  • Fundamental idea: Checkpoint each agent before messages, then restore, edit one message, and fork a replay branch.
  • Takeaway: Best for counterfactual steering; study users mostly added specificity, simplified tasks, or changed plans.
  • Paper: https://arxiv.org/abs/2503.02068
  • Code: https://github.com/microsoft/agdebugger

37. XAgen

Trajectory Monitoring & Analysis Tools · Failure Attribution · System Enhancement & Optimization

  • Background problem: Multi-agent failures are hard for mixed-expertise users to locate, attribute, and correct from raw logs.
  • Fundamental idea: Convert CrewAI logs into a live flowchart; attach human feedback; use an LLM judge to score task outputs.
  • Takeaway: Explanations matter when they identify a failing node and feed directly into prompt/config edits and reruns.
  • Paper: https://arxiv.org/abs/2512.17896
  • Code: Not found

38. DiLLS

Trajectory Monitoring & Analysis Tools

  • Background problem: Chronological multi-agent logs hide plan failures, skipped actions, and stalled progress across verbose agent/tool exchanges.
  • Fundamental idea: Use natural-language probes to organize traces into activity, action, and operation layers for drill-down diagnosis.
  • Takeaway: Debugging agents needs task-structured trace views: plan updates first, action outcomes next, raw logs last.
  • Paper: https://arxiv.org/abs/2602.05446
  • Code: Not found

39. TAR Study of SE Agents

Trajectory Monitoring & Analysis Tools · Failure Taxonomy

  • Background problem: SE agents leave TAR logs, but failures hide in repeated actions, untested fixes, ignored results, and thought-action mismatches.
  • Fundamental idea: Normalize 120 RepairAgent, AutoCodeRover, and OpenHands trajectories; categorize actions, mine 4-grams, and open-code semantic TAR relations.
  • Takeaway: Successful agents balance explore/fix/test; robust agents should flag repetition, fix-without-test, premature termination, and result-insensitive next actions.
  • Paper: https://arxiv.org/abs/2506.18824
  • Code: https://github.com/sola-st/llm-agents-study

40. MAESTRO Evaluation Suite

Datasets & Benchmarks · Trajectory Monitoring & Analysis Tools

  • Background problem: Final-answer scores miss MAS runtime variance, silent failures, and architecture-driven cost.
  • Fundamental idea: Run 12 heterogeneous MAS examples under one config, telemetry, and post-processing interface.
  • Takeaway: Benchmark agent architectures as systems: trace calls, retries, tokens, latency, failures, and stability.
  • Paper: https://arxiv.org/abs/2601.00481
  • Code: https://github.com/sands-lab/maestro

41. TrajectoryGuard

Trajectory Monitoring & Analysis Tools · Failure Attribution

  • Background problem: Agent plans can be wrong for the task or structurally incoherent, and LLM judges are too slow for runtime screening.
  • Fundamental idea: Train a Siamese GRU autoencoder on task/trajectory pairs, combining contrastive alignment with sequence reconstruction anomaly signals.
  • Takeaway: Use learned trajectory guards for fast pre-execution checks; escalate uncertain or long-horizon cases to heavier judges.
  • Paper: https://arxiv.org/abs/2601.00516
  • Code: Not found

42. Features to Actions

Trajectory Monitoring & Analysis Tools

  • Background problem: Feature attribution explains one prediction, but tool agents fail across state, action, observation trajectories.
  • Fundamental idea: Package each run as a Minimal Explanation Packet: trace evidence plus rubric flags for intent, tools, state, recovery.
  • Takeaway: Use SHAP for aggregate rubric importance; use trace rubrics to locate the failed step or violated constraint.
  • Paper: https://arxiv.org/abs/2602.06841
  • Code: https://github.com/VectorInstitute/unified-xai-evaluation-framework