LLM Agent Trajectory Analysis

Unofficial paper reading:
A Survey for LLM Agent Trajectory Analysis: From Failure Attribution to Enhancement

codgician

2026-04-27

From stack traces to trajectories

Much traditional debugging is code/config/runtime-location oriented; agent debugging is often outcome-to-trajectory oriented.

Traditional software

  • Evaluation signal: a test failure, exception, crash, or violated specification.

  • Debugging object: source locations, stack traces, breakpoints, and conventional logs.

  • Repair target: code paths, data structures, configuration, or dependency behavior.

LLM agent systems

  • Evaluation signal: a task mismatch, unsafe action, inefficient run, or unrecoverable state.

  • Debugging object: trajectories with prompts, messages, tool calls, observations, handoffs, and state.

  • Repair target: policy, prompt/context, tool interface, workflow graph, verifier, memory, or supervisor.

Final-output scoring is not enough for agents; rich traces become the main evidence for attribution, repair, and regression testing.

Why ordinary evaluation is not enough

Final-output evaluation compresses a run to one bit and hides where it went wrong:

  • did planning decompose the task?
  • was the prompt + context constructed right?
  • was the right tool called with the right args?
  • was the observation interpreted correctly?
  • did routing / verification / supervision fire?

Trajectory analysis keeps that evidence and asks a sharper question:

  • which step changed the future of the run?
  • was the wrong answer caused there or downstream?
  • what evidence supports that step as the cause?
  • which control surface, if edited, would flip the outcome?
  • and would the same edit hold across other runs?

Engineering synthesis: instrument the execution path before tuning prompts. Otherwise every failure looks like “the model gave a bad answer.”

Formal execution model

An agent system $M = \{a_1, \ldots, a_N\}$ executes a task $\tau$ in discrete steps. At step $t$: select an agent $a(t) = g(h_{t-1})$ → form the input $x_t = \phi(h_{t-1})$ → run the agent $y_t = \pi_{a(t)}(x_t)$ → update the context $h_t = u(h_{t-1}, a(t), x_t, y_t)$ → emit the observable $o_t$.

Four control surfaces appear in this loop: selection $g$, input formation $\phi$, agent policy $\pi_a$, and context update $u$. Repair edits one of these; the sketch below makes the loop concrete.
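A minimal Python sketch of that step cycle, assuming hypothetical runtime hooks (select_agent, form_input, policies, update_context, emit, done) that are not defined by the survey; each hook corresponds to one of the control surfaces above.

```python
# Minimal sketch of the formal execution loop. All hook names are illustrative.
def run_episode(h0, select_agent, form_input, policies, update_context, emit,
                max_steps=50, done=lambda h: False):
    """One trajectory: repeatedly select, form input, act, update context, emit."""
    h = h0
    for t in range(max_steps):
        agent_id = select_agent(h)                     # a(t) = g(h_{t-1})
        x_t = form_input(h)                            # x_t = phi(h_{t-1})
        y_t = policies[agent_id](x_t)                  # y_t = pi_{a(t)}(x_t)
        h = update_context(h, agent_id, x_t, y_t)      # h_t = u(h_{t-1}, a(t), x_t, y_t)
        emit({"step": t, "agent": agent_id, "input": x_t, "output": y_t})  # observable o_t
        if done(h):
            break
    return h
```

Repair, in these terms, means replacing one of the callables, not only rewording the prompt text that one of them happens to build.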

Observed trace vs omitted state

The survey contrasts partial-observability traces (outputs only) with full-observability traces that include inputs, prompts, and environment state.

Diagnosis can only use what was preserved at run time. Whatever the trace omits becomes a guess that taxonomy, attribution, and repair will inherit.

Trace quality is an engineering choice that bounds every later layer: define what evidence you need for taxonomy and attribution before you decide what the trace must capture.

Engineering reading of the five dimensions

Failure Taxonomy

  • What it is: A shared map of what kind of failure a trajectory exhibits.
  • Why it matters: The category chooses the diagnostic search space (plan, memory, handoff, verifier, environment, runtime).
  • What it enables next: Targeted attribution and repair instead of one-size-fits-all prompt tweaks.

What a taxonomy must make explicit

For a failed trajectory $T$, a taxonomy should assign $\tau(T, e) \rightarrow (\text{view}, \text{failure type}, \text{evidence span}, \text{repair hint})$, where $e$ is the trace evidence that supports the class.

  • View: Which lens the failure is read through (phase, capability module, system/interaction, domain).
  • Failure type: A short, repair-oriented label that distinguishes this failure from other categories.
  • Evidence span: The trace segment that justifies the label (plan step, memory update, handoff, verifier event, environment effect).
  • Repair hint: What a developer should look at next: prompt, context update, workflow graph, verifier, tool, or supervisor.

The important choice is not the label name. It is what evidence the label asks a developer to inspect and what repair it makes plausible.
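A minimal data sketch of that tuple in Python; the field names mirror the list above, while the class name and example values are purely illustrative.

```python
# Hypothetical record for one taxonomy assignment tau(T, e).
from dataclasses import dataclass

@dataclass
class FailureLabel:
    view: str                        # phase, capability module, system/interaction, or domain
    failure_type: str                # short, repair-oriented label
    evidence_span: tuple[int, int]   # trace step indices that justify the label
    repair_hint: str                 # prompt, context update, workflow graph, verifier, tool, supervisor

label = FailureLabel(
    view="system/interaction",
    failure_type="handoff_dropped_constraint",
    evidence_span=(12, 14),
    repair_hint="add a verifier at the planner-to-executor handoff",
)
```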

Four ways to choose the diagnostic search space

Four complementary perspectives the survey identifies for organizing failure taxonomies.

Highlight papers

Taxonomy becomes a repair decision.

01 Lu: phase boundaries

Cut the trace into planning / execution / response and label faults per phase.

06 MAST: MAS organization

Sort failures by specification, inter-agent alignment, or verification gaps.

04 AgentRx: unrecoverable labels

Tie labels to violated constraints at the earliest unrecoverable step.

07 Aegis-Song: environment-shaped failures

Classify exploration, exploitation, and resource-exhaustion failures of agent–environment interaction.

01 Lu — phase boundaries

  • Claim: Align labels with the execution phase where evidence first appears: planning, execution, or response generation.
  • Trace lens: The unit of analysis is a phase-delimited run log, not a single final answer.
  • Developer takeaway: Add phase boundaries to traces before building dashboards or evaluators.

06 MAST — multi-agent organization

  • Claim: Shift taxonomy from a single-agent timeline to the organizational structure of multi-agent systems.
  • Trace lens: Evidence is not only what one agent did, but whether role specification, inter-agent alignment, or task verification failed.
  • Developer takeaway: If the trace has handoffs and role contracts, classify coordination before changing prompts; the repair may be topology or verifier design.

04 AgentRx — unrecoverable labels

  • Claim: Make taxonomy repair-oriented by tying labels to evidence-backed constraints and the first unrecoverable critical failure.
  • Trace lens: The trace is checked step by step against policies, tool schemas, prefix constraints, and task-specific requirements.
  • Developer takeaway: Convert informal “agent rules” into checkable constraints; labels become useful when they explain why no later step could recover.

AgentRx — Stage 1: constraint synthesis

Turn vague “agent rules” into checkable constraints anchored at each step; a minimal sketch follows the list.

  • Global constraints built once from the tool schema and domain policy — schema-valid invocation, declared policy compliance.
  • Dynamic constraints built per step from the task instruction and observed prefix — consistency with the latest tool output, prefix-implied obligations.
  • Guarded evaluation — each constraint runs only when its precondition fires; checks are programmatic when possible, semantic (LLM-judged) otherwise.
  • Output: a step-keyed map from constraint to satisfied / violated / not-applicable.
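A sketch of guarded evaluation under the assumptions above; Constraint, Verdict, evaluate_step, and the example check are hypothetical names, and the semantic (LLM-judged) checks are reduced here to plain predicates.

```python
# Hypothetical sketch of guarded, step-keyed constraint evaluation.
from dataclasses import dataclass
from typing import Callable, Literal

Verdict = Literal["satisfied", "violated", "not_applicable"]

@dataclass
class Constraint:
    name: str
    precondition: Callable[[dict], bool]   # guard: does this check apply at this step?
    check: Callable[[dict], bool]          # programmatic or LLM-judged predicate

def evaluate_step(step: dict, constraints: list[Constraint]) -> dict[str, Verdict]:
    """Return a constraint -> verdict map for one trajectory step."""
    verdicts: dict[str, Verdict] = {}
    for c in constraints:
        if not c.precondition(step):       # guard did not fire
            verdicts[c.name] = "not_applicable"
        elif c.check(step):
            verdicts[c.name] = "satisfied"
        else:
            verdicts[c.name] = "violated"
    return verdicts

# Example global constraint: tool-call arguments must stay within the declared schema.
schema_valid = Constraint(
    name="schema_valid_invocation",
    precondition=lambda s: s.get("action_type") == "tool_call",
    check=lambda s: set(s["args"]).issubset(set(s["tool_schema"]["params"])),
)
```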

AgentRx — Stage 2: validation log

Compress violations so the judge reads evidence, not the whole trace; a short sketch follows the list.

  • Record only violations, not the whole trace — keeps the log compact.
  • Attach supporting evidence to every violation — the tool output, prefix span, or policy clause that triggered it.
  • Step-keyed and auditable — the judge can trace a final failure backward through dependent violations.
  • Output: a validation log of step, broken constraint, and supporting evidence per violation.
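A short sketch of that compression; build_validation_log and the evidence lookup are assumptions about how steps are stored, and evaluate_step is any guarded evaluator such as the Stage-1 sketch above.

```python
# Hypothetical sketch: keep only violations, each keyed by step and paired with evidence.
def build_validation_log(trajectory, constraints, evaluate_step):
    """trajectory: list of step dicts; returns a compact, step-keyed, auditable log."""
    log = []
    for i, step in enumerate(trajectory):
        for name, verdict in evaluate_step(step, constraints).items():
            if verdict == "violated":
                log.append({
                    "step": i,
                    "constraint": name,
                    "evidence": step.get("observation") or step.get("result"),
                })
    return log
```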

AgentRx — Stage 3: root-cause judge

Pick the first step from which the agent does not recover, and label its cause; see the sketch after this list.

  • Find the first unrecoverable step — not the first error; the first violation that explains terminal failure.
  • Plan-axis labels (plan: did the agent pursue the right intent?) — instruction adherence, intent–plan mismatch, under-specified intent, unsupported intent.
  • Grounding-axis labels (grounding: did evidence stay faithful?) — invented facts, tool-output misread, handoff failure.
  • Execution-axis labels (execution: did action complete?) — invalid invocation, guardrails triggered, system failure.
  • Output: critical failure step, failure category, and a short rationale.
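One way to sketch the judging step is as prompt construction over the validation log; CATEGORIES mirrors the axis labels above, while build_judge_prompt and its wording are assumptions, not AgentRx's actual interface.

```python
# Hypothetical sketch: the judge reads the compact validation log and names the
# first unrecoverable step, a category, and a short rationale.
CATEGORIES = [
    # plan axis
    "instruction_adherence", "intent_plan_mismatch", "underspecified_intent", "unsupported_intent",
    # grounding axis
    "invented_facts", "tool_output_misread", "handoff_failure",
    # execution axis
    "invalid_invocation", "guardrails_triggered", "system_failure",
]

def build_judge_prompt(task: str, validation_log: list[dict]) -> str:
    lines = [f"Task: {task}", "Violations (step, constraint, evidence):"]
    for v in validation_log:
        lines.append(f"- step {v['step']}: {v['constraint']} | evidence: {v['evidence']}")
    lines.append(
        "Identify the FIRST step from which no later step could recover, "
        f"choose one category from {CATEGORIES}, and give a one-sentence rationale."
    )
    return "\n".join(lines)
```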

07 Aegis-Song — environment-shaped failures

  • Claim: Classify failures by agent–environment interaction, not by chronology or capability module.
  • Trace lens: Group failures into exploration (incomplete information gathering), exploitation (mis-processing of gathered information), and resource exhaustion (turn or token budget).
  • Developer takeaway: When the symptom is “the agent ran out of budget” or “explored too narrowly,” the right repair surface is the environment / tool interface, not the prompt.

Failure Attribution

  • What it is: Identify who caused the failure, when it became unrecoverable, and why the symptom appeared there.
  • Why it matters: Repair without attribution is guesswork; attribution narrows the component to change.
  • What it enables next: A concrete repair target: policy, prompt, context update, handoff, verifier, tool, or runtime controller.

When failure becomes inevitable

  • Let $\Omega(T) \in \{0,1\}$ be the task outcome of trajectory $T$.
  • A step $t$ is the failure boundary if every feasible continuation from the prefix $T_{\leq t}$ fails (the survey’s “step at which failure becomes inevitable”).
  • Target: Attribution looks for $t^\ast$, the earliest such step, typically far before the visible wrong answer. In practice, methods approximate $t^\ast$ with labels; a replay-based approximation is sketched below.
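A rough empirical approximation of $t^\ast$, assuming replay hooks (rollout_from, outcome) that your runtime would have to provide; sampled continuations can only estimate inevitability, not prove it.

```python
# Hypothetical sketch: estimate the failure boundary by replaying a few
# continuations from each prefix and finding the earliest prefix that never recovers.
def estimate_failure_boundary(trajectory, rollout_from, outcome, samples=3):
    """Return the earliest step index whose prefix fails in every sampled replay."""
    for t in range(len(trajectory)):
        prefix = trajectory[: t + 1]
        # If no sampled continuation from this prefix succeeds, failure is
        # (empirically) inevitable from step t onward.
        if all(outcome(rollout_from(prefix)) == 0 for _ in range(samples)):
            return t
    return None  # no boundary found within the sampled replays
```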

Four ways to justify blame

Four attribution paradigms the survey identifies, ordered by analytical depth.

Highlight papers

Attribution targets and evidence.

13 Who&When: label target

Defines culprit-agent and decisive-step labels for benchmarkable attribution.

18 CHIEF: causal structure

Converts flat logs into hierarchical causal graphs and back-traces dependencies.

22 DoVer: intervention evidence

Edits the orchestrator message or plan and replays to validate the hypothesis.

13 Who&When — attribution view

  • Claim: Turn failure attribution into a labeled target: identify the responsible agent and the earliest decisive error step.
  • Trace lens: The required object is an indexed multi-agent log with agent names, step numbers, task context, and outcome evidence.
  • Developer takeaway: Store stable agent and step identifiers; without them, “who failed” and “when it became unrecoverable” cannot be benchmarked or audited.

18 CHIEF — causal structure

  • Claim: Deepen attribution by converting flat logs into hierarchical causal graphs, then backtracking through dependencies and counterfactual screens.
  • Trace lens: The useful record contains subtasks, agents, structured step records, handoffs, data references, loops, and tool inputs/outputs.
  • Developer takeaway: Preserve edges, not just events. Propagation paths explain why the visible bad step may be a downstream symptom.

CHIEF — Stage 1: causal graph construction

Make the flat trace structurally readable; a minimal graph sketch follows the list.

  • Per step → OTAR. Extend the prior Thought / Action / Result with Observation — a slot for what each step received.
  • Trace → subtasks. RAG decomposition + trajectory-aligned reflection.
  • Three typed edges: subtask order, agent collaboration, and step-level data flow (upstream Result → downstream Observation).
  • Output: a hierarchical causal graph with subtask and agent nodes.
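A minimal sketch of such a graph in Python; OTARStep, CausalGraph, and the edge-kind strings are illustrative names chosen to mirror the bullets above, not CHIEF's released code.

```python
# Hypothetical sketch: OTAR step records plus three typed edge kinds.
from dataclasses import dataclass, field

@dataclass
class OTARStep:
    step_id: int
    agent: str
    subtask: str
    observation: str   # what the step received (added to Thought / Action / Result)
    thought: str
    action: str
    result: str

@dataclass
class CausalGraph:
    steps: dict[int, OTARStep] = field(default_factory=dict)
    edges: list[tuple[int, int, str]] = field(default_factory=list)  # (src, dst, kind)

    def add_step(self, step: OTARStep) -> None:
        self.steps[step.step_id] = step

    def link(self, src: int, dst: int, kind: str) -> None:
        # kind is one of "subtask_order", "agent_collaboration", "data_flow"
        self.edges.append((src, dst, kind))

    def upstream(self, step_id: int, kind: str = "data_flow") -> list[int]:
        """Walk data-flow edges backwards, e.g. to find where a corrupted value entered."""
        return [s for (s, d, k) in self.edges if d == step_id and k == kind]
```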

CHIEF — Stage 2: oracle-guided backtracking

Build a per-subtask checklist; walk the graph through it coarse to fine (sketched after the list).

  • Per-subtask oracle. A 4-line LLM-written checklist: Goal (what this phase should achieve), Pre (what must hold before it starts), Evidence (facts/tool returns to verify), Pass/Fail (falsifiable post-hoc).
  • Top-down walk (reverse topological): subtask fails Pass/Fail → agent OTAR violates Pre/Evidence → step breaks the checklist.
  • Prune subgraphs that pass their oracle; drill into the rest.
  • Output: Failure Candidates — narrowed, not yet proven causes.
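A sketch of the coarse-to-fine walk under these assumptions; Oracle's fields mirror the checklist, the lookup hooks (steps_of, subtask_state, step_record) are hypothetical, and the LLM-written checks are reduced to plain predicates.

```python
# Hypothetical sketch of oracle-guided backtracking over the causal graph.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Oracle:
    goal: str
    pre: Callable[[dict], bool]        # must hold before the subtask starts
    evidence: Callable[[dict], bool]   # facts / tool returns to verify
    passed: Callable[[dict], bool]     # falsifiable post-hoc Pass/Fail

def backtrack(subtasks_reverse_topo, oracles, steps_of, subtask_state, step_record):
    """Return failure candidates: (subtask, step_id) pairs whose records break the checklist."""
    candidates = []
    for sub in subtasks_reverse_topo:
        oracle = oracles[sub]
        if oracle.passed(subtask_state(sub)):
            continue                      # prune: this subtask's subgraph passed its oracle
        for step_id in steps_of(sub):     # drill down coarse -> fine
            rec = step_record(step_id)    # the step's OTAR record
            if not oracle.pre(rec) or not oracle.evidence(rec):
                candidates.append((sub, step_id))
    return candidates                     # narrowed, not yet proven causes
```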

CHIEF — Stage 3: counterfactual attribution

Filter candidates along three axes: scope, propagation, persistence.

  • Local (scope: did it start here?) — no upstream cause explains the bad output → origin is the step itself.
  • Planning-control (propagation: control loop?) — planner repeats the same plan after error signals, or executor keeps violating valid replans.
  • Data-flow (propagation: corrupted value?) — walk step-edges back to the earliest step where valid inputs first became wrong.
  • Deviation-aware (persistence: did it stick?) — drop the candidate if a later step re-satisfies the oracle.
  • Output: one tuple (Agent, Step, Reason).

22 DoVer — intervention evidence

  • Claim: Treat attribution as an experimental question: hypothesize a failure point, edit the orchestrator message or plan, and replay from that checkpoint.
  • Trace lens: The trace must be segmentable into trials with preserved context, checkpoints, and milestone evaluations.
  • Developer takeaway: Build replay hooks early; causal confidence improves when suspected failures can be validated, refuted, or marked inconclusive.

DoVer — Stage 1: trial segmentation

Cut the session so each re-plan gets its own attribution unit; a segmentation sketch follows the list.

  • Cut at re-plan steps — a planning/re-planning event marks a new trial boundary.
  • One trial = one plan — the contiguous span from a planning step through everything executed under that plan, until the next re-plan.
  • Prompt-based, not framework-specific — generalises to systems where re-plan markers aren’t explicit.
  • Output: trial segments, each treated as its own attribution candidate.
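A minimal segmentation sketch; in the paper the re-plan boundary is detected with a prompt, so the keyword-style is_replan predicate here is only a stand-in.

```python
# Hypothetical sketch: cut the message stream at planning / re-planning events.
def segment_trials(messages, is_replan=lambda m: m.get("type") == "plan"):
    trials, current = [], []
    for msg in messages:
        if is_replan(msg) and current:
            trials.append(current)   # close the previous trial at the new plan
            current = []
        current.append(msg)
    if current:
        trials.append(current)
    return trials  # each trial: one plan plus everything executed under it
```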

DoVer — Stage 2: hypothesise + intervene

Turn each trial into a testable edit, not a final verdict.

  • Per-trial hypothesis — log-based attribution names a suspected faulty agent, step, and rationale.
  • Treat hypothesis as testable, not authoritative — correctness is deferred to the replay.
  • Concrete edit at the orchestrator level — Modified Instructions to Sub-Agents or Plan Updates (never tool internals).
  • Output: a targeted intervention at the suspected fault point.

DoVer — Stage 3: replay + verify

Replay the edit and label outcomes along success and faithfulness; the verdict rule is sketched after this list.

  • Replay from the checkpoint (setup: same past, edited next step) — preserve all earlier state, then re-run the edited trial.
  • Validated (success: changed; faithfulness: followed edit) — at least 2 of 3 replays now succeed.
  • Partial / refuted (success: partial or unchanged; faithfulness: followed edit) — milestone progress improves, or the failure persists.
  • Inconclusive (faithfulness: edit not carried out) — replay cannot test the hypothesis.
  • Output: one validation label per trial.
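A sketch of that verdict rule, assuming each replay is summarized as a small record (followed_edit, succeeded, milestones_improved); the 2-of-3 threshold follows the bullet above, everything else is illustrative.

```python
# Hypothetical sketch: label the hypothesis from a few replays of the edited trial.
def verify_intervention(replays):
    """replays: list of dicts with 'followed_edit', 'succeeded', 'milestones_improved'."""
    faithful = [r for r in replays if r["followed_edit"]]
    if not faithful:
        return "inconclusive"          # the edit was never actually carried out
    successes = sum(r["succeeded"] for r in faithful)
    if successes >= 2:                 # e.g. at least 2 of 3 faithful replays succeed
        return "validated"
    if any(r["milestones_improved"] for r in faithful):
        return "partial"
    return "refuted"                   # edit was followed but the failure persists
```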

Trajectory Monitoring & Analysis Tools

  • What it is: The evidence and control layer that decides what a developer can see, replay, compare, and intervene on.
  • Why it matters: Attribution and repair can only use what the trace preserves; output-only logs force diagnosis to guess.
  • What it enables next: Capture exactly the trace fields that the taxonomy and attribution objectives demand.

What observability must preserve

From passive monitoring to active debugging

Passive system-level monitoring

  • captures logs, events, metrics, side effects
  • summarizes long traces and surfaces patterns
  • flags anomalies and suspicious trajectories
  • does not change the run while observing

Active interactive debugging

  • inspects and annotates trajectory steps
  • resets, edits, and replays from a checkpoint
  • forks runs to test counterfactual edits
  • steers behavior with operator interventions

Highlight papers

Evidence capture and control.

32 AgentSight: system effects

Correlates LLM intent with kernel-level subprocess, file, and network events.

36 AGDebugger: control primitives

Exposes pause, checkpoint, reset, edit, fork, and compare across the trajectory.

32 AgentSight — system effects

  • Claim: Expand observability below the model log by correlating LLM intent signals with subprocesses, files, network, and kernel-level effects.
  • Trace lens: The trace has two streams — high-level intent and low-level system actions — joined by lineage, timing, and argument matching.
  • Developer takeaway: Monitor what the agent actually did to the system, not only what it said it intended to do.

36 AGDebugger — control primitives

  • Claim: Make trajectory analysis interactive by exposing pause, checkpoint, reset, edit, fork, and comparison operations over multi-agent sessions.
  • Trace lens: The trace is a rewindable state machine: message history plus checkpoints that can be modified and replayed.
  • Developer takeaway: Debugging tools should make counterfactual inspection cheap; a developer should test whether changing one message changes downstream behavior.

AGDebugger — Stage 1: inspect + steer

Expose the live message stream so the operator can steer before failure hardens.

  • Live message viewer — agent-to-agent traffic visible as it happens.
  • Pause / play / step — drive the message queue at any granularity.
  • Send new messages mid-run — broadcast to all agents, or targeted to one.
  • Output: an inspectable, controllable message stream.

AGDebugger — Stage 2: reset + edit

Restore state before testing a counterfactual edit; a checkpoint-and-fork sketch follows the list.

  • Checkpoint per message — agent state saved before each new message via save_state.
  • Edit historical messages inline, then reset to that timestamp — restores the corresponding checkpoint via load_state.
  • Fork the session — the original branch is preserved; the edited path runs as a new session.
  • Output: a forked session with an edit candidate, replayable from the fork point.
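A toy checkpoint-and-fork sketch; save_state / load_state echo the bullets, but SessionDebugger and its methods are illustrative, not AGDebugger's API.

```python
# Hypothetical sketch: checkpoint per message, then fork a branch with one edited message.
import copy

class SessionDebugger:
    def __init__(self):
        self.messages = []          # the live message history
        self.checkpoints = {}       # message index -> state saved before that message

    def record(self, message, agent_state):
        self.checkpoints[len(self.messages)] = copy.deepcopy(agent_state)  # save_state
        self.messages.append(message)

    def fork_with_edit(self, index, edited_message):
        """Return a new branch: same history before `index`, one edited message appended."""
        branch = SessionDebugger()
        branch.messages = copy.deepcopy(self.messages[:index])
        branch.checkpoints = {i: copy.deepcopy(s)
                              for i, s in self.checkpoints.items() if i <= index}
        restored = branch.checkpoints[index]                                # load_state
        branch.messages.append(edited_message)
        return branch, restored     # the original branch (self) is preserved
```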

AGDebugger — Stage 3: overview compare

Compare branches so edits become visible regression tests.

  • Vertical timeline — every message a rectangle, every fork a new column.
  • Forks marked with a horizontal dash; pre-fork shared history is shown at lower opacity.
  • Color toggle — encode message type, sender, or recipient depending on what you’re hunting.
  • Output: a branched comparison view — original run vs each counterfactual run, aligned by step.

System Enhancement & Optimization

  • What it is: Trace-guided editing of the agent system so future runs become more capable, reliable, efficient, or robust.
  • Why it matters: Diagnosis only earns its keep when it leads to a tested system change rather than another postmortem label.
  • What it enables next: Choose the right control surface to edit: policy, selection, context update, input formation, workflow, or supervisor.

What can be optimized

The formal target is not “make the prompt better.” It is choosing the system component whose change should improve the measured objective.

Three places a trace can change the system

Three enhancement families the survey identifies, by where the edit lands.

Highlight papers

Where the repair happens.

24 Maestro: graph-plus-config repair

Jointly searches workflow-graph and configuration edits from evaluator feedback.

31 SupervisorAgent: runtime control

Approves, guides, or corrects at risky interaction boundaries during execution.

07 Aegis-Song: environment optimization

Fixes failures by editing the environment and tool interface, not the agent prompt.

24 Maestro — graph-plus-config repair

  • Claim: Frame enhancement as joint optimization over workflow graph structure and configuration, guided by trace feedback.
  • Trace lens: Failures are evidence about missing computation, routing, validation, state, or tool operations in a typed agent graph.
  • Developer takeaway: Do not tune prompts when the graph lacks the operation needed to recover; add the missing node, edge, validator, or state variable.

31 SupervisorAgent — runtime control

  • Claim: Add a lightweight meta-agent that watches high-risk interactions and intervenes while the trajectory is still recoverable.
  • Trace lens: The supervisor observes agent-agent, agent-tool, and agent-memory events with local and global context summaries.
  • Developer takeaway: Put supervision at interaction boundaries: approve, guide, correct observations, or run verification before errors become irreversible.

07 Aegis-Song — environment optimization

  • Claim: Repair agent failures by editing the environment, not the prompt: enhance observability, offload deterministic computation, and speculate bundled actions.
  • Trace lens: When traces show exploration/exploitation/resource-exhaustion failures, the implicated control surface is the environment + tool interface, not the agent policy $\pi_a$.
  • Developer takeaway: Before tuning prompts, ask: would the agent succeed if the environment exposed more state, did the deterministic work itself, or accepted batched actions?

Datasets & Benchmarks

  • What they are: The field’s measurement and training substrate: what counts as progress, and what diagnostic models learn from.
  • Why they matter: Benchmark labels become incentives; if they reward only agent/step accuracy, repair utility is invisible.
  • What they enable next: Compare attribution and enhancement methods, and run regression suites on real or injected failures.

What benchmark design optimizes for

Open critique: Agent-level and step-level accuracy are useful, but they do not fully measure whether attribution helps produce a safe, durable repair. The survey reports limited step-level attribution on established comparisons and points to benchmark diversity and observability as bottlenecks.

Two ways to build trajectory evaluation data

Real-world failure collections

  • preserve naturally occurring failures from real or realistic systems
  • reflect messy production conditions and repair needs
  • expensive to collect and annotate at scale
  • biased toward what was actually deployed

Synthetic error-injection datasets

  • start from successful trajectories and inject controlled faults (see the sketch after this list)
  • scale to thousands of labelled examples cheaply
  • support training data-hungry attribution models
  • must be checked for realism vs production failures
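A toy fault-injection sketch under these assumptions; the fault set, step format, and label schema are illustrative, and real pipelines (e.g. MAST-style injection) corrupt prompts or responses rather than editing step dictionaries.

```python
# Hypothetical sketch: corrupt one step of a successful trajectory and keep the
# injected step/type as free ground-truth labels for attribution training.
import random

FAULTS = {
    "wrong_tool_args": lambda step: {**step, "args": {}},
    "drop_observation": lambda step: {**step, "observation": ""},
    "hallucinated_result": lambda step: {**step, "result": "[fabricated value]"},
}

def inject_fault(trajectory, rng=random):
    idx = rng.randrange(len(trajectory))
    fault_name, corrupt = rng.choice(list(FAULTS.items()))
    corrupted = list(trajectory)
    corrupted[idx] = corrupt(trajectory[idx])
    # Ground-truth labels come for free: the injected step and fault type.
    return corrupted, {"decisive_step": idx, "fault": fault_name}
```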

Highlight papers

Evaluation roles.

13 Who&When

Canonical agent + decisive-step labels for benchmarkable attribution.

23 TraceElephant

Full traces and reproducible environments for replay-ready evaluation.

04 AgentRx

Repair-utility labels: critical step, category, rationale, evidence.

13 Who&When — benchmark view

  • Claim: As a benchmark, operationalize attribution as two measurable labels: culprit agent and decisive error step.
  • Trace lens: The dataset turns trace reading into comparable evaluation across alternative LLM-judge strategies.
  • Developer takeaway: Use this benchmark lens to test whether your traces expose enough indexing and context for reproducible blame assignment.

23 TraceElephant — benchmark view

  • Claim: Test attribution under full observability, including complete traces and reproducible environments.
  • Trace lens: The benchmark contrasts output-only attribution with full-trace static and dynamic replay/counterfactual probing.
  • Developer takeaway: Evaluate attribution under the observability conditions developers actually have; partial traces can understate achievable diagnosis.

04 AgentRx — benchmark view

  • Claim: As a benchmark, AgentRx packages 115 failed trajectories with three repair-relevant labels: critical step, root-cause category, and supporting evidence.
  • Trace lens: Each failed run is annotated by category, decisive step, and a step-indexed validation log of violated constraints.
  • Developer takeaway: Use this benchmark when the question is “does my attribution method enable a repair?” — not just “does it match the agent/step label?”

Takeaways

  • One closed loop: collect → classify → attribute → repair → evaluate → operate.
  • Still research-stage: SOTA exact step-level attribution ≈ 30% (CHIEF on Who&When); full-observability ceiling ≈ 33% (TraceElephant).
  • For builders: no turnkey tool yet — pick the right axis, expect partial coverage, build the loop incrementally.

Closed-loop agent improvement

Engineering synthesis over the survey: collect enough evidence, constrain diagnosis, approximate the causal boundary, edit the right component, and evaluate whether the edit improves future systems.

Current limitations and open problems

  • Attribution accuracy is not enough — agent-level and step-level metrics do not necessarily measure repair utility.
  • Benchmark observability can be mismatched — partial traces understate what developers actually see in internal debugging.
  • Tools remain fragmented — logging, visualization, replay, anomaly detection, RCA, and repair live in separate systems.
  • Domain structure is underused — coding, GUI, web, and embodied agents expose different trace shapes and failure modes.

Closing thesis

LLM agent trajectory analysis is becoming the engineering layer that connects evaluation, debugging, system optimization, observability, and operations.

The trajectory is the evidence. Taxonomy gives the diagnostic lens. Attribution identifies the causal boundary. Enhancement edits the system. Monitoring and benchmarks make the loop repeatable.

For agent developers, the practical question is no longer “did this run fail?” It is: what evidence proves why it failed, what component should change, and how do we know the repair generalizes?

Appendix

  • Each appendix slide summarizes one referenced paper.
  • The template is: background problem, fundamental idea, developer/evaluation takeaway, and survey areas.
  • Numbering note: The two-digit prefixes used in the deck (e.g. 01 Lu, 13 Who&When) follow our local analysis-note IDs, not the survey’s Table 1 IDs.

01. Exploring Autonomous Agents

Failure Taxonomy

  • Background problem: Success rate hides which agent role failed when the planner, code generator, executor, and final-answer stages hand off work.
  • Fundamental idea: Inspect full run logs from 204 executions, label failures into 19 causes under planning, execution, and response-generation phases.
  • Takeaway: Store per-iteration prompts, code, outputs, and errors, then route repeated failures to replanning, local repair, or early stop.
  • Paper: https://arxiv.org/abs/2508.13143
  • Code: https://github.com/lurf21/AgentEvaluationFramework

02. Where LLM Agents Fail

Failure Taxonomy · Failure Attribution · Datasets & Benchmarks

  • Background problem: Early agent mistakes cascade, but prior failure studies label errors without tracing the root cause or enabling fixes.
  • Fundamental idea: AgentDebug labels each step/module, uses counterfactual tests to find the earliest failure-causing step, then re-rolls out with targeted feedback.
  • Takeaway: Debug failed agents from the first causal step, not every visible mistake; use AgentErrorBench-style annotations to test localization and recovery.
  • Paper: https://arxiv.org/abs/2509.25370
  • Code: https://github.com/ulab-uiuc/AgentDebug

03. TRAIL

Failure Taxonomy · Failure Attribution · Datasets & Benchmarks

  • Background problem: Agent evals can score final answers, but developers need span-level root causes inside huge structured traces.
  • Fundamental idea: Annotate OpenTelemetry traces from GAIA/SWE-Bench with error category, span location, evidence, impact, and quality scores.
  • Takeaway: Use TRAIL to test whether an evaluator debugs real agent runs, not just final answers or synthetic planning cases.
  • Paper: https://arxiv.org/abs/2505.08638
  • Code: https://github.com/patronus-ai/trail-benchmark

04. AgentRx

Failure Taxonomy · Failure Attribution · Datasets & Benchmarks

  • Background problem: Terminal success hides the first unrecovered mistake; debugging needs step-level evidence, not just outcome labels.
  • Fundamental idea: Turn tool schemas, policies, and trajectory prefixes into guarded checks; log violations with evidence for each step.
  • Takeaway: Instrument agents so judges can trace “what constraint broke when” before deciding the unrecoverable root cause.
  • Paper: https://arxiv.org/abs/2602.02475
  • Code: https://github.com/microsoft/AgentRx

05. Lifecycle of Failures

Failure Taxonomy · Failure Attribution · Datasets & Benchmarks

  • Background problem: In platform agent workflows, visible errors often surface far from the causal node after prompts, tools, and control logic interact.
  • Fundamental idea: AgentFail labels 307 Dify/Coze failures by root location, cause level/category, propagation distance, and repair strategy.
  • Takeaway: Debug by proving the earliest decisive node, then apply cause-matched fixes; taxonomy plus location made repairs safer.
  • Paper: https://arxiv.org/abs/2509.23735
  • Code: https://github.com/Jenna-Ma/JaWs-AgentFail

06. MAST

Failure Taxonomy

  • Background problem: MAS benchmark failures hide whether the root cause is system design, inter-agent misalignment, or task verification.
  • Fundamental idea: Derive MAST bottom-up from 150+ failed traces, yielding 14 failure modes across those three axes.
  • Takeaway: Label failed traces first; then fix workflow design, agent information flow, or verification instead of blindly swapping models.
  • Paper: https://arxiv.org/abs/2503.13657
  • Code: https://github.com/multi-agent-systems-failure-taxonomy/MAST

07. Aegis

Failure Taxonomy · System Enhancement & Optimization

  • Background problem: Agents fail differently across DB, filesystem, CRM, and medical environments: missing state, losing state, miscomputing outputs, violating rules, exhausting turns.
  • Fundamental idea: Treat tools as reliability infrastructure: expose lookahead/state, offload sorting/calculation/rule checks, and speculate common follow-up calls.
  • Takeaway: Don’t only tune the agent; redesign tool responses so correct behavior becomes retrieval, validation, or bundled execution.
  • Paper: https://arxiv.org/abs/2508.19504
  • Code: Not released

08. LLMs in Agentic Scenarios

Failure Taxonomy

  • Background problem: Aggregate agent scores hide why tool-using LLMs fail in enterprise-like workflows.
  • Fundamental idea: Manually code 900 KAMI traces across filesystem, text, CSV, and SQL tasks.
  • Takeaway: Require grounding, value verification, distractor control, and missing-entity discipline before trusting autonomous tool outputs.
  • Paper: https://arxiv.org/abs/2512.07497
  • Code: Not found

09. FAMAS

Failure Attribution

  • Background problem: A failed MAS log rarely reveals whether an agent action caused failure or only appears downstream.
  • Fundamental idea: Replay the task, cluster logs into agent-action-state triples, then rank triples with λ-decayed Kulczynski2 plus α/β/γ factors.
  • Takeaway: Use pass/fail replay spectra when failures recur; FAMAS is statistical attribution, not single-log LLM judging.
  • Paper: https://arxiv.org/abs/2509.13782
  • Code: Not found

10. Traceability and Accountability

Failure Attribution · Trajectory Monitoring & Analysis Tools

  • Background problem: Sequential agent pipelines hide where failures begin; final-output scoring cannot separate planner mistakes from executor or critic harm.
  • Fundamental idea: Record P/E/C answers, final answer, repair flags, harm flags, and earliest unrepaired error origin.
  • Takeaway: Design handoffs around computable accountability: each stage should expose whether it repaired, preserved, or damaged the prior state.
  • Paper: https://arxiv.org/abs/2510.07614
  • Code: Not found

11. CORRECT

Failure Attribution · Datasets & Benchmarks

  • Background problem: Multi-agent failures cascade through long logs; developers need the first bad agent-step, not just a failed-run label.
  • Fundamental idea: Distill past annotated failures into cached schemas: signatures, triggering context, propagation patterns, and detection heuristics.
  • Takeaway: Use embedding retrieval over schemas, not raw traces or fine-tuning, to guide an LLM’s step-level failure attribution.
  • Paper: https://arxiv.org/abs/2509.24088
  • Code: Not found

12. SDBL

Failure Attribution

  • Background problem: Exact failure-step attribution in long multi-agent logs overwhelms general LLMs, especially when failures appear late or require MAS-specific expertise.
  • Fundamental idea: First shrink the log to a small suspect scope, then localize; scopes come from stepwise expansion or Overstep/Loop expert heuristics.
  • Takeaway: On Who&When, SDBL raises step accuracy by up to 24.27 percentage points, showing scoped diagnosis beats one-shot attribution.
  • Paper: https://ojs.aaai.org/index.php/AAAI/article/view/40594
  • Code: https://github.com/Wen-qiangLi/SDBL

13. Who&When

Failure Attribution · Datasets & Benchmarks

  • Background problem: Multi-agent failure attribution lacks ground truth: who is the responsible agent and exactly which step is the decisive error?
  • Fundamental idea: Curate 184 annotated tasks from 127 MAS runs over GAIA/AssistantBench; label decisive agent-step pairs with rationale.
  • Takeaway: Test full-context vs incremental judging — agent attribution differs from step attribution; one-pass prompting is not enough.
  • Paper: https://arxiv.org/abs/2505.00212
  • Code: https://github.com/ag2ai/Agents_Failure_Attribution

14. ECHO

Failure Attribution

  • Background problem: Flat logs hide long-range error propagation; local windows miss causally relevant earlier turns.
  • Fundamental idea: Build four positional context layers, run specialist analyst personas, and aggregate agent/step votes with confidence weighting and disagreement checks.
  • Takeaway: Use compressed context plus voting when trace length matters; unlike CHIEF graphs or RAFFLES loops, ECHO is positional, not causal.
  • Paper: https://arxiv.org/abs/2510.04886
  • Code: Not found

15. RAFFLES

Failure Attribution

  • Background problem: Single-pass judges miss decisive faults in long agent traces where early mistakes, symptoms, and recoverable errors blur together.
  • Fundamental idea: A Judge proposes a fault step; specialized Evaluators score fault, primacy, decisiveness, and log consistency, feeding memory into retries.
  • Takeaway: For trace debugging, use confidence-gated verifier loops, not one-shot attribution; stop when evaluators agree or max iterations expire.
  • Paper: https://arxiv.org/abs/2509.06822
  • Code: Not found

16. CDC-MAS

Failure Attribution

  • Background problem: MAS failures often look downstream; log/data-flow analysis blames the final symptom, not the upstream bad decision.
  • Fundamental idea: Reverse data-flow into a performance-causality graph, then use Shapley, ACE, and counterfactual repairs to rank agents and steps.
  • Takeaway: Read for causal MAS debugging: strongest idea is performance causal inversion; evidence improves attribution but remains imperfect on hard traces.
  • Paper: https://arxiv.org/abs/2509.08682
  • Code: Not found

17. A2P

Failure Attribution

  • Background problem: Long agent logs hide the decisive error; “spot the bad step” judging confuses correlation with cause.
  • Fundamental idea: Prompt one judge to abductively infer a cause, propose a minimal fix, then simulate 3-5 counterfactual turns.
  • Takeaway: For trace debugging, test whether a step’s corrected action would change failure into success; step numbers matter.
  • Paper: https://arxiv.org/abs/2509.10401
  • Code: https://github.com/ResearAI/A2P

18. CHIEF

Failure Attribution

  • Background problem: Flat agent logs hide whether failure began in planning, execution, data flow, or later symptom propagation.
  • Fundamental idea: Build subtask/agent/step graphs, synthesize checkable oracles, then backtrack to the first irreversible corrupted state.
  • Takeaway: Debuggable MAS need intermediate invariants, not just final-answer judges or after-the-fact trace summaries.
  • Paper: https://arxiv.org/abs/2602.23701
  • Code: https://anonymous.4open.science/r/CHIEF-86B8

19. AgenTracer

Failure Attribution

  • Background problem: Prompting strong LLMs to blame failed agent traces gives poor step-level attribution, often below 10% on prior benchmarks.
  • Fundamental idea: Build TracerTraj with counterfactual replay and fault injection, then RL-train Qwen3-8B to output agentID | stepID.
  • Takeaway: Reliable agent debugging needs replay-derived root-cause labels; a small trained tracer can beat generic LLM judges.
  • Paper: https://arxiv.org/abs/2509.03312
  • Code: https://github.com/bingreeky/AgenTracer

20. GraphTracer

Failure Attribution

  • Background problem: Temporal trace attribution confuses late failure symptoms with earlier corrupted information sources in multi-agent deep search.
  • Fundamental idea: Build an Information Dependency Graph: nodes are produced information, edges are explicit citation/reliance links across agent turns.
  • Takeaway: Debug agents by tracing provenance paths, not timelines; target high-impact source nodes and dependency conflicts for realistic failure tests.
  • Paper: https://arxiv.org/abs/2510.10581
  • Code: Not found

21. AegisKong

Failure Attribution · Datasets & Benchmarks

  • Background problem: Manual MAS failure labels are tiny, so attribution models lack enough agent/error-mode pairs to learn root-cause patterns.
  • Fundamental idea: Start from successful trajectories, inject MAST-style faults via prompt injection or response corruption, then keep failed runs as labeled counterfactuals.
  • Takeaway: Aegis turns debugging data into controlled perturbation: 9,533 failures train Qwen/DCL/GRPO models and benchmark agent-error attribution.
  • Paper: https://arxiv.org/abs/2509.14295
  • Code: https://kfq20.github.io/AEGIS-Website/

22. DoVer

Failure Attribution · System Enhancement & Optimization

  • Background problem: Single blame steps break down when failed agent runs contain multiple re-plans, branches, and independently repairable mistakes.
  • Fundamental idea: Segment by re-plan trials, edit the suspected message or plan, then replay from that point with prior context preserved.
  • Takeaway: Treat attribution as executable evidence: success validates it, faithful failure refutes it, blocked execution exposes missing agent capabilities.
  • Paper: https://arxiv.org/abs/2512.06749
  • Code: https://aka.ms/DoVer

23. TraceElephant

Failure Attribution · Datasets & Benchmarks

  • Background problem: Output-only attribution hides prompts, task context, tool logs, and environment state, making the decisive failure step ambiguous.
  • Fundamental idea: TraceElephant packages 220 annotated failures as JSON step records plus runnable MAS environments for replay and counterfactual checks.
  • Takeaway: Log agent inputs and tool interactions, not just outputs; full traces lift step attribution to 30.3%, dynamic replay to 33.3%.
  • Paper: https://openreview.net/forum?id=kLLYJ6Bm7n
  • Code: https://github.com/TraceElephant/TraceElephant

24. Maestro

System Enhancement & Optimization

  • Background problem: Prompt tuning cannot fix agents missing tools, state, validators, or routing.
  • Fundamental idea: Alternate config tuning with local graph edits, guided by scores and reflective trace feedback.
  • Takeaway: Treat agent failures as architecture signals: add the missing node, then retune its settings.
  • Paper: https://arxiv.org/abs/2509.04642
  • Code: https://github.com/relai-ai/relai-sdk

25. CE-Graph

System Enhancement & Optimization

  • Background problem: Scalar workflow scores collapse rich failure traces, so global search like MaAS/AFlow misses recurring structural error modes.
  • Fundamental idea: CE-Graph clusters counterexamples by failing node and error semantics, then verifies RevisePrompt, InsertNode, or DeleteNode graph edits.
  • Takeaway: Optimize agent workflows by reducing dense failure modes, not by blindly searching for higher aggregate benchmark scores.
  • Paper: https://arxiv.org/abs/2510.10035
  • Code: Not found

26. ILWS

System Enhancement & Optimization

  • Background problem: RAG is transient and fine-tuning is heavy; agents need durable domain learning without changing model weights.
  • Fundamental idea: After sessions, reflect on traces and ratings to edit instructions, preferences, and tools under gated rollback.
  • Takeaway: Treat system prompts as versioned pseudo-weights: persist only feedback-proven rules, not every retrieved fact.
  • Paper: https://arxiv.org/abs/2509.00251
  • Code: Not found

27. SCOPE

System Enhancement & Optimization

  • Background problem: Agents often see the right context, but static prompts do not teach them how to react to it.
  • Fundamental idea: Synthesize trace-based guidelines, then route them into tactical or persistent system-prompt memory.
  • Takeaway: For agents, evolve per-role prompts online; do not just append failures to chat history.
  • Paper: https://arxiv.org/abs/2512.15374
  • Code: https://github.com/JarvisPei/SCOPE

28. AgentDevel

System Enhancement & Optimization

  • Background problem: Self-improving agents can raise averages while hiding regressions across versions.
  • Fundamental idea: Run traces, label symptoms blindly, script diagnoses, then promote one RC using pass→fail / fail→pass gates.
  • Takeaway: Build agent improvement as CI: auditable diffs, single version line, regression-first release decisions.
  • Paper: https://arxiv.org/abs/2601.04620
  • Code: Not found

29. ReCreate

System Enhancement & Optimization

  • Background problem: Creating domain agents needs evidence richer than final pass/fail scores.
  • Fundamental idea: Inspect trajectories, verifier logs, artifacts, and environments to edit scaffold components.
  • Takeaway: ReCreate outputs generalized prompts, workflows, tools, and memory—not tuned model weights.
  • Paper: https://arxiv.org/abs/2601.11100
  • Code: https://github.com/zz-haooo/ReCreate

30. AgentDiet

System Enhancement & Optimization

  • Background problem: Agent tool traces accumulate useless, repeated, and expired tokens that get resent every later step.
  • Fundamental idea: Use a cheap reflection LLM to rewrite one delayed long step within a small sliding context window.
  • Takeaway: Reduce cost empirically, not by guarantee: protect success with delay, thresholds, structure preservation, and pass-rate/step-count checks.
  • Paper: https://arxiv.org/abs/2509.23586
  • Code: Not found

31. SupervisorAgent

System Enhancement & Optimization

  • Background problem: Multi-agent runs fail or waste tokens through errors, loops, and bloated tool observations during execution.
  • Fundamental idea: Intercept each ActionStep with cheap heuristics, then replace observations, append guidance, or run verification.
  • Takeaway: Runtime supervision works best as gated process control, not always-on monitoring or post-hoc debugging.
  • Paper: https://arxiv.org/abs/2510.26585
  • Code: https://github.com/LINs-lab/SupervisorAgent

32. AgentSight

Trajectory Monitoring & Analysis Tools

  • Background problem: Framework logs miss shell escapes; syscall monitors miss the LLM intent behind file, process, and network effects.
  • Fundamental idea: Attach eBPF uprobes to TLS reads/writes and kernel probes to syscalls, then correlate by lineage, timing, and argument matches.
  • Takeaway: For coding agents, observability must follow descendant processes, not just tool calls, to catch prompt injection and exfiltration.
  • Paper: https://arxiv.org/abs/2508.02736
  • Code: https://github.com/eunomia-bpf/agentsight

33. AgentOps Automation

Trajectory Monitoring & Analysis Tools · System Enhancement & Optimization

  • Background problem: Deployed agents are nondeterministic, stateful systems; traditional logs and dashboards miss planning, memory, tool, and coordination failures.
  • Fundamental idea: Define AgentOps as observe behavior, collect metrics, detect issues, identify root causes, recommend fixes, and automate runtime operations.
  • Takeaway: Treat agent reliability as production operations: instrument trajectories, compare healthy/failing traces, and close safe feedback loops with automated remediation.
  • Paper: https://arxiv.org/abs/2507.11277
  • Code: Not found

34. AgentDiagnose

Trajectory Monitoring & Analysis Tools · Failure Attribution

  • Background problem: Trajectory replays show what happened, but rarely score how agents explore, plan, read observations, or verify progress.
  • Fundamental idea: Score trajectories on five agentic competencies, then expose patterns through CLI output, dashboard plots, word clouds, and navigation timelines.
  • Takeaway: Debugging tools should turn raw trajectories into selectable behavioral signals that support diagnosis and training-data filtering.
  • Paper: https://aclanthology.org/2025.emnlp-demos.15/
  • Code: https://github.com/oootttyyy/AgentDiagnose

35. Agent Trajectory Explorer

Trajectory Monitoring & Analysis Tools

  • Background problem: Raw agent trajectories mix prompts, reasoning, tool calls, and observations, making human oversight difficult.
  • Fundamental idea: Convert JSON traces via formatter plugins into linear TAO turns, with optional raw-context view.
  • Takeaway: Start debugging UIs with TAO step inspection and per-thought/action positive-negative feedback.
  • Paper: https://ojs.aaai.org/index.php/AAAI/article/view/35350
  • Code: Not found

36. AGDebugger

Trajectory Monitoring & Analysis Tools · System Enhancement & Optimization

  • Background problem: Multi-agent failures live inside stateful message queues; log viewers cannot test “what if this message changed?”
  • Fundamental idea: Checkpoint each agent before messages, then restore, edit one message, and fork a replay branch.
  • Takeaway: Best for counterfactual steering; study users mostly added specificity, simplified tasks, or changed plans.
  • Paper: https://arxiv.org/abs/2503.02068
  • Code: https://github.com/microsoft/agdebugger

37. XAgen

Trajectory Monitoring & Analysis Tools · Failure Attribution · System Enhancement & Optimization

  • Background problem: Multi-agent failures are hard for mixed-expertise users to locate, attribute, and correct from raw logs.
  • Fundamental idea: Convert CrewAI logs into a live flowchart; attach human feedback; use an LLM judge to score task outputs.
  • Takeaway: Explanations matter when they identify a failing node and feed directly into prompt/config edits and reruns.
  • Paper: https://arxiv.org/abs/2512.17896
  • Code: Not found

38. DiLLS

Trajectory Monitoring & Analysis Tools

  • Background problem: Chronological multi-agent logs hide plan failures, skipped actions, and stalled progress across verbose agent/tool exchanges.
  • Fundamental idea: Use natural-language probes to organize traces into activity, action, and operation layers for drill-down diagnosis.
  • Takeaway: Debugging agents needs task-structured trace views: plan updates first, action outcomes next, raw logs last.
  • Paper: https://arxiv.org/abs/2602.05446
  • Code: Not found

39. TAR Study of SE Agents

Trajectory Monitoring & Analysis Tools · Failure Taxonomy

  • Background problem: SE agents leave TAR logs, but failures hide in repeated actions, untested fixes, ignored results, and thought-action mismatches.
  • Fundamental idea: Normalize 120 RepairAgent, AutoCodeRover, and OpenHands trajectories; categorize actions, mine 4-grams, and open-code semantic TAR relations.
  • Takeaway: Successful agents balance explore/fix/test; robust agents should flag repetition, fix-without-test, premature termination, and result-insensitive next actions.
  • Paper: https://arxiv.org/abs/2506.18824
  • Code: https://github.com/sola-st/llm-agents-study

40. MAESTRO Evaluation Suite

Datasets & Benchmarks · Trajectory Monitoring & Analysis Tools

  • Background problem: Final-answer scores miss MAS runtime variance, silent failures, and architecture-driven cost.
  • Fundamental idea: Run 12 heterogeneous MAS examples under one config, telemetry, and post-processing interface.
  • Takeaway: Benchmark agent architectures as systems: trace calls, retries, tokens, latency, failures, and stability.
  • Paper: https://arxiv.org/abs/2601.00481
  • Code: https://github.com/sands-lab/maestro

41. TrajectoryGuard

Trajectory Monitoring & Analysis Tools · Failure Attribution

  • Background problem: Agent plans can be wrong for the task or structurally incoherent, and LLM judges are too slow for runtime screening.
  • Fundamental idea: Train a Siamese GRU autoencoder on task/trajectory pairs, combining contrastive alignment with sequence reconstruction anomaly signals.
  • Takeaway: Use learned trajectory guards for fast pre-execution checks; escalate uncertain or long-horizon cases to heavier judges.
  • Paper: https://arxiv.org/abs/2601.00516
  • Code: Not found

42. Features to Actions

Trajectory Monitoring & Analysis Tools

  • Background problem: Feature attribution explains one prediction, but tool agents fail across state, action, observation trajectories.
  • Fundamental idea: Package each run as a Minimal Explanation Packet: trace evidence plus rubric flags for intent, tools, state, recovery.
  • Takeaway: Use SHAP for aggregate rubric importance; use trace rubrics to locate the failed step or violated constraint.
  • Paper: https://arxiv.org/abs/2602.06841
  • Code: https://github.com/VectorInstitute/unified-xai-evaluation-framework