Final-output scoring is not enough for agents; rich traces become the main evidence for attribution, repair, and regression testing.
Why ordinary evaluation is not enough
Final-output evaluation compresses a run to one bit and hides where it went wrong:
did planning decompose the task?
was the prompt + context constructed right?
was the right tool called with the right args?
was the observation interpreted correctly?
did routing / verification / supervision fire?
Trajectory analysis keeps that evidence and asks a sharper question:
which step changed the future of the run?
was the wrong answer caused there or downstream?
what evidence supports that step as the cause?
which control surface, if edited, would flip the outcome?
and would the same edit hold across other runs?
Engineering synthesis: instrument the execution path before tuning prompts. Otherwise every failure looks like “the model gave a bad answer.”
Formal execution model
Agent system M = {a1, …, aN} executes a task τ in discrete steps. At step t: select a(t) = g(h_{t−1}) → form input x_t = ϕ(h_{t−1}) → run agent y_t = π_{a(t)}(x_t) → update context h_t = u(h_{t−1}, a(t), x_t, y_t) → emit observable o_t.
Four control surfaces: selection g, input formation ϕ, agent policy π_a, context update u. Repair edits one of these.
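The step loop above can be sketched in a few lines of Python. All names here (run_episode, select_g, the toy policy) are illustrative stand-ins, not from the survey; the point is that each callable is a separate repair surface.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Step:
    agent: str
    x: Any   # input formed by phi
    y: Any   # agent output
    o: Any   # emitted observable

def run_episode(h0, select_g, form_phi, policies, update_u, horizon=3):
    """Execute select -> form input -> run agent -> update context -> emit."""
    h, trace = h0, []
    for _ in range(horizon):
        a = select_g(h)               # selection g
        x = form_phi(h)               # input formation phi
        y = policies[a](x)            # agent policy pi_a
        h = update_u(h, a, x, y)      # context update u
        trace.append(Step(a, x, y, o=y))  # observable: here, just the output
    return h, trace

# Toy run: one agent, context is the list of past outputs.
h_final, trace = run_episode(
    h0=[],
    select_g=lambda h: "worker",            # trivial selection
    form_phi=lambda h: len(h),              # input = steps taken so far
    policies={"worker": lambda x: x + 1},   # toy policy
    update_u=lambda h, a, x, y: h + [y],    # append output to context
)
```

Editing any one of the four callables (and replaying) is what "repair edits one of these" means in code.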
Observed trace vs omitted state
The survey contrasts partial-observability traces (outputs only) with full-observability traces that include inputs, prompts, and environment state.
Diagnosis can only use what was preserved at run time. Whatever the trace omits becomes a guess that taxonomy, attribution, and repair will inherit.
Trace quality is an engineering choice that bounds every later layer: define what evidence you need for taxonomy and attribution before you decide what the trace must capture.
Engineering reading of the five dimensions
Failure Taxonomy
What it is: A shared map of what kind of failure a trajectory exhibits.
Why it matters: The category chooses the diagnostic search space (plan, memory, handoff, verifier, environment, runtime).
What it enables next: Targeted attribution and repair instead of one-size-fits-all prompt tweaks.
What a taxonomy must make explicit
For a failed trajectory T, a taxonomy should assign τ(T, e) → (view, failure type, evidence span, repair hint), where e is the trace evidence that supports the class.
View: Which lens the failure is read through (phase, capability module, system/interaction, domain).
Failure type: A short, repair-oriented label that distinguishes this failure from other categories.
Evidence span: The trace segment that justifies the label (plan step, memory update, handoff, verifier event, environment effect).
Repair hint: What a developer should look at next: prompt, context update, workflow graph, verifier, tool, or supervisor.
The important choice is not the label name. It is what evidence the label asks a developer to inspect and what repair it makes plausible.
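A minimal encoding of the assignment tuple, with a hypothetical label vocabulary (the field values below are illustrative, not a vocabulary the survey defines):

```python
from typing import NamedTuple

class TaxonomyLabel(NamedTuple):
    view: str             # phase | capability | system | domain
    failure_type: str     # short, repair-oriented label
    evidence_span: tuple  # (start_step, end_step) in the trace
    repair_hint: str      # prompt | context_update | workflow | verifier | tool | supervisor

# Hypothetical example: a handoff dropped context between steps 4 and 6.
label = TaxonomyLabel(
    view="system",
    failure_type="handoff_context_loss",
    evidence_span=(4, 6),
    repair_hint="context_update",
)
```

The evidence span and repair hint are the load-bearing fields: they tell a developer what to read and what to edit.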
Four ways to choose the diagnostic search space
Four complementary perspectives the survey identifies for organizing failure taxonomies.
Highlight papers
Taxonomy becomes a repair decision.
01 Lu: phase boundaries
Cut the trace into planning / execution / response and label faults per phase.
06 MAST: MAS organization
Sort failures by specification, inter-agent alignment, or verification gaps.
04 AgentRx: unrecoverable labels
Tie labels to violated constraints at the earliest unrecoverable step.
07 Aegis-Song: environment-shaped failures
Classify exploration, exploitation, and resource-exhaustion failures of agent–environment interaction.
01 Lu — phase boundaries
Claim: Align labels with the execution phase where evidence first appears: planning, execution, or response generation.
Trace lens: The unit of analysis is a phase-delimited run log, not a single final answer.
Developer takeaway: Add phase boundaries to traces before building dashboards or evaluators.
06 MAST — multi-agent organization
Claim: Shift taxonomy from a single-agent timeline to the organizational structure of multi-agent systems.
Trace lens: Evidence is not only what one agent did, but whether role specification, inter-agent alignment, or task verification failed.
Developer takeaway: If the trace has handoffs and role contracts, classify coordination before changing prompts; the repair may be topology or verifier design.
04 AgentRx — unrecoverable labels
Claim: Make taxonomy repair-oriented by tying labels to evidence-backed constraints and the first unrecoverable critical failure.
Trace lens: The trace is checked step by step against policies, tool schemas, prefix constraints, and task-specific requirements.
Developer takeaway: Convert informal “agent rules” into checkable constraints; labels become useful when they explain why no later step could recover.
AgentRx — Stage 1: constraint synthesis
Turn vague “agent rules” into checkable constraints anchored at each step.
Global constraints built once from the tool schema and domain policy — schema-valid invocation, declared policy compliance.
Dynamic constraints built per step from the task instruction and observed prefix — consistency with the latest tool output, prefix-implied obligations.
Guarded evaluation — each constraint runs only when its precondition fires; checks are programmatic when possible, semantic (LLM-judged) otherwise.
Output: a step-keyed map from constraint to satisfied / violated / not-applicable.
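A hedged sketch of guarded constraint evaluation. The Constraint shape and the example schema rule are assumptions, and the semantic (LLM-judged) path is omitted; only the programmatic path is shown.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    precondition: Callable[[dict], bool]  # guard: does the check apply here?
    check: Callable[[dict], bool]         # programmatic check when possible

def evaluate(trace: list[dict], constraints: list[Constraint]) -> dict:
    """Step-keyed map from (step, constraint) to a status."""
    report = {}
    for i, step in enumerate(trace):
        for c in constraints:
            if not c.precondition(step):
                report[(i, c.name)] = "not_applicable"
            elif c.check(step):
                report[(i, c.name)] = "satisfied"
            else:
                report[(i, c.name)] = "violated"
    return report

# Global constraint from a hypothetical tool schema: search needs a query arg.
schema_valid = Constraint(
    name="schema_valid_invocation",
    precondition=lambda s: s.get("tool") == "search",
    check=lambda s: "query" in s.get("args", {}),
)

trace = [
    {"tool": "search", "args": {"query": "llm agents"}},
    {"tool": "search", "args": {}},          # missing required arg
    {"tool": "final_answer", "args": {}},    # guard does not fire
]
report = evaluate(trace, [schema_valid])
```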
AgentRx — Stage 2: validation log
Compress violations so the judge reads evidence, not the whole trace.
Record only violations, not the whole trace — keeps the log compact.
Attach supporting evidence to every violation — the tool output, prefix span, or policy clause that triggered it.
Step-keyed and auditable — the judge can trace a final failure backward through dependent violations.
Output: a validation log of step, broken constraint, and supporting evidence per violation.
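The compression step can be sketched as follows, assuming the Stage-1 report maps (step, constraint) to a status; the record shape is illustrative.

```python
def build_validation_log(step_report: dict, trace: list[dict]) -> list[dict]:
    """Keep only violations, each with its step index and triggering evidence."""
    log = []
    for (step, constraint), status in sorted(step_report.items()):
        if status != "violated":
            continue  # compact: satisfied / not-applicable rows are dropped
        log.append({
            "step": step,
            "constraint": constraint,
            "evidence": trace[step],  # the tool output / prefix span at fault
        })
    return log

step_report = {
    (0, "schema_valid_invocation"): "satisfied",
    (1, "schema_valid_invocation"): "violated",
    (2, "schema_valid_invocation"): "not_applicable",
}
trace = [
    {"tool": "search", "args": {"query": "llm agents"}},
    {"tool": "search", "args": {}},
    {"tool": "final_answer", "args": {}},
]
log = build_validation_log(step_report, trace)
```

The judge downstream reads this short log, not the full trace.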
AgentRx — Stage 3: root-cause judge
Pick the first step from which the agent does not recover, and label its cause.
Find the first unrecoverable step — not the first error; the first violation that explains terminal failure.
Plan-axis labels (plan: did the agent pursue the right intent?) — instruction adherence, intent–plan mismatch, under-specified intent, unsupported intent.
Execution-axis labels (execution: did the action complete?) — invalid invocation, guardrails triggered, system failure.
Output: critical failure step, failure category, and a short rationale.
07 Aegis-Song — environment-shaped failures
Claim: Classify failures by agent–environment interaction, not by chronology or capability module.
Trace lens: Group failures into exploration (incomplete information gathering), exploitation (mis-processing of gathered information), and resource exhaustion (turn or token budget).
Developer takeaway: When the symptom is “the agent ran out of budget” or “explored too narrowly,” the right repair surface is the environment / tool interface, not the prompt.
Failure Attribution
What it is: Identify who caused the failure, when it became unrecoverable, and why the symptom appeared there.
Why it matters: Repair without attribution is guesswork; attribution narrows the component to change.
What it enables next: A concrete repair target: policy, prompt, context update, handoff, verifier, tool, or runtime controller.
When failure becomes inevitable
Let Ω(T) ∈ {0, 1} be the task outcome of trajectory T.
A step t is the failure boundary if every feasible continuation from T_{≤t} fails (the survey's "step at which failure becomes inevitable").
Target: Attribution looks for t*, the earliest such step — typically far before the visible wrong answer. In practice, methods approximate t* with labels.
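Under a hypothetical replay hook, t* can be approximated by sampling a few continuations per prefix and returning the earliest step after which none recover. The toy model below is an assumption for illustration.

```python
import random

def estimate_boundary(trace, continue_run, samples=5, seed=0):
    """Earliest step t where no sampled continuation from trace[:t+1] succeeds."""
    rng = random.Random(seed)
    for t in range(len(trace)):
        outcomes = [continue_run(trace[: t + 1], rng) for _ in range(samples)]
        if not any(outcomes):   # no sampled continuation recovers
            return t            # failure looks inevitable from here
    return None                 # failure never became inevitable in-sample

# Toy model: the run is doomed once a "corrupt" step enters the prefix;
# otherwise continuations succeed with high probability.
def continue_run(prefix, rng):
    return "corrupt" not in prefix and rng.random() < 0.9

trace = ["ok", "ok", "corrupt", "ok", "wrong_answer"]
t_star = estimate_boundary(trace, continue_run)
```

The boundary lands at the corrupting step, two steps before the visible wrong answer.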
Four ways to justify blame
Four attribution paradigms the survey identifies, ordered by analytical depth.
Highlight papers
Attribution targets and evidence.
13 Who&When: label target
Defines culprit-agent and decisive-step labels for benchmarkable attribution.
18 CHIEF: causal structure
Converts flat logs into hierarchical causal graphs and back-traces dependencies.
22 DoVer: intervention evidence
Edits the orchestrator message or plan and replays to validate the hypothesis.
13 Who&When — attribution view
Claim: Turn failure attribution into a labeled target: identify the responsible agent and the earliest decisive error step.
Trace lens: The required object is an indexed multi-agent log with agent names, step numbers, task context, and outcome evidence.
Developer takeaway: Store stable agent and step identifiers; without them, “who failed” and “when it became unrecoverable” cannot be benchmarked or audited.
18 CHIEF — causal structure
Claim: Deepen attribution by converting flat logs into hierarchical causal graphs, then backtracking through dependencies and counterfactual screens.
Trace lens: The useful record contains subtasks, agents, structured step records, handoffs, data references, loops, and tool inputs/outputs.
Developer takeaway: Preserve edges, not just events. Propagation paths explain why the visible bad step may be a downstream symptom.
CHIEF — Stage 1: causal graph construction
Make the flat trace structurally readable.
Per step → OTAR. Extend the prior Thought / Action / Result with Observation — a slot for what each step received.
Three typed edges: subtask order, agent collaboration, and step-level data flow (upstream Result → downstream Observation).
Output: a hierarchical causal graph with subtask and agent nodes.
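A minimal sketch of OTAR records and one of the three edge types (step-level data flow). The edge-inference rule here (upstream Result text appearing in a downstream Observation) is an illustrative stand-in, not CHIEF's actual matching logic.

```python
from dataclasses import dataclass

@dataclass
class OTAR:
    step: int
    agent: str
    observation: str  # what this step received
    thought: str
    action: str
    result: str

def data_flow_edges(steps: list[OTAR]) -> list[tuple[int, int, str]]:
    """Upstream Result feeding a downstream Observation becomes a typed edge."""
    edges = []
    for up in steps:
        for down in steps:
            if down.step > up.step and up.result and up.result in down.observation:
                edges.append((up.step, down.step, "data_flow"))
    return edges

steps = [
    OTAR(0, "researcher", "", "find price", "search", "price=42"),
    OTAR(1, "writer", "price=42", "draft answer", "write", "answer: 42"),
]
edges = data_flow_edges(steps)
```

Subtask-order and agent-collaboration edges would be added the same way, giving the hierarchical graph Stage 2 walks.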
CHIEF — Stage 2: oracle-guided backtracking
Build a per-subtask checklist; walk the graph through it coarse to fine.
Per-subtask oracle. A 4-line LLM-written checklist: Goal (what this phase should achieve), Pre (what must hold before it starts), Evidence (facts/tool returns to verify), Pass/Fail (falsifiable post-hoc).
Top-down walk (reverse topological): subtask fails Pass/Fail → agent OTAR violates Pre/Evidence → step breaks the checklist.
Prune subgraphs that pass their oracle; drill into the rest.
Output: Failure Candidates — narrowed, not yet proven causes.
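The pruning walk can be sketched with plain predicates standing in for the LLM-written oracles; the subtask names and step ids are illustrative.

```python
def backtrack(subtasks: dict, oracles: dict) -> list[int]:
    """subtasks: name -> step ids (topological order); oracles: name -> Pass/Fail fn."""
    candidates = []
    for name in reversed(list(subtasks)):   # reverse topological walk
        if oracles[name](subtasks[name]):
            continue                        # subgraph passes its oracle: prune
        candidates.extend(subtasks[name])   # drill into the failing subtask
    return sorted(candidates)

subtasks = {"gather": [0, 1], "compute": [2, 3], "report": [4]}
oracles = {
    "gather": lambda steps: True,    # Goal/Pre/Evidence verified
    "compute": lambda steps: False,  # Pass/Fail check fails here
    "report": lambda steps: False,   # downstream symptom, also fails
}
candidates = backtrack(subtasks, oracles)
```

The result is still over-inclusive (it keeps the downstream symptom at step 4); Stage 3's counterfactual filters are what narrow it to one tuple.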
CHIEF — Stage 3: counterfactual attribution
Filter candidates along three axes: scope, propagation, persistence.
Local (scope: did it start here?) — no upstream cause explains the bad output → origin is the step itself.
Planning-control (propagation: control loop?) — planner repeats the same plan after error signals, or executor keeps violating valid replans.
Data-flow (propagation: corrupted value?) — walk step-edges back to the earliest step where valid inputs first became wrong.
Deviation-aware (persistence: did it stick?) — drop the candidate if a later step re-satisfies the oracle.
Output: one tuple (Agent, Step, Reason).
22 DoVer — intervention evidence
Claim: Treat attribution as an experimental question: hypothesize a failure point, edit the orchestrator message or plan, and replay from that checkpoint.
Trace lens: The trace must be segmentable into trials with preserved context, checkpoints, and milestone evaluations.
Developer takeaway: Build replay hooks early; causal confidence improves when suspected failures can be validated, refuted, or marked inconclusive.
DoVer — Stage 1: trial segmentation
Cut the session so each re-plan gets its own attribution unit.
Cut at re-plan steps — a planning/re-planning event marks a new trial boundary.
One trial = one plan — the contiguous span from a planning step through everything executed under that plan, until the next re-plan.
Prompt-based, not framework-specific — generalizes to systems where re-plan markers aren't explicit.
Output: trial segments, each treated as its own attribution candidate.
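The segmentation rule can be sketched as below; the step records are illustrative, and real traces would need the prompt-based detection of re-plan markers rather than an explicit `kind` field.

```python
def segment_trials(trace: list[dict]) -> list[list[dict]]:
    """Cut at planning events: one trial = one plan plus everything under it."""
    trials, current = [], []
    for step in trace:
        if step["kind"] == "plan" and current:
            trials.append(current)   # a re-plan closes the previous trial
            current = []
        current.append(step)
    if current:
        trials.append(current)
    return trials

trace = [
    {"kind": "plan", "text": "plan A"},
    {"kind": "act", "text": "step 1"},
    {"kind": "plan", "text": "plan B (re-plan)"},
    {"kind": "act", "text": "step 2"},
    {"kind": "act", "text": "step 3"},
]
trials = segment_trials(trace)
```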
DoVer — Stage 2: hypothesise + intervene
Turn each trial into a testable edit, not a final verdict.
Per-trial hypothesis — log-based attribution names a suspected faulty agent, step, and rationale.
Treat hypothesis as testable, not authoritative — correctness is deferred to the replay.
Concrete edit at the orchestrator level — Modified Instructions to Sub-Agents or Plan Updates (never tool internals).
Output: a targeted intervention at the suspected fault point.
DoVer — Stage 3: replay + verify
Replay the edit; label outcomes along success and faithfulness.
Replay from the checkpoint (setup: same past, edited next step) — preserve all earlier state, then re-run the edited trial.
Validated (success: changed; faithfulness: followed edit) — at least 2 of 3 replays now succeed.
Partial / refuted (success: partial or unchanged; faithfulness: followed edit) — milestone progress improves, or the failure persists.
Inconclusive (faithfulness: edit not carried out) — replay cannot test the hypothesis.
Output: one validation label per trial.
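The labelling rule, sketched under the thresholds the text states (validated when at least 2 of 3 faithful replays succeed); the replay record shape is an assumption.

```python
def label_trial(replays: list[dict]) -> str:
    """Each replay reports faithfulness (edit carried out) and success."""
    faithful = [r for r in replays if r["followed_edit"]]
    if not faithful:
        return "inconclusive"        # the edit was never carried out
    successes = sum(r["succeeded"] for r in faithful)
    if successes >= 2:
        return "validated"
    return "partial_or_refuted"      # progress improved, or failure persists

replays = [
    {"followed_edit": True, "succeeded": True},
    {"followed_edit": True, "succeeded": True},
    {"followed_edit": True, "succeeded": False},
]
label = label_trial(replays)
```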
Trajectory Monitoring & Analysis Tools
What it is: The evidence and control layer that decides what a developer can see, replay, compare, and intervene on.
Why it matters: Attribution and repair can only use what the trace preserves; output-only logs force diagnosis to guess.
What it enables next: Capture exactly the trace fields that the taxonomy and attribution objectives demand.
What observability must preserve
From passive monitoring to active debugging
Passive system-level monitoring
captures logs, events, metrics, side effects
summarizes long traces and surfaces patterns
flags anomalies and suspicious trajectories
does not change the run while observing
Active interactive debugging
inspects and annotates trajectory steps
resets, edits, and replays from a checkpoint
forks runs to test counterfactual edits
steers behavior with operator interventions
Highlight papers
Evidence capture and control.
32 AgentSight: system effects
Correlates LLM intent with kernel-level subprocess, file, and network events.
36 AGDebugger: control primitives
Exposes pause, checkpoint, reset, edit, fork, and compare across the trajectory.
32 AgentSight — system effects
Claim: Expand observability below the model log by correlating LLM intent signals with subprocesses, files, network, and kernel-level effects.
Trace lens: The trace has two streams — high-level intent and low-level system actions — joined by lineage, timing, and argument matching.
Developer takeaway: Monitor what the agent actually did to the system, not only what it said it intended to do.
36 AGDebugger — control primitives
Claim: Make trajectory analysis interactive by exposing pause, checkpoint, reset, edit, fork, and comparison operations over multi-agent sessions.
Trace lens: The trace is a rewindable state machine: message history plus checkpoints that can be modified and replayed.
Developer takeaway: Debugging tools should make counterfactual inspection cheap; a developer should test whether changing one message changes downstream behavior.
AGDebugger — Stage 1: inspect + steer
Expose the live message stream so the operator can steer before failure hardens.
Live message viewer — agent-to-agent traffic visible as it happens.
Pause / play / step — drive the message queue at any granularity.
Send new messages mid-run — broadcast to all agents, or targeted to one.
Output: an inspectable, controllable message stream.
AGDebugger — Stage 2: reset + edit
Restore state before testing a counterfactual edit.
Checkpoint per message — agent state saved before each new message via save_state.
Edit historical messages inline, then reset to that timestamp — restores the corresponding checkpoint via load_state.
Fork the session — the original branch is preserved; the edited path runs as a new session.
Output: a forked session with an edit candidate, replayable from the fork point.
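A toy sketch of checkpoint-edit-fork, assuming per-message save_state/load_state semantics as the text describes. The Session class and message shape are illustrative, not AGDebugger's actual API.

```python
import copy

class Session:
    def __init__(self):
        self.messages, self.checkpoints = [], []

    def send(self, msg: str):
        # save_state: snapshot history before each new message lands
        self.checkpoints.append(copy.deepcopy(self.messages))
        self.messages.append(msg)

    def fork_with_edit(self, index: int, new_msg: str) -> "Session":
        fork = Session()
        # load_state: restore the checkpoint taken before message `index`
        fork.messages = copy.deepcopy(self.checkpoints[index])
        fork.send(new_msg)   # the edited message replaces history[index]
        return fork          # the original branch is preserved untouched

s = Session()
for m in ["plan: A", "act: 1", "act: 2"]:
    s.send(m)
forked = s.fork_with_edit(1, "act: 1-fixed")
```

Replaying the forked session from here is the cheap counterfactual test: does changing one message change downstream behavior?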
AGDebugger — Stage 3: overview compare
Compare branches so edits become visible regression tests.
Vertical timeline — every message a rectangle, every fork a new column.
Forks marked with a horizontal dash; pre-fork shared history is shown at lower opacity.
Color toggle — encode message type, sender, or recipient depending on what you’re hunting.
Output: a branched comparison view — original run vs each counterfactual run, aligned by step.
System Enhancement & Optimization
What it is: Trace-guided editing of the agent system so future runs become more capable, reliable, efficient, or robust.
Why it matters: Diagnosis only earns its keep when it leads to a tested system change rather than another postmortem label.
What it enables next: Choose the right control surface to edit: policy, selection, context update, input formation, workflow, or supervisor.
What can be optimized
The formal target is not “make the prompt better.” It is choosing the system component whose change should improve the measured objective.
Three places a trace can change the system
Three enhancement families the survey identifies, by where the edit lands.
Highlight papers
Where the repair happens.
24 Maestro: graph-plus-config repair
Jointly searches workflow-graph and configuration edits from evaluator feedback.
31 SupervisorAgent: runtime control
Approves, guides, or corrects at risky interaction boundaries during execution.
07 Aegis-Song: environment optimization
Fixes failures by editing the environment and tool interface, not the agent prompt.
24 Maestro — graph-plus-config repair
Claim: Frame enhancement as joint optimization over workflow graph structure and configuration, guided by trace feedback.
Trace lens: Failures are evidence about missing computation, routing, validation, state, or tool operations in a typed agent graph.
Developer takeaway: Do not tune prompts when the graph lacks the operation needed to recover; add the missing node, edge, validator, or state variable.
31 SupervisorAgent — runtime control
Claim: Add a lightweight meta-agent that watches high-risk interactions and intervenes while the trajectory is still recoverable.
Trace lens: The supervisor observes agent-agent, agent-tool, and agent-memory events with local and global context summaries.
Developer takeaway: Put supervision at interaction boundaries: approve, guide, correct observations, or run verification before errors become irreversible.
07 Aegis-Song — environment optimization
Claim: Repair agent failures by editing the environment, not the prompt: enhance observability, offload deterministic computation, and speculate bundled actions.
Trace lens: When traces show exploration/exploitation/resource-exhaustion failures, the implicated control surface is the environment + tool interface, not π_a.
Developer takeaway: Before tuning prompts, ask: would the agent succeed if the environment exposed more state, did the deterministic work itself, or accepted batched actions?
Datasets & Benchmarks
What they are: The field’s measurement and training substrate: what counts as progress, and what diagnostic models learn from.
Why they matter: Benchmark labels become incentives; if they reward only agent/step accuracy, repair utility is invisible.
What they enable next: Compare attribution and enhancement methods, and run regression suites on real or injected failures.
What benchmark design optimizes for
Open critique: Agent-level and step-level accuracy are useful, but they do not fully measure whether attribution helps produce a safe, durable repair. The survey reports limited step-level attribution on established comparisons and points to benchmark diversity and observability as bottlenecks.
Two ways to build trajectory evaluation data
Real-world failure collections
preserve naturally occurring failures from real or realistic systems
reflect messy production conditions and repair needs
expensive to collect and annotate at scale
biased toward what was actually deployed
Synthetic error-injection datasets
start from successful trajectories and inject controlled faults
scale to thousands of labelled examples cheaply
support training data-hungry attribution models
must be checked for realism vs production failures
Highlight papers
Evaluation roles.
13 Who&When
Canonical agent + decisive-step labels for benchmarkable attribution.
23 TraceElephant
Full traces and reproducible environments for replay-ready evaluation.
13 Who&When — benchmark view
Claim: As a benchmark, operationalize attribution as two measurable labels: culprit agent and decisive error step.
Trace lens: The dataset turns trace reading into comparable evaluation across alternative LLM-judge strategies.
Developer takeaway: Use this benchmark lens to test whether your traces expose enough indexing and context for reproducible blame assignment.
23 TraceElephant — benchmark view
Claim: Test attribution under full observability, including complete traces and reproducible environments.
Trace lens: The benchmark contrasts output-only attribution with full-trace static and dynamic replay/counterfactual probing.
Developer takeaway: Evaluate attribution under the observability conditions developers actually have; partial traces can understate achievable diagnosis.
04 AgentRx — benchmark view
Claim: As a benchmark, AgentRx packages 115 failed trajectories with three repair-relevant labels: critical step, root-cause category, and supporting evidence.
Trace lens: Each failed run is annotated by category, decisive step, and a step-indexed validation log of violated constraints.
Developer takeaway: Use this benchmark when the question is “does my attribution method enable a repair?” — not just “does it match the agent/step label?”
Still research-stage: SOTA exact step-level attribution ≈ 30% (CHIEF on Who&When); full-observability ceiling ≈ 33% (TraceElephant).
For builders: no turnkey tool yet — pick the right axis, expect partial coverage, build the loop incrementally.
Closed-loop agent improvement
Engineering synthesis over the survey: collect enough evidence, constrain diagnosis, approximate the causal boundary, edit the right component, and evaluate whether the edit improves future systems.
Current limitations and open problems
Attribution accuracy is not enough — agent-level and step-level metrics do not necessarily measure repair utility.
Benchmark observability can be mismatched — partial traces understate what developers actually see in internal debugging.
Tools remain fragmented — logging, visualization, replay, anomaly detection, RCA, and repair live in separate systems.
Domain structure is underused — coding, GUI, web, and embodied agents expose different trace shapes and failure modes.
Closing thesis
LLM agent trajectory analysis is becoming the engineering layer that connects evaluation, debugging, system optimization, observability, and operations.
The trajectory is the evidence. Taxonomy gives the diagnostic lens. Attribution identifies the causal boundary. Enhancement edits the system. Monitoring and benchmarks make the loop repeatable.
For agent developers, the practical question is no longer “did this run fail?” It is: what evidence proves why it failed, what component should change, and how do we know the repair generalizes?
Appendix
Each appendix slide summarizes one referenced paper.
The template is: background problem, fundamental idea, developer/evaluation takeaway, and survey areas.
Numbering note: The two-digit prefixes used in the deck (e.g. 01 Lu, 13 Who&When) follow our local analysis-note IDs, not the survey’s Table 1 IDs.
01. Exploring Autonomous Agents
Failure Taxonomy
Background problem: Success rate hides which agent role failed when planner, code generator, executor, or final answer hand off work.
Fundamental idea: Inspect full run logs from 204 executions, label failures into 19 causes under planning, execution, and response-generation phases.
Takeaway: Store per-iteration prompts, code, outputs, and errors, then route repeated failures to replanning, local repair, or early stop.
Background problem: Early agent mistakes cascade, but prior failure studies label errors without tracing the root cause or enabling fixes.
Fundamental idea: AgentDebug labels each step/module, uses counterfactual tests to find the earliest failure-causing step, then re-rolls out with targeted feedback.
Takeaway: Debug failed agents from the first causal step, not every visible mistake; use AgentErrorBench-style annotations to test localization and recovery.
Background problem: In platform agent workflows, visible errors often surface far from the causal node after prompts, tools, and control logic interact.
Fundamental idea: AgentFail labels 307 Dify/Coze failures by root location, cause level/category, propagation distance, and repair strategy.
Takeaway: Debug by proving the earliest decisive node, then apply cause-matched fixes; taxonomy plus location made repairs safer.
Background problem: Agents fail differently across DB, filesystem, CRM, and medical environments: missing state, losing state, miscomputing outputs, violating rules, exhausting turns.
Fundamental idea: Treat tools as reliability infrastructure: expose lookahead/state, offload sorting/calculation/rule checks, and speculate common follow-up calls.
Takeaway: Don’t only tune the agent; redesign tool responses so correct behavior becomes retrieval, validation, or bundled execution.
Background problem: Sequential agent pipelines hide where failures begin; final-output scoring cannot separate planner mistakes from executor or critic harm.
Fundamental idea: Record P/E/C answers, final answer, repair flags, harm flags, and earliest unrepaired error origin.
Takeaway: Design handoffs around computable accountability: each stage should expose whether it repaired, preserved, or damaged the prior state.
Background problem: Exact failure-step attribution in long multi-agent logs overwhelms general LLMs, especially when failures appear late or require MAS-specific expertise.
Fundamental idea: First shrink the log to a small suspect scope, then localize; scopes come from stepwise expansion or Overstep/Loop expert heuristics.
Takeaway: On Who&When, SDBL raises step accuracy by up to 24.27 percentage points, showing scoped diagnosis beats one-shot attribution.
Background problem: Flat logs hide long-range error propagation; local windows miss causally relevant earlier turns.
Fundamental idea: Build four positional context layers, run specialist analyst personas, and aggregate agent/step votes with confidence weighting and disagreement checks.
Takeaway: Use compressed context plus voting when trace length matters; unlike CHIEF graphs or RAFFLES loops, ECHO is positional, not causal.
Background problem: Single-pass judges miss decisive faults in long agent traces where early mistakes, symptoms, and recoverable errors blur together.
Fundamental idea: A Judge proposes a fault step; specialized Evaluators score fault, primacy, decisiveness, and log consistency, feeding memory into retries.
Takeaway: For trace debugging, use confidence-gated verifier loops, not one-shot attribution; stop when evaluators agree or max iterations expire.
Background problem: MAS failures often look downstream; log/data-flow analysis blames the final symptom, not the upstream bad decision.
Fundamental idea: Reverse data-flow into a performance-causality graph, then use Shapley, ACE, and counterfactual repairs to rank agents and steps.
Takeaway: Read for causal MAS debugging: strongest idea is performance causal inversion; evidence improves attribution but remains imperfect on hard traces.
Background problem: Temporal trace attribution confuses late failure symptoms with earlier corrupted information sources in multi-agent deep search.
Fundamental idea: Build an Information Dependency Graph: nodes are produced information, edges are explicit citation/reliance links across agent turns.
Takeaway: Debug agents by tracing provenance paths, not timelines; target high-impact source nodes and dependency conflicts for realistic failure tests.
Background problem: Manual MAS failure labels are tiny, so attribution models lack enough agent/error-mode pairs to learn root-cause patterns.
Fundamental idea: Start from successful trajectories, inject MAST-style faults via prompt injection or response corruption, then keep failed runs as labeled counterfactuals.
Takeaway: Aegis turns debugging data into controlled perturbation: 9,533 failures train Qwen/DCL/GRPO models and benchmark agent-error attribution.
Background problem: Output-only attribution hides prompts, task context, tool logs, and environment state, making the decisive failure step ambiguous.
Fundamental idea: TraceElephant packages 220 annotated failures as JSON step records plus runnable MAS environments for replay and counterfactual checks.
Takeaway: Log agent inputs and tool interactions, not just outputs; full traces lift step attribution to 30.3%, dynamic replay to 33.3%.
Background problem: Scalar workflow scores collapse rich failure traces, so global search like MaAS/AFlow misses recurring structural error modes.
Fundamental idea: CE-Graph clusters counterexamples by failing node and error semantics, then verifies RevisePrompt, InsertNode, or DeleteNode graph edits.
Takeaway: Optimize agent workflows by reducing dense failure modes, not by blindly searching for higher aggregate benchmark scores.
Background problem: Deployed agents are nondeterministic, stateful systems; traditional logs and dashboards miss planning, memory, tool, and coordination failures.
Fundamental idea: Define AgentOps as observe behavior, collect metrics, detect issues, identify root causes, recommend fixes, and automate runtime operations.
Takeaway: Treat agent reliability as production operations: instrument trajectories, compare healthy/failing traces, and close safe feedback loops with automated remediation.
Background problem: Trajectory replays show what happened, but rarely score how agents explore, plan, read observations, or verify progress.
Fundamental idea: Score trajectories on five agentic competencies, then expose patterns through CLI output, dashboard plots, word clouds, and navigation timelines.
Takeaway: Debugging tools should turn raw trajectories into selectable behavioral signals that support diagnosis and training-data filtering.
Background problem: SE agents leave TAR logs, but failures hide in repeated actions, untested fixes, ignored results, and thought-action mismatches.
Fundamental idea: Normalize 120 RepairAgent, AutoCodeRover, and OpenHands trajectories; categorize actions, mine 4-grams, and open-code semantic TAR relations.
Takeaway: Successful agents balance explore/fix/test; robust agents should flag repetition, fix-without-test, premature termination, and result-insensitive next actions.
Background problem: Agent plans can be wrong for the task or structurally incoherent, and LLM judges are too slow for runtime screening.
Fundamental idea: Train a Siamese GRU autoencoder on task/trajectory pairs, combining contrastive alignment with sequence reconstruction anomaly signals.
Takeaway: Use learned trajectory guards for fast pre-execution checks; escalate uncertain or long-horizon cases to heavier judges.