Auto Research

Run bounded autonomous code experiments on a user-provided repository with a stable automated metric. Use when the user wants iterative improve-measure-keep-or-revert loops such as tuning training scripts, benchmark solvers, evaluable agent workflows, or performance/code-quality experiments. Do not use for open-ended product development, tasks without a reliable automated metric, or broad multi-file refactors.

autoresearch

Run a constrained experiment loop on a repository: edit allowed files, run a command, extract a metric, keep the change if it improves the score, otherwise revert it.

This skill is for bounded optimization, not open-ended software development.

When to use

Use this skill when the task can be framed as:

  • a repository or working directory is known
  • the editable surface is explicitly bounded
  • there is a stable command to run
  • there is a stable metric to extract
  • success means improving that metric over a limited number of experiments

Good fits:

  • training-script tuning
  • benchmark or solver optimization
  • inference/runtime optimization with a measurable score
  • prompt/program search when evaluation is automated
  • agent workflow tuning with a fixed harness

Do not use when:

  • the user says only “make the project better”
  • there is no automated evaluation command
  • the metric is subjective or manual
  • the work requires wide architectural refactors
  • the best path is regular engineering, not iterative experiment search

Required inputs

Collect or infer these before starting:

  • repo_path: repository or working directory
  • goal: short statement of what is being optimized
  • editable_files: explicit allowlist of files the agent may change
  • run_command: command that runs one experiment
  • metric_name: the metric to optimize
  • metric_direction: min or max
  • max_experiments: hard cap on experiment count

Optional inputs

Use these when provided or clearly useful:

  • fixed_files: read for context but never modify
  • metric_extract_hint: grep pattern, regex, JSON path, or log line hint
  • timeout_seconds: per-run timeout
  • branch_name: experiment branch name
  • results_file: default results.tsv
  • baseline_required: default true
  • allow_dependencies: default false
  • allow_harness_changes: default false
  • session_report_file: default session_report.md
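
Taken together, the required and optional inputs above can be captured in one configuration object. This is an illustrative sketch, not part of the skill itself; the class name is hypothetical, and the defaults simply mirror the lists above.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    # Required inputs
    repo_path: str
    goal: str
    editable_files: list
    run_command: str
    metric_name: str
    metric_direction: str            # "min" or "max"
    max_experiments: int
    # Optional inputs with the defaults listed above
    fixed_files: list = field(default_factory=list)
    metric_extract_hint: str = ""
    timeout_seconds: int = 3600      # assumed default; set per task
    branch_name: str = "autoresearch"
    results_file: str = "results.tsv"
    baseline_required: bool = True
    allow_dependencies: bool = False
    allow_harness_changes: bool = False
    session_report_file: str = "session_report.md"
```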

Output artifacts

Produce or maintain these when practical:

  • results.tsv or user-specified results file
  • one log per experiment or a rolling run.log
  • session_report.md with best result, kept changes, discarded themes, and crash notes
  • best commit hash or diff summary if git is available

Core workflow

1. Bound the search space

Identify the optimization surface first.

  • Read repository context and any user-provided instructions.
  • Read all fixed_files and editable_files.
  • Confirm that the metric and run command are stable enough to compare runs.
  • If the task lacks a stable metric, do not force this skill. Switch to normal engineering.

2. Prepare the experiment state

  • If the repo uses git, create or switch to a dedicated branch.
  • Record the starting commit or working tree state.
  • Create results.tsv with a tab-separated header if it does not exist.
  • Prefer this header:
commit	metric	status	description

If memory/runtime is an important side metric, extend the header instead of inventing multiple side logs.
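
The preparation step might look like this minimal sketch (function and branch names are illustrative): create the branch only when a git repository is present, and write the tab-separated header exactly once.

```python
import os
import subprocess

RESULTS_FILE = "results.tsv"
HEADER = "commit\tmetric\tstatus\tdescription\n"

def prepare_state(branch="autoresearch"):
    # Create or switch to a dedicated experiment branch if git is in use.
    if os.path.isdir(".git"):
        subprocess.run(["git", "checkout", "-B", branch], check=True)
    # Write the tab-separated header once; later runs append rows.
    if not os.path.exists(RESULTS_FILE):
        with open(RESULTS_FILE, "w") as f:
            f.write(HEADER)
```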

3. Establish a baseline

Run the unmodified system first unless the user explicitly says not to.

  • Capture stdout/stderr into a log file.
  • Extract the metric.
  • Record the baseline row in results.tsv.
  • If no valid baseline can be established, stop and explain why.
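
A baseline run can be sketched as a single helper that redirects stdout/stderr into a log file and treats a timeout as an invalid run. The function name and the timeout default are assumptions, not part of the skill contract.

```python
import subprocess

def run_baseline(cmd, log_path="baseline.log", timeout=3600):
    # Run the unmodified system once, sending stdout and stderr to a log
    # so the metric can be re-extracted later without re-running.
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(cmd, shell=True, stdout=log,
                                  stderr=subprocess.STDOUT, timeout=timeout)
        except subprocess.TimeoutExpired:
            return None  # a timed-out baseline is an invalid baseline
    return proc.returncode
```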

4. Run a bounded loop

Repeat until max_experiments is reached or progress clearly stalls.

For each experiment:

  1. Choose one focused idea.
  2. Edit only files in editable_files.
  3. If using git, commit the candidate change before running.
  4. Run the experiment with log capture and timeout.
  5. Extract the metric using the strongest available method.
  6. Compare against the current best.
  7. Record a row in results.tsv.
  8. Keep the change if it improves the metric enough to justify complexity.
  9. Otherwise revert to the prior best state.

Prefer one change theme per run. Avoid bundling unrelated edits into one experiment.
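
The loop above can be sketched as a small driver that takes one apply/revert pair per idea; `evaluate` stands in for running the command and extracting the metric, and every name here is illustrative rather than prescribed.

```python
def experiment_loop(ideas, evaluate, baseline, direction="min"):
    """Bounded keep-or-revert search.

    ideas:    list of (name, apply_fn, revert_fn) tuples, one idea per run
    evaluate: runs the experiment command and returns the metric (or None)
    baseline: metric of the current best state
    """
    best = baseline
    history = []
    for name, apply_fn, revert_fn in ideas:
        apply_fn()                       # edit only allowed files
        score = evaluate()
        better = (score is not None and
                  (score < best if direction == "min" else score > best))
        history.append((name, score, "kept" if better else "reverted"))
        if better:
            best = score                 # new frontier
        else:
            revert_fn()                  # return to the prior best state
    return best, history
```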

5. End with a report

Write a concise session_report.md including:

  • goal and metric
  • baseline result
  • best result
  • kept experiments
  • discarded themes
  • crash summary
  • recommended next experiments

Metric extraction strategy

Use the most reliable method available, in this order:

  1. structured output already produced by the program
  2. exact grep target or fixed log line
  3. regex parsing from logs
  4. small helper parsing script only if necessary

Cross-check anomalies:

  • if the metric is missing, inspect the tail of the log
  • if the metric is implausible, verify the run actually completed
  • if outputs are noisy, prefer dedicated log redirection over reading streamed console output
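
The ordered strategy can be sketched as one extractor that tries structured JSON lines first and falls back to a regex over the raw log. The pattern shown is an assumption about the log format, not a guarantee.

```python
import json
import re

def extract_metric(log_text, name="score"):
    # Prefer structured output: the last line that parses as a JSON
    # object containing the metric.
    for line in reversed(log_text.splitlines()):
        try:
            obj = json.loads(line)
            if name in obj:
                return float(obj[name])
        except (ValueError, TypeError):
            pass
    # Fall back to a regex over the raw log; take the last occurrence.
    matches = re.findall(rf"{name}[=:\s]+(-?[0-9.]+)", log_text)
    return float(matches[-1]) if matches else None
```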

Keep / revert rules

Default decision rule:

  • keep if metric improves in the target direction
  • revert if metric is equal or worse

But also apply a simplicity filter:

  • a tiny gain with ugly complexity is usually not worth keeping
  • an equal result with simpler code can be worth keeping
  • if a change weakens maintainability or breaks clarity, require a clearly meaningful metric win
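
One way to encode the default rule plus the simplicity filter is a small predicate that demands a larger margin when the change adds complexity. The 1% margin below is purely illustrative; pick a threshold that fits the metric's noise level.

```python
def keep_change(new, best, direction="min",
                min_gain=0.0, added_complexity=False):
    # Reject runs with no extracted metric outright.
    if new is None:
        return False
    gain = (best - new) if direction == "min" else (new - best)
    # Illustrative simplicity filter: a complexity-adding change must
    # beat the best result by at least 1% of its magnitude.
    threshold = max(min_gain, abs(best) * 0.01) if added_complexity else min_gain
    return gain > threshold
```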

Validation checklist

Before accepting a winning run, check:

  • only allowed files changed
  • command completed within the allowed time
  • metric was extracted from the intended source
  • result is better than the previous best by the configured direction
  • no obvious crash, NaN, or partial-run artifact was mistaken for success
  • repo is left at the kept frontier, not at a discarded candidate
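
The "only allowed files changed" check can be automated with git, assuming the last kept state is committed at HEAD. Both function names are illustrative.

```python
import subprocess

def changed_files():
    # Files modified relative to the last kept commit (HEAD).
    out = subprocess.run(["git", "diff", "--name-only", "HEAD"],
                         capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]

def only_allowed_changed(editable_files):
    # Every modified path must be on the explicit allowlist.
    return all(path in editable_files for path in changed_files())
```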

Guardrails

Do

  • keep the experiment scope narrow
  • require a stable metric and automated command
  • prefer one hypothesis per run
  • preserve the evaluation harness unless explicitly allowed
  • revert losers quickly
  • log crashes and near-misses
  • cap the number of experiments unless the user explicitly requests unattended continuation

Don’t

  • do not run unbounded “never stop” loops by default
  • do not modify files outside editable_files
  • do not redefine the metric unless the user explicitly asks for co-design
  • do not add dependencies unless allowed
  • do not keep complexity-heavy changes for negligible gains
  • do not confuse “program runs” with “experiment succeeded”

Only if

  • run unattended for a long time only if the user explicitly requests it
  • modify multiple files only if they are explicitly in scope
  • change harness or benchmark only when the user wants joint optimization of system and evaluator
  • use helper scripts only when direct tooling/parsing is too brittle

Failure modes and recovery

Crash or timeout

  • inspect the log tail first
  • try a quick, obvious fix once or twice
  • if the idea itself seems broken, log crash and revert
  • do not spend many experiments debugging one bad direction

Missing metric

  • verify the command actually finished
  • inspect the log for formatting drift
  • try a fallback extraction method
  • if still unavailable, mark the run invalid and revert

High metric variance

  • warn the user that the harness may be too noisy
  • consider re-running the best candidate once for confirmation
  • avoid over-claiming progress on tiny deltas

Dirty repo state

  • isolate experiment artifacts from source changes
  • avoid committing result logs unless the user wants them versioned
  • if the worktree becomes confusing, reset to the last known-good frontier before continuing

Minimal examples

Example 1: training script

User request:

Use autoresearch on this repo. Editable file is train.py. Run uv run train.py. Minimize val_bpb. Try 8 experiments.

Interpretation:

  • editable_files=["train.py"]
  • run_command="uv run train.py"
  • metric_name="val_bpb"
  • metric_direction="min"
  • max_experiments=8

Example 2: benchmark solver

User request:

Optimize this benchmark project. You may edit solver.py and config.py. Run python bench.py. Maximize score. Keep it to 6 experiments.

Interpretation:

  • run baseline
  • test one optimization idea per run
  • keep only score-improving or simplification-improving changes

Notes for use in this agent environment

  • Use manage_task for the experiment workflow.
  • Use concise progress updates for long runs.
  • Prefer existing tools and skills over large custom scripts.
  • If the task turns into ordinary coding work rather than bounded search, stop using this skill and switch modes.
