Auto Research

Run bounded autonomous code experiments on a user-provided repository with a stable automated metric. Use when the user wants iterative improve-measure-keep-or-revert loops such as tuning training scripts, benchmark solvers, evaluable agent workflows, or performance/code-quality experiments. Do not use for open-ended product development, tasks without a reliable automated metric, or broad multi-file refactors.

autoresearch

Run a constrained experiment loop on a repository: edit allowed files, run a command, extract a metric, keep the change if it improves the score, otherwise revert it.

This skill is for bounded optimization, not open-ended software development.

When to use

Use this skill when the task can be framed as:

  • a repository or working directory is known
  • the editable surface is explicitly bounded
  • there is a stable command to run
  • there is a stable metric to extract
  • success means improving that metric over a limited number of experiments

Good fits:

  • training-script tuning
  • benchmark or solver optimization
  • inference/runtime optimization with a measurable score
  • prompt/program search when evaluation is automated
  • agent workflow tuning with a fixed harness

Do not use when:

  • the user says only “make the project better”
  • there is no automated evaluation command
  • the metric is subjective or manual
  • the work requires wide architectural refactors
  • the best path is regular engineering, not iterative experiment search

Required inputs

Collect or infer these before starting:

  • repo_path: repository or working directory
  • goal: short statement of what is being optimized
  • editable_files: explicit allowlist of files the agent may change
  • run_command: command that runs one experiment
  • metric_name: the metric to optimize
  • metric_direction: min or max
  • max_experiments: hard cap on experiment count

Optional inputs

Use these when provided or clearly useful:

  • fixed_files: read for context but never modify
  • metric_extract_hint: grep pattern, regex, JSON path, or log line hint
  • timeout_seconds: per-run timeout
  • branch_name: experiment branch name
  • results_file: default results.tsv
  • baseline_required: default true
  • allow_dependencies: default false
  • allow_harness_changes: default false
  • session_report_file: default session_report.md
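
Taken together, the required and optional inputs above can be captured in one configuration object. This is an illustrative sketch, not part of the skill itself; the class name is hypothetical, and the defaults simply mirror the lists above.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    # Required inputs
    repo_path: str
    goal: str
    editable_files: list
    run_command: str
    metric_name: str
    metric_direction: str            # "min" or "max"
    max_experiments: int
    # Optional inputs with the defaults listed above
    fixed_files: list = field(default_factory=list)
    metric_extract_hint: str = ""
    timeout_seconds: int = 3600      # assumed default; set per task
    branch_name: str = "autoresearch"
    results_file: str = "results.tsv"
    baseline_required: bool = True
    allow_dependencies: bool = False
    allow_harness_changes: bool = False
    session_report_file: str = "session_report.md"
```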

Output artifacts

Produce or maintain these when practical:

  • results.tsv or user-specified results file
  • one log per experiment or a rolling run.log
  • session_report.md with best result, kept changes, discarded themes, and crash notes
  • best commit hash or diff summary if git is available

Core workflow

1. Bound the search space

Identify the optimization surface first.

  • Read repository context and any user-provided instructions.
  • Read all fixed_files and editable_files.
  • Confirm that the metric and run command are stable enough to compare runs.
  • If the task lacks a stable metric, do not force this skill. Switch to normal engineering.

2. Prepare the experiment state

  • If the repo uses git, create or switch to a dedicated branch.
  • Record the starting commit or working tree state.
  • Create results.tsv with a tab-separated header if it does not exist.
  • Prefer this header:
commit	metric	status	description

If memory/runtime is an important side metric, extend the header instead of inventing multiple side logs.
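
The preparation step might look like this minimal sketch (function and branch names are illustrative): create the branch only when a git repository is present, and write the tab-separated header exactly once.

```python
import os
import subprocess

RESULTS_FILE = "results.tsv"
HEADER = "commit\tmetric\tstatus\tdescription\n"

def prepare_state(branch="autoresearch"):
    # Create or switch to a dedicated experiment branch if git is in use.
    if os.path.isdir(".git"):
        subprocess.run(["git", "checkout", "-B", branch], check=True)
    # Write the tab-separated header once; later runs append rows.
    if not os.path.exists(RESULTS_FILE):
        with open(RESULTS_FILE, "w") as f:
            f.write(HEADER)
```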

3. Establish a baseline

Run the unmodified system first unless the user explicitly says not to.

  • Capture stdout/stderr into a log file.
  • Extract the metric.
  • Record the baseline row in results.tsv.
  • If no valid baseline can be established, stop and explain why.
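
A baseline run can be sketched as a single helper that redirects stdout/stderr into a log file and treats a timeout as an invalid run. The function name and the timeout default are assumptions, not part of the skill contract.

```python
import subprocess

def run_baseline(cmd, log_path="baseline.log", timeout=3600):
    # Run the unmodified system once, sending stdout and stderr to a log
    # so the metric can be re-extracted later without re-running.
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(cmd, shell=True, stdout=log,
                                  stderr=subprocess.STDOUT, timeout=timeout)
        except subprocess.TimeoutExpired:
            return None  # a timed-out baseline is an invalid baseline
    return proc.returncode
```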

4. Run a bounded loop

Repeat until max_experiments is reached or progress clearly stalls.

For each experiment:

  1. Choose one focused idea.
  2. Edit only files in editable_files.
  3. If using git, commit the candidate change before running.
  4. Run the experiment with log capture and timeout.
  5. Extract the metric using the strongest available method.
  6. Compare against the current best.
  7. Record a row in results.tsv.
  8. Keep the change if it improves the metric enough to justify complexity.
  9. Otherwise revert to the prior best state.

Prefer one change theme per run. Avoid bundling unrelated edits into one experiment.
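
The loop above can be sketched as a small driver that takes one apply/revert pair per idea; `evaluate` stands in for running the command and extracting the metric, and every name here is illustrative rather than prescribed.

```python
def experiment_loop(ideas, evaluate, baseline, direction="min"):
    """Bounded keep-or-revert search.

    ideas:    list of (name, apply_fn, revert_fn) tuples, one idea per run
    evaluate: runs the experiment command and returns the metric (or None)
    baseline: metric of the current best state
    """
    best = baseline
    history = []
    for name, apply_fn, revert_fn in ideas:
        apply_fn()                       # edit only allowed files
        score = evaluate()
        better = (score is not None and
                  (score < best if direction == "min" else score > best))
        history.append((name, score, "kept" if better else "reverted"))
        if better:
            best = score                 # new frontier
        else:
            revert_fn()                  # return to the prior best state
    return best, history
```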

5. End with a report

Write a concise session_report.md including:

  • goal and metric
  • baseline result
  • best result
  • kept experiments
  • discarded themes
  • crash summary
  • recommended next experiments

Metric extraction strategy

Use the most reliable method available, in this order:

  1. structured output already produced by the program
  2. exact grep target or fixed log line
  3. regex parsing from logs
  4. small helper parsing script only if necessary

Cross-check anomalies:

  • if the metric is missing, inspect the tail of the log
  • if the metric is implausible, verify the run actually completed
  • if outputs are noisy, prefer dedicated log redirection over reading streamed console output
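
The ordered strategy can be sketched as one extractor that tries structured JSON lines first and falls back to a regex over the raw log. The pattern shown is an assumption about the log format, not a guarantee.

```python
import json
import re

def extract_metric(log_text, name="score"):
    # Prefer structured output: the last line that parses as a JSON
    # object containing the metric.
    for line in reversed(log_text.splitlines()):
        try:
            obj = json.loads(line)
            if name in obj:
                return float(obj[name])
        except (ValueError, TypeError):
            pass
    # Fall back to a regex over the raw log; take the last occurrence.
    matches = re.findall(rf"{name}[=:\s]+(-?[0-9.]+)", log_text)
    return float(matches[-1]) if matches else None
```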

Keep / revert rules

Default decision rule:

  • keep if metric improves in the target direction
  • revert if metric is equal or worse

But also apply a simplicity filter:

  • a tiny gain with ugly complexity is usually not worth keeping
  • an equal result with simpler code can be worth keeping
  • if a change weakens maintainability or breaks clarity, require a clearly meaningful metric win
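
One way to encode the default rule plus the simplicity filter is a small predicate that demands a larger margin when the change adds complexity. The 1% margin below is purely illustrative; pick a threshold that fits the metric's noise level.

```python
def keep_change(new, best, direction="min",
                min_gain=0.0, added_complexity=False):
    # Reject runs with no extracted metric outright.
    if new is None:
        return False
    gain = (best - new) if direction == "min" else (new - best)
    # Illustrative simplicity filter: a complexity-adding change must
    # beat the best result by at least 1% of its magnitude.
    threshold = max(min_gain, abs(best) * 0.01) if added_complexity else min_gain
    return gain > threshold
```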

Validation checklist

Before accepting a winning run, check:

  • only allowed files changed
  • command completed within the allowed time
  • metric was extracted from the intended source
  • result is better than the previous best by the configured direction
  • no obvious crash, NaN, or partial-run artifact was mistaken for success
  • repo is left at the kept frontier, not at a discarded candidate
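
The "only allowed files changed" check can be automated with git, assuming the last kept state is committed at HEAD. Both function names are illustrative.

```python
import subprocess

def changed_files():
    # Files modified relative to the last kept commit (HEAD).
    out = subprocess.run(["git", "diff", "--name-only", "HEAD"],
                         capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]

def only_allowed_changed(editable_files):
    # Every modified path must be on the explicit allowlist.
    return all(path in editable_files for path in changed_files())
```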

Guardrails

Do

  • keep the experiment scope narrow
  • require a stable metric and automated command
  • prefer one hypothesis per run
  • preserve the evaluation harness unless explicitly allowed
  • revert losers quickly
  • log crashes and near-misses
  • cap the number of experiments unless the user explicitly requests unattended continuation

Don’t

  • do not run unbounded “never stop” loops by default
  • do not modify files outside editable_files
  • do not redefine the metric unless the user explicitly asks for co-design
  • do not add dependencies unless allowed
  • do not keep complexity-heavy changes for negligible gains
  • do not confuse “program runs” with “experiment succeeded”

Only if

  • run unattended for a long time only if the user explicitly requests it
  • modify multiple files only if they are explicitly in scope
  • change harness or benchmark only when the user wants joint optimization of system and evaluator
  • use helper scripts only when direct tooling/parsing is too brittle

Failure modes and recovery

Crash or timeout

  • inspect the log tail first
  • try a quick, obvious fix once or twice
  • if the idea itself seems broken, log crash and revert
  • do not spend many experiments debugging one bad direction

Missing metric

  • verify the command actually finished
  • inspect the log for formatting drift
  • try a fallback extraction method
  • if still unavailable, mark the run invalid and revert

High metric variance

  • warn the user that the harness may be too noisy
  • consider re-running the best candidate once for confirmation
  • avoid over-claiming progress on tiny deltas

Dirty repo state

  • isolate experiment artifacts from source changes
  • avoid committing result logs unless the user wants them versioned
  • if the worktree becomes confusing, reset to the last known-good frontier before continuing

Minimal examples

Example 1: training script

User request:

Use autoresearch on this repo. Editable file is train.py. Run uv run train.py. Minimize val_bpb. Try 8 experiments.

Interpretation:

  • editable_files=["train.py"]
  • run_command="uv run train.py"
  • metric_name="val_bpb"
  • metric_direction="min"
  • max_experiments=8

Example 2: benchmark solver

User request:

Optimize this benchmark project. You may edit solver.py and config.py. Run python bench.py. Maximize score. Keep it to 6 experiments.

Interpretation:

  • run baseline
  • test one optimization idea per run
  • keep only score-improving or simplification-improving changes

Notes for use in this agent environment

  • Use manage_task for the experiment workflow.
  • Use concise progress updates for long runs.
  • Prefer existing tools and skills over large custom scripts.
  • If the task turns into ordinary coding work rather than bounded search, stop using this skill and switch modes.
