# Auto Research

Skill name: `autoresearch`

Run bounded autonomous code experiments on a user-provided repository with a stable automated metric. Use when the user wants iterative improve-measure-keep-or-revert loops such as tuning training scripts, benchmark solvers, evaluable agent workflows, or performance/code-quality experiments. Do not use for open-ended product development, tasks without a reliable automated metric, or broad multi-file refactors.

Run a constrained experiment loop on a repository: edit allowed files, run a command, extract a metric, keep the change if it improves the score, otherwise revert it. This skill is for bounded optimization, not open-ended software development.
## When to use
Use this skill when the task can be framed as:
- a repository or working directory is known
- the editable surface is explicitly bounded
- there is a stable command to run
- there is a stable metric to extract
- success means improving that metric over a limited number of experiments
Good fits:
- training-script tuning
- benchmark or solver optimization
- inference/runtime optimization with a measurable score
- prompt/program search when evaluation is automated
- agent workflow tuning with a fixed harness
Do not use when:
- the user says only “make the project better”
- there is no automated evaluation command
- the metric is subjective or manual
- the work requires wide architectural refactors
- the best path is regular engineering, not iterative experiment search
## Required inputs

Collect or infer these before starting:

- `repo_path`: repository or working directory
- `goal`: short statement of what is being optimized
- `editable_files`: explicit allowlist of files the agent may change
- `run_command`: command that runs one experiment
- `metric_name`: the metric to optimize
- `metric_direction`: `min` or `max`
- `max_experiments`: hard cap on experiment count
## Optional inputs

Use these when provided or clearly useful:

- `fixed_files`: read for context but never modify
- `metric_extract_hint`: grep pattern, regex, JSON path, or log line hint
- `timeout_seconds`: per-run timeout
- `branch_name`: experiment branch name
- `results_file`: default `results.tsv`
- `baseline_required`: default `true`
- `allow_dependencies`: default `false`
- `allow_harness_changes`: default `false`
- `session_report_file`: default `session_report.md`
## Output artifacts

Produce or maintain these when practical:

- `results.tsv` or the user-specified results file
- one log per experiment, or a rolling `run.log`
- `session_report.md` with best result, kept changes, discarded themes, and crash notes
- best commit hash or diff summary if git is available
## Core workflow

### 1. Bound the search space
Identify the optimization surface first.
- Read repository context and any user-provided instructions.
- Read all `fixed_files` and `editable_files`.
- Confirm that the metric and run command are stable enough to compare runs.
- If the task lacks a stable metric, do not force this skill. Switch to normal engineering.
### 2. Prepare the experiment state
- If the repo uses git, create or switch to a dedicated branch.
- Record the starting commit or working tree state.
- Create `results.tsv` with a tab-separated header if it does not exist. Prefer this header: `commit`, `metric`, `status`, `description`.
- If memory/runtime is an important side metric, extend the header instead of inventing multiple side logs.
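The results-file bookkeeping in step 2 is simple enough to sketch directly. A minimal sketch, assuming the recommended four-column header; the side-metric extension appears only as a comment:

```python
import os

HEADER = ["commit", "metric", "status", "description"]
# If memory/runtime matters as a side metric, extend this header
# (e.g. HEADER + ["peak_mem_mb"]) instead of adding a second log.

def ensure_results_file(path: str = "results.tsv") -> None:
    """Create the TSV results file with a header row if it does not exist."""
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write("\t".join(HEADER) + "\n")

def record_row(path: str, commit: str, metric, status: str, description: str) -> None:
    """Append one experiment as a tab-separated row."""
    with open(path, "a") as f:
        f.write(f"{commit}\t{metric}\t{status}\t{description}\n")
```

A baseline row would then be recorded as, e.g., `record_row("results.tsv", "a1b2c3d", 3.214, "baseline", "unmodified system")`.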
### 3. Establish a baseline
Run the unmodified system first unless the user explicitly says not to.
- Capture stdout/stderr into a log file.
- Extract the metric.
- Record the baseline row in `results.tsv`.
- If no valid baseline can be established, stop and explain why.
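The baseline run — and every later experiment run — can be sketched as one function that captures stdout/stderr into a log file and enforces the per-run timeout. A sketch, not a prescribed implementation:

```python
import shlex
import subprocess

def run_experiment(run_command: str, log_path: str, timeout_seconds=None):
    """Run one experiment, capturing stdout and stderr into a log file.

    Returns the process exit code, or None if the run timed out.
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(
                shlex.split(run_command),
                stdout=log,
                stderr=subprocess.STDOUT,
                timeout=timeout_seconds,
            )
            return proc.returncode
        except subprocess.TimeoutExpired:
            return None
```

Redirecting into a file rather than reading streamed console output also helps with the noisy-output concern raised under metric extraction below.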
### 4. Run a bounded loop

Repeat until `max_experiments` is reached or progress clearly stalls.
For each experiment:
- Choose one focused idea.
- Edit only files in `editable_files`.
- If using git, commit the candidate change before running.
- Run the experiment with log capture and timeout.
- Extract the metric using the strongest available method.
- Compare against the current best.
- Record a row in `results.tsv`.
- Keep the change if it improves the metric enough to justify the added complexity.
- Otherwise revert to the prior best state.
Prefer one change theme per run. Avoid bundling unrelated edits into one experiment.
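The keep-or-revert skeleton of this loop can be shown as a toy, self-contained sketch; `candidates` and `evaluate` here stand in for real experiment ideas and real runs, which this skill does not prescribe:

```python
def bounded_search(candidates, evaluate, direction="min", max_experiments=8):
    """Keep-or-revert loop over candidate changes.

    `candidates` yields one focused idea per experiment; `evaluate`
    returns the metric for a candidate. The first candidate plays the
    role of the baseline. Returns (best, best_score, history).
    """
    better = (lambda a, b: a < b) if direction == "min" else (lambda a, b: a > b)
    best, best_score, history = None, None, []
    for i, candidate in enumerate(candidates):
        if i >= max_experiments:
            break
        score = evaluate(candidate)
        if best_score is None or better(score, best_score):
            best, best_score, status = candidate, score, "kept"
        else:
            status = "reverted"  # stay at the prior best state
        history.append((candidate, score, status))
    return best, best_score, history
```

A toy run tuning a single hypothetical parameter: `bounded_search([0.1, 0.3, 0.05], lambda lr: (lr - 0.07) ** 2, "min", 8)` keeps 0.1 as the baseline, reverts 0.3, and keeps 0.05.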
### 5. End with a report

Write a concise `session_report.md` including:
- goal and metric
- baseline result
- best result
- kept experiments
- discarded themes
- crash summary
- recommended next experiments
## Metric extraction strategy
Use the most reliable method available, in this order:
- structured output already produced by the program
- exact grep target or fixed log line
- regex parsing from logs
- small helper parsing script only if necessary
Cross-check anomalies:
- if the metric is missing, inspect the tail of the log
- if the metric is implausible, verify the run actually completed
- if outputs are noisy, prefer dedicated log redirection over reading streamed console output
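That fallback order can be sketched as a single extraction function. The JSON-line and `name: value` log formats used here are hypothetical examples of program output, not formats this skill requires:

```python
import json
import re

def extract_metric(log_text: str, metric_name: str):
    """Extract a metric from a run log, trying structured output first,
    then an exact log line, then a looser regex. Returns None if absent."""
    # 1. Structured output: a JSON object on its own line, e.g. {"val_bpb": 3.2}
    for line in reversed(log_text.splitlines()):
        line = line.strip()
        if line.startswith("{"):
            try:
                obj = json.loads(line)
                if metric_name in obj:
                    return float(obj[metric_name])
            except (ValueError, TypeError):
                pass
    # 2. Exact log line, e.g. "val_bpb: 3.214"
    m = re.search(rf"^{re.escape(metric_name)}:\s*([-+0-9.eE]+)\s*$",
                  log_text, flags=re.MULTILINE)
    if m:
        return float(m.group(1))
    # 3. Looser regex: metric name, a short separator, then a number
    m = re.search(rf"{re.escape(metric_name)}\D{{0,3}}([-+]?\d+(?:\.\d+)?)",
                  log_text)
    return float(m.group(1)) if m else None
```

Returning `None` rather than raising keeps the "missing metric" case explicit, so the loop can mark the run invalid instead of crashing.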
## Keep / revert rules
Default decision rule:
- keep if metric improves in the target direction
- revert if metric is equal or worse
But also apply a simplicity filter:
- a tiny gain with ugly complexity is usually not worth keeping
- an equal result with simpler code can be worth keeping
- if a change weakens maintainability or breaks clarity, require a clearly meaningful metric win
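These rules reduce to a small decision function. The complexity inputs and the `min_meaningful_gain` threshold are illustrative proxies (for example, lines changed), not values the skill prescribes:

```python
def decide(new_metric, best_metric, new_complexity, best_complexity,
           direction="min", min_meaningful_gain=0.0):
    """Return "keep" or "revert" using the metric plus a simplicity filter."""
    gain = (best_metric - new_metric) if direction == "min" else (new_metric - best_metric)
    simpler = new_complexity < best_complexity
    if gain > 0:
        # A tiny gain bought with extra complexity is not worth keeping.
        if not simpler and gain < min_meaningful_gain:
            return "revert"
        return "keep"
    if gain == 0 and simpler:
        return "keep"  # equal result, simpler code
    return "revert"
```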
## Validation checklist
Before accepting a winning run, check:
- only allowed files changed
- command completed within the allowed time
- metric was extracted from the intended source
- result is better than the previous best by the configured direction
- no obvious crash, NaN, or partial-run artifact was mistaken for success
- repo is left at the kept frontier, not at a discarded candidate
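The first checklist item — only allowed files changed — can be verified mechanically when git is available. A sketch using `git diff --name-only` against the last committed state; the allowlist check itself is pure set logic:

```python
import subprocess

def changed_files(base_ref: str = "HEAD") -> set:
    """Files modified relative to base_ref, per `git diff --name-only`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base_ref],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in out.splitlines() if line}

def only_allowed_changed(changed: set, editable_files) -> bool:
    """True when every changed file is in the editable allowlist."""
    return changed <= set(editable_files)
```

Usage: `only_allowed_changed(changed_files(), cfg.editable_files)` before accepting a winning run (untracked files need a separate `git status` check, omitted here).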
## Guardrails

### Do
- keep the experiment scope narrow
- require a stable metric and automated command
- prefer one hypothesis per run
- preserve the evaluation harness unless explicitly allowed
- revert losers quickly
- log crashes and near-misses
- cap the number of experiments unless the user explicitly requests unattended continuation
### Don’t
- do not run infinite “NEVER STOP” loops by default
- do not modify files outside `editable_files`
- do not redefine the metric unless the user explicitly asks for co-design
- do not add dependencies unless allowed
- do not keep complexity-heavy changes for negligible gains
- do not confuse “program runs” with “experiment succeeded”
### Only if
- run unattended for a long time only if the user explicitly requests it
- modify multiple files only if they are explicitly in scope
- change harness or benchmark only when the user wants joint optimization of system and evaluator
- use helper scripts only when direct tooling/parsing is too brittle
## Failure modes and recovery

### Crash or timeout
- inspect the log tail first
- try a quick, obvious fix once or twice
- if the idea itself seems broken, log `crash` and revert
- do not spend many experiments debugging one bad direction
### Missing metric
- verify the command actually finished
- inspect the log for formatting drift
- try a fallback extraction method
- if still unavailable, mark the run invalid and revert
### High metric variance
- warn the user that the harness may be too noisy
- consider re-running the best candidate once for confirmation
- avoid over-claiming progress on tiny deltas
### Dirty repo state
- isolate experiment artifacts from source changes
- avoid committing result logs unless the user wants them versioned
- if the worktree becomes confusing, reset to the last known-good frontier before continuing
## Minimal examples

### Example 1: training script
User request:

> Use autoresearch on this repo. Editable file is `train.py`. Run `uv run train.py`. Minimize `val_bpb`. Try 8 experiments.
Interpretation:

- `editable_files=["train.py"]`
- `run_command="uv run train.py"`
- `metric_name="val_bpb"`
- `metric_direction="min"`
- `max_experiments=8`
### Example 2: benchmark solver
User request:

> Optimize this benchmark project. You may edit `solver.py` and `config.py`. Run `python bench.py`. Maximize `score`. Keep it to 6 experiments.
Interpretation:
- run baseline
- test one optimization idea per run
- keep only score-improving or simplification-improving changes
## Notes for use in this agent environment

- Use `manage_task` for the experiment workflow.
- Use concise progress updates for long runs.
- Prefer existing tools and skills over large custom scripts.
- If the task turns into ordinary coding work rather than bounded search, stop using this skill and switch modes.