Runner
Source files: src/runner/registry.ts, src/runner/execute.ts
The runner invokes LLM CLIs as child processes. Each LLM has a different CLI interface — Claude Code takes prompts as positional arguments (claude --print "prompt"), while most others accept them via stdin. The model registry abstracts these differences: each model key maps to a ModelConfig with a command() function that produces a uniform CommandSpec, and a parseOutput() function that normalises the raw response. The execution loop doesn’t know or care which LLM it’s talking to — it just resolves a model key and gets back everything it needs.
Tests run sequentially because LLM CLIs are rate-limited — parallel invocation would trigger throttling or errors. Each test goes through the full cycle (build prompt, resolve model, spawn process, capture output, parse results) before the next one starts.
CommandSpec
Describes how to invoke an LLM CLI:
```ts
interface CommandSpec {
  command: string; // CLI binary name (e.g. "claude")
  args: string[]; // Command-line arguments
  stdin?: string; // Optional stdin input
}
```
CommandOptions
Options passed to a model’s command() function:
```ts
interface CommandOptions {
  skipPermissions?: boolean;
}
```
ModelConfig
The interface each registry entry implements:
```ts
interface ModelConfig {
  command(prompt: string, options?: CommandOptions): CommandSpec;
  parseOutput(raw: string): string;
  tool: string; // Display name (e.g. "Claude Code")
  model?: string; // Model identifier (e.g. "claude-sonnet-4-6"), absent for default-only tools
}
```
Model registry
The registry is a flat object in src/runner/registry.ts mapping model key strings to ModelConfig entries. Factory functions create entries for each tool:
Factory functions
| Factory | CLI binary | Prompt delivery | parseOutput behaviour | Permission flag |
|---|---|---|---|---|
| claudeCode(model) | claude | Positional arg (--print <prompt>) | Extracts .result or .text from JSON, falls back to raw | --dangerously-skip-permissions |
| gemini(model) | gemini | stdin | Passthrough | -y |
| codex(model) | codex | stdin | Passthrough | --dangerously-bypass-approvals-and-sandbox |
| aider(model, cliModel) | aider | --message <prompt> | Passthrough | --yes-always |
| opencode(model) | opencode | stdin | Passthrough | (none) |
| goose(model) | goose | stdin | Passthrough | (none) |
| crush(model) | crush | stdin | Passthrough | (none) |
| qwen(model) | qwen | stdin | Passthrough | (none) |
| copilot(model) | copilot | stdin | Passthrough | (none) |
| forge(model) | forge | stdin | Passthrough | (none) |
| plandexDefault() | plandex | stdin | Passthrough | (none) |
| openhandsDefault() | openhands | stdin | Passthrough | (none) |
| cursorDefault() | agent | stdin | Passthrough | --force |
| ampDefault() | amp | stdin | Passthrough | (none) |
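To make the two delivery styles concrete, here is a minimal sketch of what a positional-argument factory and a stdin factory might look like. The binary names, permission flags, and parseOutput behaviour come from the table; the "--model" flag and the exact argument order are assumptions for illustration and may not match src/runner/registry.ts.

```typescript
// Interfaces as documented on this page.
interface CommandSpec { command: string; args: string[]; stdin?: string }
interface CommandOptions { skipPermissions?: boolean }
interface ModelConfig {
  command(prompt: string, options?: CommandOptions): CommandSpec;
  parseOutput(raw: string): string;
  tool: string;
  model?: string;
}

// Positional delivery: the prompt travels in args, stdin stays unset.
// ASSUMPTION: "--model" and the argument order are illustrative only.
function claudeCode(model: string): ModelConfig {
  return {
    tool: "Claude Code",
    model,
    command(prompt, options) {
      const args = ["--print", prompt, "--model", model];
      if (options?.skipPermissions) args.push("--dangerously-skip-permissions");
      return { command: "claude", args };
    },
    parseOutput(raw) {
      try {
        const parsed = JSON.parse(raw);
        const text = parsed?.result ?? parsed?.text;
        return typeof text === "string" ? text : raw;
      } catch {
        return raw; // not JSON: fall back to the raw output
      }
    },
  };
}

// stdin delivery: the prompt travels in CommandSpec.stdin.
// ASSUMPTION: "--model" is illustrative only.
function gemini(model: string): ModelConfig {
  return {
    tool: "Gemini CLI",
    model,
    command(prompt, options) {
      const args = ["--model", model];
      if (options?.skipPermissions) args.push("-y");
      return { command: "gemini", args, stdin: prompt };
    },
    parseOutput: (raw) => raw, // passthrough
  };
}
```

Both factories return the same ModelConfig shape, which is what lets the execution loop stay unaware of the tool behind a model key.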
Registry entries (by tool)
The registry contains 90+ model keys across 14 tools. Example entries per tool:
| Tool | Example keys |
|---|---|
| Claude Code | claude-code-opus-4-6, claude-code-sonnet-4-6, claude-code-sonnet-4-5, claude-code-haiku-4-5 |
| Gemini CLI | gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.0-flash |
| Codex CLI | codex-o3, codex-o4-mini, codex-gpt-4.1, codex-gpt-4.1-mini, codex-gpt-4.1-nano |
| Aider | aider-claude-opus-4-6, aider-gpt-4.1, aider-gemini-2.5-pro, aider-deepseek-r1, … |
| OpenCode | opencode-claude-opus-4-6, opencode-gpt-4.1, opencode-gemini-2.5-pro, … |
| Goose | goose-claude-opus-4-6, goose-gpt-4.1, goose-gemini-2.5-pro, … |
| Crush | crush-claude-opus-4-6, crush-gpt-4.1, crush-gemini-2.5-pro, … |
| Qwen | qwen3-coder-plus, qwen3-coder, qwen3-coder-fast |
| GitHub Copilot | copilot-claude-opus-4-6, copilot-gpt-4.1, copilot-gemini-2.5-pro, … |
| Forge | forge-claude-opus-4-6, forge-gpt-4.1, forge-gemini-2.5-pro, … |
| Plandex | plandex-default |
| OpenHands | openhands-default |
| Cursor | cursor-default |
| Amp | amp-default |
Run semtest list to see the full set.
Resolution
```ts
const DEFAULT_MODEL: ModelKey = "claude-code-sonnet-4-6";

function resolveModel(key: ModelKey): ModelConfig
```
resolveModel() is a simple lookup. The ModelKey type is a union of all registered key strings, so invalid keys are caught at compile time.
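One way to get a key union that tracks the registry is to derive it with keyof typeof. The sketch below assumes that approach and uses a trimmed ModelConfig with three sample entries; the real registry may define ModelKey differently.

```typescript
// Trimmed ModelConfig: command() and parseOutput() are omitted for brevity.
interface ModelConfig { tool: string; model?: string }

// Three sample entries standing in for the full registry.
const registry = {
  "claude-code-sonnet-4-6": { tool: "Claude Code", model: "claude-sonnet-4-6" },
  "gemini-2.5-pro": { tool: "Gemini CLI", model: "gemini-2.5-pro" },
  "amp-default": { tool: "Amp" },
};

// The union of key strings is derived from the object itself,
// so adding a registry entry automatically extends ModelKey.
type ModelKey = keyof typeof registry;

const DEFAULT_MODEL: ModelKey = "claude-code-sonnet-4-6";

function resolveModel(key: ModelKey): ModelConfig {
  return registry[key];
}

// resolveModel("not-a-model"); // rejected by the type checker
```

Keys that arrive as plain strings at runtime (for example from frontmatter) would still need validating before they can be treated as a ModelKey.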
Execution loop
executeTests() in src/runner/execute.ts is the core orchestrator:
- Iterates over tests sequentially (LLM CLIs are rate-limited)
- For each test:
  - Resolves per-test overrides from frontmatter (llm, timeout, skipPermissionsIfPossible)
  - Resolves the ModelConfig from the model key
  - Builds the prompt
  - Builds the CLI command via entry.command()
  - Spawns the process (src/utils/process.ts) with optional timeout
  - Parses the raw output via entry.parseOutput(), then extracts JSON results
  - Retries up to 3 times on empty responses
  - Fires progress callbacks for terminal output
- Supports repeat runs: each test can run N times, stopping on first failure
- Supports bail/maxfail: stops the entire suite after N failures
- Aggregates all results into a RunResult
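The retry step in the list above can be sketched as a small wrapper around one LLM invocation. This is an illustration, not the code in execute.ts: runWithRetry and invoke are hypothetical names, and it assumes "up to 3 times" means three retries after the initial attempt and that a whitespace-only response counts as empty.

```typescript
const MAX_EMPTY_RETRIES = 3;

// invoke stands in for "spawn the process and parse its output".
async function runWithRetry(
  invoke: () => Promise<string>,
  onTestRetry?: (attempt: number) => void,
): Promise<string> {
  let output = await invoke();
  for (
    let attempt = 1;
    attempt <= MAX_EMPTY_RETRIES && output.trim() === "";
    attempt++
  ) {
    onTestRetry?.(attempt); // let the terminal UI report the retry
    output = await invoke();
  }
  return output; // may still be empty if every attempt came back empty
}
```

Because each attempt is awaited before the next starts, retries keep the sequential, one-process-at-a-time behaviour described earlier.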
Frontmatter overrides
Per-test configuration is resolved with frontmatter taking precedence over the global config:
```ts
const modelKey = test.frontmatter.llm ?? config.llm;
const timeout = test.frontmatter.timeout ?? config.timeout;
const skipPermissions = test.frontmatter.skipPermissionsIfPossible ?? config.skipPermissionsIfPossible;
```
This allows individual spec files to use a different LLM or timeout without changing the global configuration.
Repeat execution
When config.repeat > 1, each test is run multiple times. If any repetition produces a failure or error, the loop breaks and that result is reported for the test. This is useful for flaky test detection: run each test 3 times to confirm consistency.
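A minimal sketch of the repeat loop, assuming a simplified three-value status and a hypothetical runOnce callback in place of the real per-test execution:

```typescript
type Status = "pass" | "fail" | "error";

// Runs a test up to `repeat` times and stops at the first non-pass result.
async function runRepeated(
  repeat: number,
  runOnce: (iteration: number) => Promise<Status>,
): Promise<Status> {
  let last: Status = "pass";
  for (let i = 1; i <= Math.max(1, repeat); i++) {
    last = await runOnce(i);
    if (last !== "pass") break; // a failure or error ends the repetitions
  }
  return last;
}
```

A test is therefore reported as passing only if every repetition passed.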
Bail / maxfail
The bail system provides early exit:
- bail: true → stop after the first failing test file (equivalent to maxfail: 1)
- bail: N (number) → stop after N failing test files
- bail: false → run all tests regardless of failures
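The three bail forms reduce to a single failure threshold. The helper names below are hypothetical, and the sketch assumes N is a positive integer:

```typescript
// Normalises the bail setting into a maximum number of failing test files.
function maxFailures(bail: boolean | number): number {
  if (bail === true) return 1; // same as maxfail: 1
  if (bail === false) return Infinity; // never stop early
  return bail; // explicit threshold N
}

// True once the number of failing files has reached the threshold.
function shouldBail(bail: boolean | number, failedFiles: number): boolean {
  return failedFiles >= maxFailures(bail);
}
```

Using Infinity for bail: false keeps the comparison uniform, so the loop needs no special case for "run everything".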
Timeout handling
When a timeout is configured (via --timeout flag, config timeout, or frontmatter timeout), each LLM subprocess is given a time limit. If exceeded, the process receives SIGTERM, then SIGKILL after 5 seconds. The test result is marked as an error with a timeout message. Signal handlers are installed to clean up any active child processes if the main process is interrupted (SIGINT/SIGTERM).
TestRunResult
Each individual test result within a run:
```ts
interface TestRunResult {
  id: string; // Test ID from LLM, or "{filename}#{index}" fallback
  sourceFile: string; // Filename (e.g. "auth-middleware.spec.md")
  sourceFilePath: string; // Absolute path to the source file
  result: TestResult; // Parsed result from the LLM
  group?: string; // Directory-based group
}
```
RunResult
```ts
interface RunResult {
  tests: TestRunResult[];
  summary: {
    total: number;
    passed: number;
    failed: number;
    errored: number;
    invalid: number;
    skipped: number;
  };
  status: "pass" | "fail" | "error";
  timestamp: string;
}
```
Permission bypass
When skipPermissionsIfPossible is enabled (via config, CLI flag, or frontmatter), the command() function receives { skipPermissions: true }. Each factory function conditionally appends its tool-specific flag:
| Tool | Flag appended |
|---|---|
| Claude Code | --dangerously-skip-permissions |
| Gemini CLI | -y |
| Codex CLI | --dangerously-bypass-approvals-and-sandbox |
| Aider | --yes-always |
| Cursor | --force |
| Others | (none — silently ignored) |
Status precedence
The overall run status follows the precedence error > fail > pass: any errored test makes the run an error, otherwise any failed test makes it a fail, otherwise the run passes. Invalid and skipped results don’t affect the overall status.
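Expressed against the summary counts from RunResult, the precedence rule might look like the sketch below. overallStatus is a hypothetical name; the real aggregation may work from individual results instead.

```typescript
interface Summary {
  total: number;
  passed: number;
  failed: number;
  errored: number;
  invalid: number;
  skipped: number;
}

// error outranks fail, fail outranks pass; invalid and skipped are ignored.
function overallStatus(summary: Summary): "pass" | "fail" | "error" {
  if (summary.errored > 0) return "error";
  if (summary.failed > 0) return "fail";
  return "pass";
}
```

Under this reading, a run containing only invalid or skipped results still reports pass.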
Progress callbacks
The execution loop accepts optional callbacks for live terminal feedback:
| Callback | Fires when |
|---|---|
| onTestStart | A test begins execution |
| onTestRetry | An empty response triggers a retry |
| onRepeatRun | A repeat iteration begins |
| onTestComplete | A test finishes (success or failure) |
| onDebugOutput | Raw stdout/stderr is captured (debug mode only) |
| onBail | The bail threshold is reached and execution stops early |
Process spawner
Source file: src/utils/process.ts
runCommand(spec, timeoutMs?) spawns an LLM CLI as a child process using child_process.spawn (no shell). It returns a Promise<LLMRawResult>:
```ts
interface LLMRawResult {
  stdout: string;
  stderr: string;
  exitCode: number;
  timedOut?: boolean;
  timeoutMessage?: string;
}
```
Timeout handling
When timeoutMs is set and > 0, a timer sends SIGTERM on expiry. If the process doesn’t exit within 5 seconds, it escalates to SIGKILL. The result includes timedOut: true and a human-readable timeoutMessage.
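The SIGTERM-then-SIGKILL escalation can be sketched with Node's child_process module. armTimeout is a hypothetical helper written for this example, not a function from src/utils/process.ts; only the 5-second grace period and the signal order come from the description above.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

const KILL_GRACE_MS = 5_000; // delay between SIGTERM and SIGKILL

// Hypothetical helper: arms a timeout on an already-spawned child.
function armTimeout(child: ChildProcess, timeoutMs?: number) {
  let timedOut = false;
  let killTimer: ReturnType<typeof setTimeout> | undefined;
  const timer =
    timeoutMs !== undefined && timeoutMs > 0
      ? setTimeout(() => {
          timedOut = true;
          child.kill("SIGTERM"); // ask the process to stop first
          killTimer = setTimeout(() => child.kill("SIGKILL"), KILL_GRACE_MS);
        }, timeoutMs)
      : undefined;

  // Cancel both timers once the child has exited so nothing fires late.
  child.once("close", () => {
    clearTimeout(timer);
    clearTimeout(killTimer);
  });

  return { didTimeOut: () => timedOut };
}

// Usage: a Node child that would otherwise idle for 10 seconds.
const child = spawn(process.execPath, ["-e", "setTimeout(() => {}, 10000)"]);
const guard = armTimeout(child, 200);
child.once("close", () => console.log("timed out:", guard.didTimeOut()));
```

Clearing the timers on close matters: without it, a pending SIGKILL timer would keep the parent's event loop alive for the full grace period.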
Signal cleanup
On first invocation, runCommand installs SIGINT and SIGTERM handlers on the main process. If the user interrupts semtest (Ctrl+C), all active child processes are terminated before the main process exits (code 130 for SIGINT, 143 for SIGTERM).
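A rough sketch of install-once signal cleanup follows. The names (activeChildren, trackChild, installSignalHandlers) are hypothetical, and sending SIGTERM to each child is an assumption about how the children are terminated; the exit codes are the ones stated above.

```typescript
import type { ChildProcess } from "node:child_process";

const activeChildren = new Set<ChildProcess>();
let handlersInstalled = false;

// Installs the handlers at most once, however many commands are run.
function installSignalHandlers(): void {
  if (handlersInstalled) return;
  handlersInstalled = true;
  const exitCodes = { SIGINT: 130, SIGTERM: 143 } as const;
  for (const signal of ["SIGINT", "SIGTERM"] as const) {
    process.on(signal, () => {
      // ASSUMPTION: children are asked to stop with SIGTERM.
      for (const child of activeChildren) child.kill("SIGTERM");
      process.exit(exitCodes[signal]);
    });
  }
}

// Registers a child for cleanup and forgets it once it has exited.
function trackChild(child: ChildProcess): void {
  installSignalHandlers();
  activeChildren.add(child);
  child.once("close", () => activeChildren.delete(child));
}
```

Note that child.kill() only sends the signal; this sketch does not wait for the children to exit before the parent does.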
stdin delivery
If spec.stdin is set (used by most tools except Claude Code), it’s written to the child process’s stdin and the stream is closed. If not set, stdin is closed immediately.
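The stdin rule is small enough to show in full. deliverStdin and echoThroughChild are hypothetical helpers for this example; the child is a tiny Node script that echoes its stdin, standing in for an LLM CLI.

```typescript
import { spawn, type ChildProcessWithoutNullStreams } from "node:child_process";

// Writes the prompt if there is one, then always closes stdin so the
// child is never left waiting for more input.
function deliverStdin(child: ChildProcessWithoutNullStreams, stdin?: string): void {
  // A child that exits early can close the pipe; ignore the resulting error.
  child.stdin.on("error", () => {});
  if (stdin !== undefined) child.stdin.write(stdin);
  child.stdin.end();
}

// Demo helper: spawns a Node child that copies stdin to stdout.
function echoThroughChild(input?: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ["-e", "process.stdin.pipe(process.stdout)"]);
    let out = "";
    child.stdout.on("data", (chunk) => (out += chunk));
    child.on("error", reject);
    child.on("close", () => resolve(out));
    deliverStdin(child, input);
  });
}
```

Closing stdin even when there is no prompt matters for CLIs that read from stdin when it is left open: they would otherwise block until the timeout fires.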