Runner

Source files: src/runner/registry.ts, src/runner/execute.ts

The runner invokes LLM CLIs as child processes. Each LLM has a different CLI interface — Claude Code takes prompts as positional arguments (claude --print "prompt"), while most others accept them via stdin. The model registry abstracts these differences: each model key maps to a ModelConfig with a command() function that produces a uniform CommandSpec, and a parseOutput() function that normalises the raw response. The execution loop doesn’t know or care which LLM it’s talking to — it just resolves a model key and gets back everything it needs.

Tests run sequentially because LLM CLIs are rate-limited — parallel invocation would trigger throttling or errors. Each test goes through the full cycle (build prompt, resolve model, spawn process, capture output, parse results) before the next one starts.

Types

`CommandSpec`

Describes how to invoke an LLM CLI:

interface CommandSpec {
  command: string;   // CLI binary name (e.g. "claude")
  args: string[];    // Command-line arguments
  stdin?: string;    // Optional stdin input
}

`CommandOptions`

Options passed to a model’s command() function:

interface CommandOptions {
  skipPermissions?: boolean;
}

`ModelConfig`

The interface each registry entry implements:

interface ModelConfig {
  command(prompt: string, options?: CommandOptions): CommandSpec;
  parseOutput(raw: string): string;
  tool: string;      // Display name (e.g. "Claude Code")
  model?: string;    // Model identifier (e.g. "claude-sonnet-4-6"), absent for default-only tools
}
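As an illustration of how a registry entry can satisfy this interface, the sketch below builds two entries from small factory helpers: one passes the prompt as an argument in the style of claude --print, the other sends it over stdin. The helper names (argPromptEntry, stdinPromptEntry), the argument order, and the empty argument list for the stdin tool are assumptions made for the example, not the contents of registry.ts.

```typescript
interface CommandSpec {
  command: string;
  args: string[];
  stdin?: string;
}

interface CommandOptions {
  skipPermissions?: boolean;
}

interface ModelConfig {
  command(prompt: string, options?: CommandOptions): CommandSpec;
  parseOutput(raw: string): string;
  tool: string;
  model?: string;
}

// Hypothetical helper: the prompt travels as an argument after --print.
function argPromptEntry(binary: string, tool: string, permissionFlag?: string): ModelConfig {
  return {
    tool,
    command(prompt: string, options: CommandOptions = {}): CommandSpec {
      const args: string[] = [];
      // Add the tool-specific flag only when bypass is requested and the tool has one.
      if (options.skipPermissions && permissionFlag) args.push(permissionFlag);
      args.push("--print", prompt);
      return { command: binary, args };
    },
    parseOutput: (raw: string): string => raw, // passthrough, for brevity
  };
}

// Hypothetical helper: the prompt travels over stdin.
function stdinPromptEntry(binary: string, tool: string, permissionFlag?: string): ModelConfig {
  return {
    tool,
    command(prompt: string, options: CommandOptions = {}): CommandSpec {
      const args: string[] = [];
      if (options.skipPermissions && permissionFlag) args.push(permissionFlag);
      return { command: binary, args, stdin: prompt };
    },
    parseOutput: (raw: string): string => raw,
  };
}

const claudeLike = argPromptEntry("claude", "Claude Code", "--dangerously-skip-permissions");
const gooseLike = stdinPromptEntry("goose", "Goose"); // no permission flag: the option is ignored
```

Because both helpers return the same CommandSpec shape, calling code never needs to branch on the tool.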

Model registry

The registry is a flat object in src/runner/registry.ts mapping model key strings to ModelConfig entries. Factory functions create the entries for each tool.

Factory functions

Factory	CLI binary	Prompt delivery	`parseOutput` behaviour	Permission flag
`claudeCode(model)`	`claude`	Positional arg (`--print <prompt>`)	Extracts `.result` or `.text` from JSON, falls back to raw	`--dangerously-skip-permissions`
`gemini(model)`	`gemini`	stdin	Passthrough	`-y`
`codex(model)`	`codex`	stdin	Passthrough	`--dangerously-bypass-approvals-and-sandbox`
`aider(model, cliModel)`	`aider`	`--message <prompt>`	Passthrough	`--yes-always`
`opencode(model)`	`opencode`	stdin	Passthrough	(none)
`goose(model)`	`goose`	stdin	Passthrough	(none)
`crush(model)`	`crush`	stdin	Passthrough	(none)
`qwen(model)`	`qwen`	stdin	Passthrough	(none)
`copilot(model)`	`copilot`	stdin	Passthrough	(none)
`forge(model)`	`forge`	stdin	Passthrough	(none)
`plandexDefault()`	`plandex`	stdin	Passthrough	(none)
`openhandsDefault()`	`openhands`	stdin	Passthrough	(none)
`cursorDefault()`	`agent`	stdin	Passthrough	`--force`
`ampDefault()`	`amp`	stdin	Passthrough	(none)
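The Claude Code parseOutput behaviour in the table (extract .result or .text from JSON, otherwise return the raw text) could look roughly like the sketch below. The function name parseClaudeOutput and the order in which the two fields are tried are assumptions.

```typescript
// Sketch: pull the response text out of a JSON envelope, falling back to raw output.
function parseClaudeOutput(raw: string): string {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (parsed !== null && typeof parsed === "object") {
      const envelope = parsed as Record<string, unknown>;
      if (typeof envelope.result === "string") return envelope.result;
      if (typeof envelope.text === "string") return envelope.text;
    }
  } catch {
    // Not valid JSON: fall through and return the raw text unchanged.
  }
  return raw;
}
```

A passthrough parseOutput, as used by the other tools, is simply the identity function on the raw string.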

Registry entries (by tool)

The registry contains 90+ model keys across 14 tools. Example entries per tool:

Tool	Example keys
Claude Code	`claude-code-opus-4-6`, `claude-code-sonnet-4-6`, `claude-code-sonnet-4-5`, `claude-code-haiku-4-5`
Gemini CLI	`gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`, `gemini-2.0-flash`
Codex CLI	`codex-o3`, `codex-o4-mini`, `codex-gpt-4.1`, `codex-gpt-4.1-mini`, `codex-gpt-4.1-nano`
Aider	`aider-claude-opus-4-6`, `aider-gpt-4.1`, `aider-gemini-2.5-pro`, `aider-deepseek-r1`, …
OpenCode	`opencode-claude-opus-4-6`, `opencode-gpt-4.1`, `opencode-gemini-2.5-pro`, …
Goose	`goose-claude-opus-4-6`, `goose-gpt-4.1`, `goose-gemini-2.5-pro`, …
Crush	`crush-claude-opus-4-6`, `crush-gpt-4.1`, `crush-gemini-2.5-pro`, …
Qwen	`qwen3-coder-plus`, `qwen3-coder`, `qwen3-coder-fast`
GitHub Copilot	`copilot-claude-opus-4-6`, `copilot-gpt-4.1`, `copilot-gemini-2.5-pro`, …
Forge	`forge-claude-opus-4-6`, `forge-gpt-4.1`, `forge-gemini-2.5-pro`, …
Plandex	`plandex-default`
OpenHands	`openhands-default`
Cursor	`cursor-default`
Amp	`amp-default`

Run semtest list to see the full set.

Resolution

const DEFAULT_MODEL: ModelKey = "claude-code-sonnet-4-6";

function resolveModel(key: ModelKey): ModelConfig

resolveModel() is a simple lookup — the ModelKey type is a union of all registered key strings, so invalid keys are caught at compile time.
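A minimal sketch of how a key union can be derived from the registry object is shown below, using a two-entry registry and a trimmed ModelConfig. The isModelKey guard is a hypothetical addition for strings that only arrive at runtime; nothing here is confirmed to match registry.ts.

```typescript
// Trimmed for the example; the real interface also has command() and parseOutput().
interface ModelConfig {
  tool: string;
  model?: string;
}

// Two illustrative entries; the real registry has many more.
const registry = {
  "claude-code-sonnet-4-6": { tool: "Claude Code", model: "claude-sonnet-4-6" },
  "amp-default": { tool: "Amp" },
};

// The union of all registered key strings.
type ModelKey = keyof typeof registry;

const DEFAULT_MODEL: ModelKey = "claude-code-sonnet-4-6";

function resolveModel(key: ModelKey): ModelConfig {
  return registry[key];
}

// Hypothetical runtime guard for strings whose value is not known at compile time.
function isModelKey(value: string): value is ModelKey {
  return Object.prototype.hasOwnProperty.call(registry, value);
}

// resolveModel("not-a-model"); // would be rejected by the compiler
```

With this shape, adding an entry to the registry object automatically extends ModelKey.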

Execution loop

executeTests() in src/runner/execute.ts is the core orchestrator:

- Iterates over tests sequentially (LLM CLIs are rate-limited)
- For each test:
  - Resolves per-test overrides from frontmatter (llm, timeout, skipPermissionsIfPossible)
  - Resolves the ModelConfig from the model key
  - Builds the prompt
  - Builds the CLI command via entry.command()
  - Spawns the process (src/utils/process.ts) with optional timeout
  - Parses the raw output via entry.parseOutput(), then extracts JSON results
  - Retries up to 3 times on empty responses
  - Fires progress callbacks for terminal output
- Supports repeat runs: each test can run N times, stopping on first failure
- Supports bail/maxfail: stops the entire suite after N failures
- Aggregates all results into a RunResult
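The retry-on-empty step from the list above can be sketched as follows. The function name runWithRetry, the reading of "up to 3 times" as three retries after the initial attempt, and the use of a whitespace-only check for "empty" are all assumptions.

```typescript
// Sketch: re-run an attempt while it yields an empty response, up to maxRetries extra tries.
async function runWithRetry(
  attempt: () => Promise<string>,
  maxRetries = 3,
  onRetry?: (retryNumber: number) => void,
): Promise<string> {
  let output = await attempt();
  for (let retry = 1; retry <= maxRetries && output.trim() === ""; retry++) {
    onRetry?.(retry); // plays the role of an onTestRetry-style callback
    output = await attempt();
  }
  // May still be empty if every attempt came back empty.
  return output;
}
```

The caller decides how to report a response that is still empty after the final retry.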

Frontmatter overrides

Per-test configuration is resolved with frontmatter taking precedence over the global config:

const modelKey = test.frontmatter.llm ?? config.llm;
const timeout = test.frontmatter.timeout ?? config.timeout;
const skipPermissions = test.frontmatter.skipPermissionsIfPossible ?? config.skipPermissionsIfPossible;

This allows individual spec files to use a different LLM or timeout without changing the global configuration.

Repeat execution

When config.repeat > 1, each test is run multiple times. If any repetition produces a failure or error, the loop breaks and uses that result. This is useful for flaky test detection — run each test 3 times to confirm consistency.
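A minimal sketch of the repeat loop, assuming a simplified three-value outcome and a hypothetical runRepeated helper (the real TestResult shape is not shown in this document):

```typescript
// Simplified stand-in for a test result; an assumption for this example.
type Outcome = "pass" | "fail" | "error";

// Run a test up to `repeat` times, stopping at the first non-passing iteration.
async function runRepeated(
  runOnce: (iteration: number) => Promise<Outcome>,
  repeat: number,
): Promise<Outcome> {
  let last: Outcome = "pass";
  const iterations = Math.max(1, repeat); // always run at least once
  for (let i = 1; i <= iterations; i++) {
    last = await runOnce(i);
    if (last !== "pass") break; // a failure or error ends the loop and becomes the result
  }
  return last;
}
```

With repeat set to 3, a test that fails on its second iteration is reported as failed and the third iteration never runs.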

Bail / maxfail

The bail system provides early exit:

- bail: true → stop after the first failing test file (equivalent to maxfail: 1)
- bail: N (number) → stop after N failing test files
- bail: false → run all tests regardless of failures
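The three bail settings reduce to a single failure threshold. The sketch below shows one way to express that; the helper names and the treatment of non-positive numbers as "never bail" are assumptions.

```typescript
type Bail = boolean | number;

// Convert the bail setting into the number of failing files that stops the run.
function bailThreshold(bail: Bail): number {
  if (bail === true) return 1; // same as maxfail: 1
  if (bail === false) return Infinity; // never stop early
  return bail > 0 ? bail : Infinity; // assumption: 0 or negative disables bailing
}

function shouldBail(failedFiles: number, bail: Bail): boolean {
  return failedFiles >= bailThreshold(bail);
}
```

The execution loop would check shouldBail after each test file and stop (firing onBail) once it returns true.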

Timeout handling

When a timeout is configured (via --timeout flag, config timeout, or frontmatter timeout), each LLM subprocess is given a time limit. If exceeded, the process receives SIGTERM, then SIGKILL after 5 seconds. The test result is marked as an error with a timeout message. Signal handlers are installed to clean up any active child processes if the main process is interrupted (SIGINT/SIGTERM).

`TestRunResult`

Each individual test result within a run:

interface TestRunResult {
  id: string;              // Test ID from LLM, or "{filename}#{index}" fallback
  sourceFile: string;      // Filename (e.g. "auth-middleware.spec.md")
  sourceFilePath: string;  // Absolute path to the source file
  result: TestResult;      // Parsed result from the LLM
  group?: string;          // Directory-based group
}

`RunResult`

interface RunResult {
  tests: TestRunResult[];
  summary: {
    total: number;
    passed: number;
    failed: number;
    errored: number;
    invalid: number;
    skipped: number;
  };
  status: "pass" | "fail" | "error";
  timestamp: string;
}

Permission bypass

When skipPermissionsIfPossible is enabled (via config, CLI flag, or frontmatter), the command() function receives { skipPermissions: true }. Each factory function conditionally appends its tool-specific flag:

Tool	Flag appended
Claude Code	`--dangerously-skip-permissions`
Gemini CLI	`-y`
Codex CLI	`--dangerously-bypass-approvals-and-sandbox`
Aider	`--yes-always`
Cursor	`--force`
Others	(none — silently ignored)

Status precedence

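The overall run status follows the precedence error > fail > pass: the run is error if any test errored, otherwise fail if any test failed, otherwise pass. Invalid and skipped results don’t affect the overall status.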
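Expressed as code, the precedence rule might look like the sketch below. The overallStatus name is hypothetical; the Summary fields mirror the RunResult summary shown earlier.

```typescript
interface Summary {
  total: number;
  passed: number;
  failed: number;
  errored: number;
  invalid: number;
  skipped: number;
}

type RunStatus = "pass" | "fail" | "error";

// error outranks fail, which outranks pass; invalid and skipped counts are ignored.
function overallStatus(summary: Summary): RunStatus {
  if (summary.errored > 0) return "error";
  if (summary.failed > 0) return "fail";
  return "pass";
}
```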

Progress callbacks

The execution loop accepts optional callbacks for live terminal feedback:

Callback	Fires when
`onTestStart`	A test begins execution
`onTestRetry`	An empty response triggers a retry
`onRepeatRun`	A repeat iteration begins
`onTestComplete`	A test finishes (success or failure)
`onDebugOutput`	Raw stdout/stderr is captured (debug mode only)
`onBail`	The bail threshold is reached and execution stops early

Process spawner

Source file: src/utils/process.ts

runCommand(spec, timeoutMs?) spawns an LLM CLI as a child process using child_process.spawn (no shell). It returns a Promise<LLMRawResult>:

interface LLMRawResult {
  stdout: string;
  stderr: string;
  exitCode: number;
  timedOut?: boolean;
  timeoutMessage?: string;
}

Timeout handling

When timeoutMs is set and > 0, a timer sends SIGTERM on expiry. If the process doesn’t exit within 5 seconds, it escalates to SIGKILL. The result includes timedOut: true and a human-readable timeoutMessage.
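The SIGTERM-then-SIGKILL escalation can be sketched against a small structural interface, which keeps the example independent of a real child process. Killable and armTimeout are hypothetical names; the exitCode and signalCode properties mirror those on a Node ChildProcess, which should fit this interface structurally.

```typescript
// Minimal view of a child process needed for timeout handling.
interface Killable {
  kill(signal: "SIGTERM" | "SIGKILL"): boolean;
  exitCode: number | null; // null while the process is still running
  signalCode: string | null; // set once the process was ended by a signal
}

// Arm a timeout; returns a function that cancels any pending timers.
function armTimeout(child: Killable, timeoutMs: number, graceMs = 5000): () => void {
  let graceTimer: ReturnType<typeof setTimeout> | undefined;
  const hasExited = (): boolean => child.exitCode !== null || child.signalCode !== null;

  const timer = setTimeout(() => {
    if (hasExited()) return;
    child.kill("SIGTERM"); // polite request first
    graceTimer = setTimeout(() => {
      if (!hasExited()) child.kill("SIGKILL"); // escalate if still alive
    }, graceMs);
  }, timeoutMs);

  return () => {
    clearTimeout(timer);
    if (graceTimer !== undefined) clearTimeout(graceTimer);
  };
}
```

The caller would invoke the returned cancel function when the child exits on its own, so that no timer fires afterwards.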

Signal cleanup

On first invocation, runCommand installs SIGINT and SIGTERM handlers on the main process. If the user interrupts semtest (Ctrl+C), all active child processes are terminated before the main process exits (code 130 for SIGINT, 143 for SIGTERM).
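One way to structure this cleanup is to track active children in a set and terminate them all from a single shutdown function. In the sketch below, the names activeChildren and shutdown, the choice to send SIGTERM to children, and the injected exit callback are assumptions; only the exit codes come from the description above.

```typescript
interface Terminable {
  kill(signal: "SIGTERM"): boolean;
}

// Children currently running; entries are added on spawn and removed on exit.
const activeChildren = new Set<Terminable>();

// Exit codes follow the 128 + signal number convention (SIGINT = 2, SIGTERM = 15).
function exitCodeFor(signal: "SIGINT" | "SIGTERM"): number {
  return signal === "SIGINT" ? 130 : 143;
}

// Terminate every tracked child, then exit with the matching code.
function shutdown(signal: "SIGINT" | "SIGTERM", exit: (code: number) => void): void {
  activeChildren.forEach((child) => {
    child.kill("SIGTERM");
  });
  activeChildren.clear();
  exit(exitCodeFor(signal));
}
```

In a Node process this could be registered once with process.once("SIGINT", () => shutdown("SIGINT", process.exit)), and likewise for SIGTERM.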

stdin delivery

If spec.stdin is set (the case for every tool except Claude Code and Aider, which receive the prompt as a command-line argument), it’s written to the child process’s stdin and the stream is then closed. If it is not set, stdin is closed immediately.