Runner

Source files: src/runner/registry.ts, src/runner/execute.ts

The runner invokes LLM CLIs as child processes. Each LLM has a different CLI interface — Claude Code takes prompts as positional arguments (claude --print "prompt"), while most others accept them via stdin. The model registry abstracts these differences: each model key maps to a ModelConfig with a command() function that produces a uniform CommandSpec, and a parseOutput() function that normalises the raw response. The execution loop doesn’t know or care which LLM it’s talking to — it just resolves a model key and gets back everything it needs.

Tests run sequentially because LLM CLIs are rate-limited — parallel invocation would trigger throttling or errors. Each test goes through the full cycle (build prompt, resolve model, spawn process, capture output, parse results) before the next one starts.

CommandSpec describes how to invoke an LLM CLI:

interface CommandSpec {
  command: string;   // CLI binary name (e.g. "claude")
  args: string[];    // Command-line arguments
  stdin?: string;    // Optional stdin input
}

CommandOptions carries the options passed to a model’s command() function:

interface CommandOptions {
  skipPermissions?: boolean;
}

ModelConfig is the interface each registry entry implements:

interface ModelConfig {
  command(prompt: string, options?: CommandOptions): CommandSpec;
  parseOutput(raw: string): string;
  tool: string;    // Display name (e.g. "Claude Code")
  model?: string;  // Model identifier (e.g. "claude-sonnet-4-6"), absent for default-only tools
}

The registry is a flat object in src/runner/registry.ts mapping model key strings to ModelConfig entries. Factory functions create entries for each tool:

| Factory | CLI binary | Prompt delivery | parseOutput behaviour | Permission flag |
| --- | --- | --- | --- | --- |
| claudeCode(model) | claude | Positional arg (--print <prompt>) | Extracts .result or .text from JSON, falls back to raw | --dangerously-skip-permissions |
| gemini(model) | gemini | stdin | Passthrough | -y |
| codex(model) | codex | stdin | Passthrough | --dangerously-bypass-approvals-and-sandbox |
| aider(model, cliModel) | aider | --message <prompt> | Passthrough | --yes-always |
| opencode(model) | opencode | stdin | Passthrough | (none) |
| goose(model) | goose | stdin | Passthrough | (none) |
| crush(model) | crush | stdin | Passthrough | (none) |
| qwen(model) | qwen | stdin | Passthrough | (none) |
| copilot(model) | copilot | stdin | Passthrough | (none) |
| forge(model) | forge | stdin | Passthrough | (none) |
| plandexDefault() | plandex | stdin | Passthrough | (none) |
| openhandsDefault() | openhands | stdin | Passthrough | (none) |
| cursorDefault() | agent | stdin | Passthrough | --force |
| ampDefault() | amp | stdin | Passthrough | (none) |
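
To make the table concrete, here is a minimal sketch of what two of these factories might look like, one delivering the prompt as an argument and one via stdin. It is an illustration under assumptions, not the contents of src/runner/registry.ts: the exact argument lists (in particular the --model flag placement) are guesses, and only the binary names, delivery modes, parse behaviour, and permission flags come from the table above.

```typescript
interface CommandSpec {
  command: string;
  args: string[];
  stdin?: string;
}

interface CommandOptions {
  skipPermissions?: boolean;
}

interface ModelConfig {
  command(prompt: string, options?: CommandOptions): CommandSpec;
  parseOutput(raw: string): string;
  tool: string;
  model?: string;
}

// Sketch of an argument-based factory: the prompt follows --print.
// The "--model" argument is an assumption about how the model is selected.
function claudeCode(model: string): ModelConfig {
  return {
    tool: "Claude Code",
    model,
    command(prompt, options) {
      const args = ["--print", prompt, "--model", model];
      if (options?.skipPermissions) args.push("--dangerously-skip-permissions");
      return { command: "claude", args };
    },
    parseOutput(raw) {
      // Prefer .result, then .text, when the output is a JSON object.
      try {
        const parsed: unknown = JSON.parse(raw);
        if (parsed !== null && typeof parsed === "object") {
          const obj = parsed as Record<string, unknown>;
          if (typeof obj.result === "string") return obj.result;
          if (typeof obj.text === "string") return obj.text;
        }
      } catch {
        // Not JSON: fall through to the raw text.
      }
      return raw;
    },
  };
}

// Sketch of a stdin-based factory with passthrough parsing.
function gemini(model: string): ModelConfig {
  return {
    tool: "Gemini CLI",
    model,
    command(prompt, options) {
      const args = ["--model", model];
      if (options?.skipPermissions) args.push("-y");
      return { command: "gemini", args, stdin: prompt };
    },
    parseOutput: (raw) => raw,
  };
}
```

Because both factories return the same CommandSpec shape, the caller only has to check whether stdin is present; it never branches on the tool name.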

The registry contains 90+ model keys across 14 tools. Example entries per tool:

| Tool | Example keys |
| --- | --- |
| Claude Code | claude-code-opus-4-6, claude-code-sonnet-4-6, claude-code-sonnet-4-5, claude-code-haiku-4-5 |
| Gemini CLI | gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.0-flash |
| Codex CLI | codex-o3, codex-o4-mini, codex-gpt-4.1, codex-gpt-4.1-mini, codex-gpt-4.1-nano |
| Aider | aider-claude-opus-4-6, aider-gpt-4.1, aider-gemini-2.5-pro, aider-deepseek-r1, … |
| OpenCode | opencode-claude-opus-4-6, opencode-gpt-4.1, opencode-gemini-2.5-pro, … |
| Goose | goose-claude-opus-4-6, goose-gpt-4.1, goose-gemini-2.5-pro, … |
| Crush | crush-claude-opus-4-6, crush-gpt-4.1, crush-gemini-2.5-pro, … |
| Qwen | qwen3-coder-plus, qwen3-coder, qwen3-coder-fast |
| GitHub Copilot | copilot-claude-opus-4-6, copilot-gpt-4.1, copilot-gemini-2.5-pro, … |
| Forge | forge-claude-opus-4-6, forge-gpt-4.1, forge-gemini-2.5-pro, … |
| Plandex | plandex-default |
| OpenHands | openhands-default |
| Cursor | cursor-default |
| Amp | amp-default |

Run semtest list to see the full set.

const DEFAULT_MODEL: ModelKey = "claude-code-sonnet-4-6";
function resolveModel(key: ModelKey): ModelConfig

resolveModel() is a simple lookup — the ModelKey type is a union of all registered key strings, so invalid keys are caught at compile time.
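
One way the compile-time check could be achieved is to derive ModelKey from the registry object itself. The sketch below assumes that approach; the trimmed ModelConfig, the three sample entries, and the defineRegistry helper are illustrative and not taken from the source.

```typescript
// Trimmed ModelConfig: command() and parseOutput() are omitted for brevity.
interface ModelConfig {
  tool: string;
  model?: string;
}

// Hypothetical helper: checks every entry against ModelConfig while
// preserving the literal key names needed for the ModelKey union.
function defineRegistry<T extends Record<string, ModelConfig>>(entries: T): T {
  return entries;
}

const registry = defineRegistry({
  "claude-code-sonnet-4-6": { tool: "Claude Code", model: "claude-sonnet-4-6" },
  "gemini-2.5-pro": { tool: "Gemini CLI", model: "gemini-2.5-pro" },
  "plandex-default": { tool: "Plandex" },
});

// Union of every registered key string.
type ModelKey = keyof typeof registry;

const DEFAULT_MODEL: ModelKey = "claude-code-sonnet-4-6";

function resolveModel(key: ModelKey): ModelConfig {
  return registry[key];
}

// resolveModel("not-a-real-key"); // rejected by the compiler: not assignable to ModelKey
```

With this pattern, adding a registry entry extends ModelKey automatically, so the type and the lookup table cannot drift apart.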

executeTests() in src/runner/execute.ts is the core orchestrator:

  1. Iterates over tests sequentially (LLM CLIs are rate-limited)
  2. For each test:
    • Resolves per-test overrides from frontmatter (llm, timeout, skipPermissionsIfPossible)
    • Resolves the ModelConfig from the model key
    • Builds the prompt
    • Builds the CLI command via entry.command()
    • Spawns the process (src/utils/process.ts) with optional timeout
    • Parses the raw output via entry.parseOutput(), then extracts JSON results
    • Retries up to 3 times on empty responses
    • Fires progress callbacks for terminal output
  3. Supports repeat runs — each test can run N times, stopping on first failure
  4. Supports bail/maxfail — stops the entire suite after N failures
  5. Aggregates all results into a RunResult
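
The retry step in the list above can be sketched as a small wrapper around a single attempt. This is a hedged illustration: runWithRetry and its parameters are hypothetical names, and it assumes "up to 3 times" means up to three retries after the initial attempt and that whitespace-only output counts as empty.

```typescript
// Re-runs `attempt` while it yields an empty response, up to `maxRetries`
// extra times. `onTestRetry` mirrors the progress callback of the same name;
// its signature here is an assumption.
async function runWithRetry(
  attempt: () => Promise<string>,
  maxRetries = 3,
  onTestRetry?: (retryNumber: number) => void,
): Promise<string> {
  let output = await attempt();
  for (let retry = 1; retry <= maxRetries && output.trim() === ""; retry++) {
    onTestRetry?.(retry);
    output = await attempt();
  }
  // May still be empty if every attempt came back empty.
  return output;
}
```

If every attempt is empty, the caller receives the empty string and can record the test as an error rather than retrying indefinitely.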

Per-test configuration is resolved with frontmatter taking precedence over the global config:

const modelKey = test.frontmatter.llm ?? config.llm;
const timeout = test.frontmatter.timeout ?? config.timeout;
const skipPermissions = test.frontmatter.skipPermissionsIfPossible ?? config.skipPermissionsIfPossible;

This allows individual spec files to use a different LLM or timeout without changing the global configuration.

When config.repeat > 1, each test is run multiple times. If any repetition produces a failure or error, the loop breaks and uses that result. This is useful for flaky test detection — run each test 3 times to confirm consistency.
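
A minimal sketch of the repeat behaviour, assuming a synchronous runOnce callback for brevity (the real loop spawns processes and is asynchronous). The names runRepeated and runOnce are hypothetical.

```typescript
type Status = "pass" | "fail" | "error";

// Runs a test up to `repeat` times and stops at the first non-passing run,
// returning that run's status along with how many runs were executed.
function runRepeated(
  runOnce: (iteration: number) => Status,
  repeat: number,
): { status: Status; runs: number } {
  const total = Math.max(1, repeat); // always run at least once
  let status: Status = "pass";
  let runs = 0;
  for (let i = 1; i <= total; i++) {
    status = runOnce(i);
    runs = i;
    if (status !== "pass") break; // a failure or error ends the repetitions
  }
  return { status, runs };
}
```

A test that passes twice and fails on the third run is therefore reported as failed, which is what surfaces flakiness.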

The bail system provides early exit:

  • bail: true → stop after the first failing test file (equivalent to maxfail: 1)
  • bail: N (number) → stop after N failing test files
  • bail: false → run all tests regardless of failures
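
The three bail forms can be normalised to a single failure threshold. The sketch below assumes that approach; bailThreshold and shouldBail are hypothetical names, and treating zero or negative numbers as "never bail" is an assumption the source does not address.

```typescript
// Converts the bail setting into the number of failing files that stops the run.
function bailThreshold(bail: boolean | number): number {
  if (bail === true) return 1;          // same as maxfail: 1
  if (bail === false) return Infinity;  // never stop early
  return bail > 0 ? bail : Infinity;    // assumption: non-positive N disables bail
}

// True once enough test files have failed to stop the suite.
function shouldBail(failedFiles: number, bail: boolean | number): boolean {
  return failedFiles >= bailThreshold(bail);
}
```

The execution loop would then check shouldBail after each test file and fire onBail before exiting early.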

When a timeout is configured (via --timeout flag, config timeout, or frontmatter timeout), each LLM subprocess is given a time limit. If exceeded, the process receives SIGTERM, then SIGKILL after 5 seconds. The test result is marked as an error with a timeout message. Signal handlers are installed to clean up any active child processes if the main process is interrupted (SIGINT/SIGTERM).

TestRunResult describes each individual test result within a run, and RunResult aggregates them:

interface TestRunResult {
  id: string;              // Test ID from LLM, or "{filename}#{index}" fallback
  sourceFile: string;      // Filename (e.g. "auth-middleware.spec.md")
  sourceFilePath: string;  // Absolute path to the source file
  result: TestResult;      // Parsed result from the LLM
  group?: string;          // Directory-based group
}

interface RunResult {
  tests: TestRunResult[];
  summary: {
    total: number;
    passed: number;
    failed: number;
    errored: number;
    invalid: number;
    skipped: number;
  };
  status: "pass" | "fail" | "error";
  timestamp: string;
}

When skipPermissionsIfPossible is enabled (via config, CLI flag, or frontmatter), the command() function receives { skipPermissions: true }. Each factory function conditionally appends its tool-specific flag:

| Tool | Flag appended |
| --- | --- |
| Claude Code | --dangerously-skip-permissions |
| Gemini CLI | -y |
| Codex CLI | --dangerously-bypass-approvals-and-sandbox |
| Aider | --yes-always |
| Cursor | --force |
| Others | (none, silently ignored) |

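The overall run status follows the precedence error > fail > pass: any errored test makes the run an error; otherwise any failed test makes it a fail; otherwise it is a pass. Invalid and skipped results don’t affect the overall status.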
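
The precedence rule can be sketched as a function over the RunResult summary counts. The function name overallStatus is hypothetical; the summary fields match the RunResult interface shown earlier.

```typescript
interface RunSummary {
  total: number;
  passed: number;
  failed: number;
  errored: number;
  invalid: number;
  skipped: number;
}

// error outranks fail, which outranks pass; invalid and skipped are ignored.
function overallStatus(summary: RunSummary): "pass" | "fail" | "error" {
  if (summary.errored > 0) return "error";
  if (summary.failed > 0) return "fail";
  return "pass";
}
```

Note that under this rule a run consisting only of invalid or skipped results still reports pass.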

The execution loop accepts optional callbacks for live terminal feedback:

| Callback | Fires when |
| --- | --- |
| onTestStart | A test begins execution |
| onTestRetry | An empty response triggers a retry |
| onRepeatRun | A repeat iteration begins |
| onTestComplete | A test finishes (success or failure) |
| onDebugOutput | Raw stdout/stderr is captured (debug mode only) |
| onBail | The bail threshold is reached and execution stops early |

Source file: src/utils/process.ts

runCommand(spec, timeoutMs?) spawns an LLM CLI as a child process using child_process.spawn (no shell). It returns a Promise<LLMRawResult>:

interface LLMRawResult {
  stdout: string;
  stderr: string;
  exitCode: number;
  timedOut?: boolean;
  timeoutMessage?: string;
}

When timeoutMs is set and > 0, a timer sends SIGTERM on expiry. If the process doesn’t exit within 5 seconds, it escalates to SIGKILL. The result includes timedOut: true and a human-readable timeoutMessage.

On first invocation, runCommand installs SIGINT and SIGTERM handlers on the main process. If the user interrupts semtest (Ctrl+C), all active child processes are terminated before the main process exits (code 130 for SIGINT, 143 for SIGTERM).

If spec.stdin is set (as it is for the stdin-based tools; Claude Code and Aider pass the prompt as a command-line argument instead), it’s written to the child process’s stdin and the stream is closed. If not set, stdin is closed immediately.
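
The timeout escalation and stdin handling described above can be sketched with Node's child_process.spawn as follows. This is a simplified illustration, not the code in src/utils/process.ts: it omits the SIGINT/SIGTERM handlers on the main process, and the fallback exit code of 1 for a signal-terminated child and the wording of timeoutMessage are assumptions.

```typescript
import { spawn } from "node:child_process";

interface CommandSpec {
  command: string;
  args: string[];
  stdin?: string;
}

interface LLMRawResult {
  stdout: string;
  stderr: string;
  exitCode: number;
  timedOut?: boolean;
  timeoutMessage?: string;
}

const KILL_GRACE_MS = 5000; // delay between SIGTERM and SIGKILL

function runCommand(spec: CommandSpec, timeoutMs?: number): Promise<LLMRawResult> {
  return new Promise((resolve, reject) => {
    // No shell: the binary is executed directly with the given args.
    const child = spawn(spec.command, spec.args);
    let stdout = "";
    let stderr = "";
    let timedOut = false;
    let termTimer: ReturnType<typeof setTimeout> | undefined;
    let killTimer: ReturnType<typeof setTimeout> | undefined;

    child.stdout.on("data", (chunk: Buffer) => { stdout += chunk.toString(); });
    child.stderr.on("data", (chunk: Buffer) => { stderr += chunk.toString(); });

    if (timeoutMs !== undefined && timeoutMs > 0) {
      termTimer = setTimeout(() => {
        timedOut = true;
        child.kill("SIGTERM");
        // Escalate if the process ignores SIGTERM.
        killTimer = setTimeout(() => child.kill("SIGKILL"), KILL_GRACE_MS);
      }, timeoutMs);
    }

    const clearTimers = () => {
      clearTimeout(termTimer);
      clearTimeout(killTimer);
    };

    child.on("error", (err) => {
      clearTimers();
      reject(err);
    });

    child.on("close", (code) => {
      clearTimers();
      resolve({
        stdout,
        stderr,
        exitCode: code ?? 1, // code is null when killed by a signal (assumed fallback)
        ...(timedOut
          ? { timedOut: true, timeoutMessage: `Process timed out after ${timeoutMs}ms` }
          : {}),
      });
    });

    // Ignore EPIPE if the child exits before reading its input.
    child.stdin.on("error", () => {});
    if (spec.stdin !== undefined) child.stdin.write(spec.stdin);
    child.stdin.end(); // always close stdin so the child does not wait for input
  });
}
```

Closing stdin in both branches matters: a CLI that reads from stdin when no prompt argument is given would otherwise block until the timeout fires.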