Test agents - Keystroke

Agents are non-deterministic, so testing them starts with exercising behavior, not writing a perfect assertion. The fastest loop is often to have your coding agent prompt the Keystroke agent through the CLI, try realistic scenarios, inspect the resulting sessions, and iterate on the system instructions and tools. Test agents at the boundary you care about:

Test style	Use when
Qualitative CLI runs	You want to see how the agent behaves across realistic prompts, follow-ups, and tool-use cases
Definition tests	You want to verify tools, model, prompt, or metadata without calling an LLM
Prompt smoke tests	You want to verify the agent can run against a real model
Tool-use tests	You need evidence that the agent called a required action or subagent
End-to-end tests	You want to run the project server and test API behavior

Start qualitatively, then turn the stable contracts you discover into tests. Automated tests are best for definition shape, required tool calls, and a small number of critical prompt paths. Run tests from the project root:

keystroke test --project unit          # src/**/*.test.ts
keystroke test --project integration   # src/**/*.int.test.ts
pnpm test                              # equivalent; calls keystroke test

Vitest ships with @keystrokehq/cli. No project vitest.config.ts is required.

Qualitative tests

Before writing test files, run the agent the way you expect people to use it. Ask your coding agent to call the Keystroke CLI with a batch of prompts, inspect the sessions, and report where the agent misunderstood instructions, skipped tools, or used tools incorrectly.

keystroke agents prompt support --message "A customer asks whether they can get a refund after 45 days."
keystroke agents prompt support --message "Look up order ORD-123 and decide whether it is refundable."
keystroke agents prompt support --message "Draft a concise Slack reply for that customer."

Good qualitative prompts cover:

Normal requests the agent should handle cleanly.
Tool-required requests where the agent must call an action, workflow, MCP tool, or subagent.
Missing-information cases where the agent should ask a clarifying question instead of guessing.
Follow-up messages in the same session.
Edge cases that should be refused, escalated, or handled cautiously.

Then inspect the session:

keystroke agents sessions get support <session-id> --include messages,events,trace

Use this loop to tune the system prompt, tools, skills, files, and model choice. Once the behavior feels right, write focused tests for the parts that should not regress.
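A batch like the one described above can be kept as a small scenario table so that every category gets covered on each pass. The sketch below is only illustrative: the `Scenario` type, the category labels, the `promptArgs` helper, and the third message are assumptions, not part of Keystroke. It only builds the argument lists for the `keystroke agents prompt` command shown earlier and does not run anything.

```typescript
// Hypothetical scenario table; categories mirror the checklist of qualitative prompts.
type Scenario = { category: string; message: string };

const scenarios: Scenario[] = [
  { category: "normal", message: "A customer asks whether they can get a refund after 45 days." },
  { category: "tool-required", message: "Look up order ORD-123 and decide whether it is refundable." },
  // Illustrative message: no order ID is given, so a clarifying question is the hoped-for reply.
  { category: "missing-information", message: "Is my order refundable?" },
];

// Builds the arguments for `keystroke agents prompt <agent> --message <text>`.
// Keeping them as an array avoids shell-quoting problems in the message text.
function promptArgs(agent: string, message: string): string[] {
  return ["agents", "prompt", agent, "--message", message];
}

const batch = scenarios.map((scenario) => ({
  category: scenario.category,
  args: promptArgs("support", scenario.message),
}));
```

Each `args` array can then be passed to something like `execFileSync("keystroke", args, { encoding: "utf8" })` from `node:child_process`, or handed to your coding agent as the list of commands to run.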

Definition tests

Definition tests import the agent definition and assert on its fields. They never call an LLM, so they are fast and do not need provider keys.

import { describe, expect, it } from "vitest";
import support from "./support";

describe("support agent", () => {
  it("uses the expected model and support files", () => {
    expect(support.slug).toBe("support");
    expect(support.model).toBe("anthropic/claude-sonnet-4.6");
    expect(support.systemPrompt).toContain("/workspace");
  });
});

Use this style to catch accidental model changes, missing tools, or prompt edits that remove required instructions.
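For the missing-tools case, a small set-difference helper keeps the failure message readable. This is a sketch under assumptions: how tool names are exposed on a real agent definition is not shown in this page, so the `definition` object, its `tools` array, and the tool names below are hypothetical stand-ins.

```typescript
// Returns the required names that are absent from `actual`.
function missingNames(actual: readonly string[], required: readonly string[]): string[] {
  const present = new Set(actual);
  return required.filter((name) => !present.has(name));
}

// Hypothetical stand-in for an imported agent definition; the `tools` field
// and both tool names are assumptions made for illustration.
const definition = { slug: "support", tools: ["lookup_order", "ask_researcher"] };

const missing = missingNames(definition.tools, ["lookup_order", "issue_refund"]);
// `missing` now holds the required tools the definition does not declare.
```

In a definition test, asserting `expect(missing).toEqual([])` reports exactly which tools were dropped, rather than only that a check failed.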

Prompt smoke tests

The init template includes an agent integration test shaped like this:

import { describe, expect, it } from "vitest";
import hello from "./hello";

describe("hello agent", () => {
  it.skipIf(!process.env.ANTHROPIC_API_KEY)("responds to a prompt", async () => {
    const result = await hello.prompt({ message: "Say hi in one word." });

    expect(result.messages.some((message) => message.role === "assistant")).toBe(true);
  });
});

The provider-key guard keeps local and CI runs from failing when real model credentials are not available. Run integration tests from the project root:

keystroke test --project integration
# or
pnpm test -- --project integration

Integration tests load .env when present and skip when required keys (like ANTHROPIC_API_KEY) are unset. Vitest and its config ship with @keystrokehq/cli, so the project does not need its own vitest.config.ts.
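When a test depends on more than one credential, the same guard can be driven by a list of key names. The helper below is a minimal sketch: `missingKeys` and the `Env` type are assumptions introduced here, and `OPENAI_API_KEY` is only an example of a second key.

```typescript
type Env = Record<string, string | undefined>;

// Returns the required keys that are unset or empty in `env`.
function missingKeys(env: Env, required: readonly string[]): string[] {
  return required.filter((key) => !env[key]);
}

// Example environment for illustration; in a test file you would pass process.env.
const exampleEnv: Env = { ANTHROPIC_API_KEY: "sk-example", OPENAI_API_KEY: "" };

const unset = missingKeys(exampleEnv, ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]);
```

The result plugs into the existing guard, for example `it.skipIf(missingKeys(process.env, ["ANTHROPIC_API_KEY"]).length > 0)(...)`, so a test is skipped whenever any required key is absent.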

Tool-use tests

When a tool call is the contract, assert on the recorded messages rather than only the final answer.

import { describe, expect, it } from "vitest";
import orchestrator from "./orchestrator";

function usedTool(messages: Array<{ role: string; toolName?: string }>, toolName: string) {
  return messages.some((message) => message.role === "toolResult" && message.toolName === toolName);
}

describe("orchestrator agent", () => {
  it.skipIf(!process.env.ANTHROPIC_API_KEY)("delegates research", async () => {
    const result = await orchestrator.prompt({
      message: "Research whether the sky is blue. You must use ask_researcher.",
    });

    expect(result.error).toBeNull();
    expect(usedTool(result.messages, "ask_researcher")).toBe(true);
  });
});

Keep the prompt narrow. Tests that ask for broad natural-language behavior are more likely to be flaky than tests that assert a specific tool contract.
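If the contract is an ordering of tools rather than a single call, the same recorded messages can be reduced to a sequence. The sketch below assumes the message shape used by `usedTool` above (`role` plus an optional `toolName`); `toolSequence`, `calledInOrder`, the `summarize` tool name, and the sample messages are hypothetical.

```typescript
type RecordedMessage = { role: string; toolName?: string };

// Lists tool names in the order their results were recorded.
function toolSequence(messages: readonly RecordedMessage[]): string[] {
  const names: string[] = [];
  for (const message of messages) {
    if (message.role === "toolResult" && message.toolName !== undefined) {
      names.push(message.toolName);
    }
  }
  return names;
}

// True when `expected` appears as an ordered subsequence of the recorded tool
// results; other tool calls in between are allowed.
function calledInOrder(messages: readonly RecordedMessage[], expected: readonly string[]): boolean {
  let next = 0;
  for (const name of toolSequence(messages)) {
    if (next < expected.length && name === expected[next]) next += 1;
  }
  return next === expected.length;
}

// Sample transcript for illustration only.
const sample: RecordedMessage[] = [
  { role: "user" },
  { role: "toolResult", toolName: "ask_researcher" },
  { role: "assistant" },
  { role: "toolResult", toolName: "summarize" },
  { role: "assistant" },
];
```

Allowing unrelated calls between the expected ones keeps the assertion tolerant of extra tool use, which tends to reduce flakiness compared with an exact-match comparison.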

Sessions and memory

Prompt tests create sessions. If a test should be repeatable, use a fresh session or disable memory on the agent under test:

export default defineAgent({
  slug: "classifier",
  systemPrompt: "Classify the message and return only the label.",
  model: "anthropic/claude-sonnet-4.6",
  memory: false,
});

If you need a multi-turn test, keep the returned sessionId and pass it to the next prompt:

const first = await support.prompt({ message: "Remember the word orchid." });
const second = await support.prompt({
  sessionId: first.sessionId,
  message: "What word did I ask you to remember?",
});
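For longer conversations, threading the session ID by hand gets repetitive. One possible helper is sketched below. It assumes only what the snippet above shows, namely that `prompt` accepts `message` and an optional `sessionId` and resolves to an object carrying `sessionId`; the type names, `runTurns`, and the fake agent are hypothetical and exist only so the example is self-contained.

```typescript
type PromptInput = { message: string; sessionId?: string };
type PromptResult = { sessionId: string };
type PromptableAgent<R extends PromptResult> = {
  prompt(input: PromptInput): Promise<R>;
};

// Sends each message in turn, reusing the session ID returned by the previous turn.
async function runTurns<R extends PromptResult>(
  agent: PromptableAgent<R>,
  messages: readonly string[],
): Promise<R[]> {
  const results: R[] = [];
  let sessionId: string | undefined;
  for (const message of messages) {
    const input: PromptInput = sessionId ? { sessionId, message } : { message };
    const result = await agent.prompt(input);
    sessionId = result.sessionId;
    results.push(result);
  }
  return results;
}

// Fake agent for illustration: records inputs and echoes or creates a session ID.
const seen: PromptInput[] = [];
const fakeAgent: PromptableAgent<PromptResult> = {
  async prompt(input) {
    seen.push(input);
    return { sessionId: input.sessionId ?? "session-1" };
  },
};
```

In a real test, `await runTurns(support, ["Remember the word orchid.", "What word did I ask you to remember?"])` would return one result per turn, and the assertion would go on the last result.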

Failure inspection

When a prompt test fails, inspect the session before changing code:

keystroke agents sessions list support --status failed
keystroke agents sessions get support <session-id> --include messages,events,trace

For deployed agents, use History in the web app and filter to agent runs. The detail panel shows messages, tool calls, metadata, and trace data.

Next steps

Run agents

Prompt agents locally and inspect sessions from the CLI.

Agent runs

Debug failed sessions in the web app.

Actions as tools

Build deterministic tool contracts that are easier to test.

Deploy a project

Run tests before deploying changed agents.