Agent Eval Suite Langsmith

Production agent eval suite LangSmith dataset curation + Promptfoo assertion framework +…

A production-grade evaluation suite for AI agents that combines curated golden datasets, an assertion framework, and a CI gate. It replaces subjective manual spot-checks with an automated quality gate on every change: regression, adversarial, and calibration tests must pass before an agent ships.

$15 one-time

Prices include 20% VAT. · Forged on real agency work · one-time, no lock-in

Type: Skill
Category: AI & LLM
Delivery: Email · instant
License: One-time

Inside the run · no black box

See the actual work before you buy it.

Every pull request that touches the agent has to clear this gate: a checksummed golden dataset, four stacked assertion types and a 95% pass bar. Here is how it runs:

Harvests golden examples from production traces: pulls high-scored runs, strips personal data (emails, phones, ID numbers, API keys) with regex, and commits the dataset to Git with a SHA256 checksum
On every pull request, verifies the dataset hash against the committed checksum first, so nobody can quietly tamper with the test set
Runs the eval suite in parallel (50+ examples per agent, concurrency 10) against the live production system prompt referenced by file path, never a stale copy
Chains four assertion types per output: regex bans on forbidden words, JSON schema validation, an independent judge model scoring a 5-point rubric, and embedding similarity against the expected answer
Computes pass rate, Brier score and calibration error from the results; the merge gate blocks below 95% pass or above 0.15 Brier, because an overconfident agent is a liability
Posts the report as a PR comment, uploads findings to the repo security tab, and fires an instant alert when the nightly full run catches a regression
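The four chained assertion types can be sketched in a few lines. This is a minimal illustration and not the shipped suite: the banned words, the `REQUIRED_KEYS` stand-in for a real JSON schema, the `min_judge` and `min_sim` thresholds, and the `check_output` name are all assumptions. The judge score and the embedding vectors are taken as inputs, since producing them depends on whichever judge model and embedding model you use.

```python
import json
import math
import re

# Hypothetical banned words, for illustration only.
FORBIDDEN = re.compile(r"\b(guarantee|refund)\b", re.IGNORECASE)
# Stand-in for a real JSON schema: required key -> expected Python type.
REQUIRED_KEYS = {"answer": str, "confidence": float}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_output(raw, judge_score, out_vec, expected_vec, min_judge=4, min_sim=0.8):
    """Return the names of failed assertions; an empty list means the output passes."""
    failures = []
    # 1. Regex ban on forbidden words.
    if FORBIDDEN.search(raw):
        failures.append("forbidden_word")
    # 2. Structural check on the JSON payload.
    try:
        data = json.loads(raw)
        ok = isinstance(data, dict) and all(
            isinstance(data.get(key), kind) for key, kind in REQUIRED_KEYS.items()
        )
    except json.JSONDecodeError:
        ok = False
    if not ok:
        failures.append("schema")
    # 3. Independent judge score on the 5-point rubric (assumed threshold).
    if judge_score < min_judge:
        failures.append("judge")
    # 4. Embedding similarity against the expected answer (assumed threshold).
    if cosine(out_vec, expected_vec) < min_sim:
        failures.append("similarity")
    return failures
```

Returning every failed assertion, rather than stopping at the first, keeps the report useful when one output breaks several rules at once.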

Use cases · what happens when you plug it in

One power source. 6 lines out.

Adding regression tests for agents before merging changes
Building a curated golden dataset of 50+ examples per agent
Scoring agent output with an independent LLM-as-judge rubric
Blocking merges automatically when pass rate or calibration drops
Red-teaming agents with prompt injection and jailbreak cases
Measuring overconfidence with Brier score calibration
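
For readers new to the Brier score, here is a rough sketch of how a merge gate on pass rate and calibration could be computed. The 95% and 0.15 thresholds come from the run description above; the function names and the result dictionary are assumptions made for illustration. Each example contributes the agent's stated confidence (0 to 1) and whether it actually passed (1 or 0).

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the actual result (1 pass, 0 fail)."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


def gate(confidences, outcomes, min_pass=0.95, max_brier=0.15):
    """Allow the merge only if the pass rate is high enough and the Brier score low enough."""
    pass_rate = sum(outcomes) / len(outcomes)
    brier = brier_score(confidences, outcomes)
    return {
        "pass_rate": pass_rate,
        "brier": brier,
        "merge_allowed": pass_rate >= min_pass and brier <= max_brier,
    }
```

A lower Brier score is better: 0.0 means every confidence matched the outcome exactly, while an agent that claims 0.9 confidence on an answer that fails contributes 0.81 to the average for that example.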

Benefits · what you walk away with

Yours to keep.

Forever

That's what owning means.

The rented stack

ai writing tool: subscription (expired · access lost)
analytics suite: subscription (expired · access lost)
design platform: subscription (expired · access lost)

(nothing left)

Your forge

Catch quality regressions before they reach users, not three weeks after
Move from roughly 20% spot-check coverage to 100% on every change
Replace 'looks good to me' with objective pass-rate and calibration gates
Surface exactly which test failed and why, with full reasoning traces

license: perpetual

subscriptions expire · deeds don't

What's included · the full manifest

Everything in the box.

Golden dataset curation script with built-in PII stripping
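
To give a feel for what regex-based PII stripping and dataset checksumming involve, here is a minimal sketch. It is not the shipped script: the patterns, the placeholder labels, the `sk-` key prefix, and the function names are assumptions, and it covers only emails, phone numbers and API keys. Real-world PII coverage needs broader, tested patterns.

```python
import hashlib
import json
import re

# Hypothetical patterns for illustration. Order matters: API keys are replaced
# before phone numbers so digit runs inside a key are not mistaken for a phone.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def scrub(text: str) -> str:
    """Replace each PII match with a placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def dataset_checksum(examples: list[dict]) -> str:
    """SHA256 over a canonical JSON serialization, so key order does not change the hash."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

Committing the hex digest next to the dataset lets a later CI step recompute `dataset_checksum` and fail fast if the two values differ.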

part 01 of 06 · in the box

6 parts · one working system · ships instantly by email

Who it's for

This wasn't forged for everyone.

Not for you if you'd rather rent a tool than own one.
Not for you if you want someone else to run your stack.
Not for you if you're happy guessing.

Still here? Good.

AI engineering teams running multiple production agents who need automated, objective quality gates instead of manual review.

then this was forged for you.

Works with

Universal by design: these run in any AI. Delivered in the open Agent Skills + MCP format (native in Claude); ChatGPT, Gemini, Cursor and Copilot adapt the same files their own way.

Claude: native format
ChatGPT: adapts via open standards
Gemini: adapts via open standards
Cursor: adapts via open standards
Copilot: adapts via open standards

Questions · still in the air

Catch what's on your mind.

Does it slot into my existing CI, or do I need a separate pipeline?

It is built as a CI gate, so regression and adversarial tests run on every change inside the pipeline you already merge through. A change that fails the gate does not ship.

Can I trust an LLM-as-judge to score another model's output?

That is exactly why calibration tests sit alongside the regression and adversarial ones, checking the judge against known-correct examples. The judge model is also independent of the agent being graded, so the agent is not marking its own work.

Does it write the golden dataset and fix the failures for me?

No. You curate the 50+ examples per agent that define correct behavior, and the suite enforces them. It flags regressions; fixing the agent is your engineering work.

How is it delivered?

By email right after purchase: ready to run, downloaded instantly, no setup wait.

One-time or subscription?

A one-time purchase; no subscription or hidden fees. VAT (20%) is included.

Can I get a refund?

As a digital product, it can’t be refunded once downloaded. That’s why we show exactly what’s inside and who it’s for, right here.
