Agent Eval Suite LangSmith

Production agent eval suite: LangSmith dataset curation + Promptfoo assertion framework +…

A production-grade evaluation suite for AI agents that combines curated golden datasets, an assertion framework, and a CI gate. It replaces subjective manual spot-checks with an automated quality gate on every change: regression, adversarial, and calibration tests must pass before an agent ships.

$15 one-time

Prices include 20% VAT · Forged on real agency work · One-time, no lock-in

  • Type: Skill
  • Category: AI & LLM
  • Delivery: Email · instant
  • License: One-time
forgehouse, agent-eval-suite-langsmith

Inside the run · no black box

See the actual work before you buy it.

Every pull request that touches the agent has to clear this gate: a checksummed golden dataset, four stacked assertion types and a 95% pass bar. Here is how it runs:

  1. Harvests golden examples from production traces: pulls high-scored runs, strips personal data and secrets (emails, phone numbers, ID numbers, API keys) with regex, and commits the dataset to Git with a SHA256 checksum
  2. On every pull request, verifies the dataset hash against the committed checksum first, so nobody can quietly tamper with the test set
  3. Runs the eval suite in parallel (50+ examples per agent, concurrency 10) against the live production system prompt referenced by file path, never a stale copy
  4. Chains four assertion types per output: regex bans on forbidden words, JSON schema validation, an independent judge model scoring a 5-point rubric, and embedding similarity against the expected answer
  5. Computes pass rate, Brier score and calibration error from the results; the merge gate blocks the change when the pass rate falls below 95% or the Brier score rises above 0.15, because an overconfident agent is a liability
  6. Posts the report as a PR comment, uploads findings to the repo security tab, and fires an instant alert when the nightly full run catches a regression
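
The gate arithmetic in step 5 can be sketched in a few lines of Python. This is a minimal illustration, not the shipped code: the `EvalResult` record, the function names, and the 10-bin expected calibration error are assumptions made for the example. Only the 95% pass bar and the 0.15 Brier ceiling come from the steps above.

```python
from dataclasses import dataclass
from typing import Sequence

PASS_RATE_FLOOR = 0.95  # pass bar stated in the steps above
BRIER_CEILING = 0.15    # Brier ceiling stated in the steps above


@dataclass(frozen=True)
class EvalResult:
    """One graded example (hypothetical record shape)."""
    passed: bool       # True when every assertion passed for this example
    confidence: float  # agent's stated probability of being correct, in [0, 1]


def pass_rate(results: Sequence[EvalResult]) -> float:
    # Share of examples that passed; assumes a non-empty result set.
    return sum(r.passed for r in results) / len(results)


def brier_score(results: Sequence[EvalResult]) -> float:
    # Mean squared gap between stated confidence and the 0/1 outcome.
    return sum((r.confidence - float(r.passed)) ** 2 for r in results) / len(results)


def expected_calibration_error(results: Sequence[EvalResult], bins: int = 10) -> float:
    # Group examples into equal-width confidence bins, then take the
    # size-weighted average of |mean confidence - accuracy| per bin.
    buckets = [[] for _ in range(bins)]
    for r in results:
        buckets[min(int(r.confidence * bins), bins - 1)].append(r)
    ece = 0.0
    for bucket in buckets:
        if bucket:
            avg_conf = sum(r.confidence for r in bucket) / len(bucket)
            accuracy = sum(r.passed for r in bucket) / len(bucket)
            ece += len(bucket) / len(results) * abs(avg_conf - accuracy)
    return ece


def merge_allowed(results: Sequence[EvalResult]) -> bool:
    # Block when the pass rate is below the floor or the Brier score is above the ceiling.
    return pass_rate(results) >= PASS_RATE_FLOOR and brier_score(results) <= BRIER_CEILING
```

With 19 of 20 examples passing at a stated confidence of 0.9, the pass rate is 0.95 and the Brier score is 0.05, so this sketch lets the merge through. An agent that claims full confidence on every answer but fails 4 of 20 scores a Brier of 0.2 and is blocked.
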
Use cases · what happens when you plug it in

One power source. 6 lines out.

  1. Adding regression tests for agents before merging changes
  2. Building a curated golden dataset of 50+ examples per agent
  3. Scoring agent output with an independent LLM-as-judge rubric
  4. Blocking merges automatically when pass rate or calibration drops
  5. Red-teaming agents with prompt injection and jailbreak cases
  6. Measuring overconfidence with Brier score calibration

Benefits · what you walk away with

Yours to keep.

Forever

That's what owning means.

The rented stack

  • ai writing tool: subscription (expired · access lost)
  • analytics suite: subscription (expired · access lost)
  • design platform: subscription (expired · access lost)

(nothing left)

Your forge

  1. Catch quality regressions before they reach users, not three weeks after
  2. Move from roughly 20% spot-check coverage to 100% on every change
  3. Replace 'looks good to me' with objective pass-rate and calibration gates
  4. Surface exactly which test failed and why, with full reasoning traces

license: perpetual

subscriptions expire · deeds don't

What's included · the full manifest

Everything in the box.

Golden dataset curation script with built-in PII stripping

part 01 of 06 · in the box
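
As a rough illustration of what a curation step with PII stripping can look like, the sketch below masks a few patterns and computes a SHA256 checksum over a canonical JSON serialization. It is not the shipped script: the regex patterns, placeholder labels, and function names are assumptions chosen for the example, and national ID formats are left out because they are locale-specific.

```python
import hashlib
import json
import re

# Illustrative patterns only (assumed, not from the product); real PII
# detection needs broader, locale-aware rules. API keys are masked first so
# the phone pattern cannot match digits inside a key.
PII_PATTERNS = {
    "API_KEY": re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{16,}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def strip_pii(text: str) -> str:
    """Replace each matched span with a bracketed placeholder label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def dataset_checksum(examples: list) -> str:
    """SHA256 over a canonical JSON form, so key order cannot change the hash."""
    canonical = json.dumps(
        examples, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_dataset(examples: list, committed_checksum: str) -> bool:
    """True when the dataset still matches the checksum committed to Git."""
    return dataset_checksum(examples) == committed_checksum
```

A CI job built along these lines would recompute the checksum on every pull request and stop before running any evals if `verify_dataset` returns False.
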

6 parts · one working system · ships instantly by email

Who it's for

This wasn't forged for everyone.

  • Not for you if you'd rather rent a tool than own one.
  • Not for you if you want someone else to run your stack.
  • Not for you if you're happy guessing.
Still here? Good.

AI engineering teams running multiple production agents who need automated, objective quality gates instead of manual review.

then this was forged for you.

Works with

Universal by design: the skill is not tied to one AI tool. It is delivered in the open Agent Skills + MCP format (native in Claude); ChatGPT, Gemini, Cursor and Copilot adapt the same files in their own way.

  • Claude: native format
  • ChatGPT: adapts via open standards
  • Gemini: adapts via open standards
  • Cursor: adapts via open standards
  • Copilot: adapts via open standards
Questions · still in the air

Catch what's on your mind.

  1. Does it slot into my existing CI, or do I need a separate pipeline?

    It is built as a CI gate, so regression and adversarial tests run on every change inside the pipeline you already merge through. A change that fails the gate does not ship.

  2. Can I trust an LLM-as-judge to score another model's output?

    That is exactly why calibration tests sit alongside the regression and adversarial ones, checking the judge against known-correct examples. The rubric is also independent of the agent being graded, so it is not marking its own work.

  3. Does it write the golden dataset and fix the failures for me?

    No, you curate the 50+ examples per agent that define correct behavior, and the suite enforces them. It flags regressions; fixing the agent is your engineering work.

  4. How is it delivered?

    By email right after purchase: ready to run, downloaded instantly, no setup wait.

  5. One-time or subscription?

    A one-time purchase; no subscription or hidden fees. VAT (20%) is included.

  6. Can I get a refund?

    As a digital product, it can’t be refunded once downloaded. That’s why we show exactly what’s inside and who it’s for, right here.