
Using Your API Docs as an Evaluation Set for LLMs

Every code example in your docs is a task and an expected answer. Score how well a model uses them and you have a working benchmark for both the model and the docs.

March 4, 2026 · 2 min read

Most API docs contain dozens of code examples. Each one is a small task: to do X, send Y. That structure makes them a usable evaluation set for measuring how well a language model can act on your docs.

The same set tells you something about the docs themselves. Where the model picks the wrong endpoint or invents a parameter, you probably have a gap a human would also miss on first read.

Extracting cases

Walk through every example in your reference and quickstart. For each one, write down:

  • The task in plain English, copied from the section heading.
  • The expected request: method, path, headers that matter, body.
  • The expected response shape, if it appears.

Save the cases as a JSON file. A hundred cases is plenty to start with.
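As a rough illustration, one case could be stored like this. The field names, the `EvalCase` class, and the `/v1/customers` endpoint are all made up for the sketch; use whatever shape fits your own docs.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    """One doc example turned into a test case (hypothetical field names)."""
    task: str                                     # plain-English task, from the section heading
    method: str                                   # expected HTTP method
    path: str                                     # expected path
    headers: dict = field(default_factory=dict)   # only the headers that matter
    body: dict = field(default_factory=dict)      # expected request body
    response_shape: Optional[dict] = None         # only if the docs show a response


def save_cases(cases, filename):
    """Write all cases to a single JSON file."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump([asdict(c) for c in cases], f, indent=2)


def load_cases(filename):
    """Read the cases back as EvalCase objects."""
    with open(filename, encoding="utf-8") as f:
        return [EvalCase(**raw) for raw in json.load(f)]


# An invented example case; the endpoint is not from any real API.
cases = [
    EvalCase(
        task="Create a customer",
        method="POST",
        path="/v1/customers",
        headers={"Content-Type": "application/json"},
        body={"email": "ada@example.com"},
        response_shape={"id": "string", "email": "string"},
    )
]

if __name__ == "__main__":
    save_cases(cases, "cases.json")
```

Keeping the task text separate from the expected request matters later: the model only ever sees the task, and the rest is the answer key.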

Running the eval

Pick a model. Give it your docs (or your spec) in context. For each case, prompt with just the task description and ask the model to produce a request.
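A minimal sketch of that loop might look like the following. The prompt wording, the reply format, and the `ask_model` callable are assumptions; `ask_model` stands in for whichever model client you use and is expected to take a prompt string and return the model's text reply.

```python
import json
from typing import Callable, Optional

# Hypothetical prompt; the only hard requirement is that the model sees the
# docs and the task, and never the expected request.
PROMPT_TEMPLATE = """You are given API documentation.

{docs}

Task: {task}

Reply with one JSON object with the keys "method", "path", "headers", and "body"."""


def build_prompt(docs: str, task: str) -> str:
    return PROMPT_TEMPLATE.format(docs=docs, task=task)


def parse_request(reply: str) -> Optional[dict]:
    """Return the model's request as a dict, or None if the reply is not a JSON object."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def run_eval(cases: list, docs: str, ask_model: Callable[[str], str]) -> list:
    """Ask the model once per case and collect the raw, unscored requests."""
    results = []
    for case in cases:
        reply = ask_model(build_prompt(docs, case["task"]))
        results.append({"task": case["task"], "request": parse_request(reply)})
    return results
```

A reply that does not parse is recorded as `None` rather than retried, so it counts as a failure when scored.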

Score on three things:

  • Did it pick the right endpoint?
  • Did it set the required parameters?
  • Did the request match the schema?
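The three checks above can be sketched as one scoring function. The schema format here is a deliberately simplified stand-in (a list of required names plus a map from parameter name to Python type), not JSON Schema or OpenAPI; in practice you would likely validate against your real spec. `CREATE_CUSTOMER_SCHEMA` and the endpoint are invented for the example.

```python
from typing import Optional

# Hypothetical, simplified schema for one endpoint's request body.
CREATE_CUSTOMER_SCHEMA = {
    "required": ["email"],
    "properties": {"email": str, "name": str},
}


def score_case(expected: dict, actual: Optional[dict], schema: dict) -> dict:
    """Score one model-produced request on endpoint, required params, and schema."""
    if actual is None:  # the model's reply could not be parsed at all
        return {"endpoint": False, "required": False, "schema": False}

    body = actual.get("body")
    if body is None:
        body = {}
    body_is_object = isinstance(body, dict)

    # 1. Right endpoint: same method (case-insensitive) and same path.
    endpoint_ok = (
        str(actual.get("method", "")).upper() == expected["method"].upper()
        and actual.get("path") == expected["path"]
    )

    # 2. Every required parameter is present.
    required_ok = body_is_object and all(
        name in body for name in schema["required"]
    )

    # 3. No invented parameters, and each value has the declared type.
    #    Required-ness is scored separately in check 2.
    schema_ok = body_is_object and all(
        name in schema["properties"]
        and isinstance(value, schema["properties"][name])
        for name, value in body.items()
    )

    return {"endpoint": endpoint_ok, "required": required_ok, "schema": schema_ok}
```

Returning the three results separately, instead of a single pass or fail, makes it easier to tell a wrong-endpoint failure from an invented-parameter one when you go back to the doc page.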

For each failure, look at the doc page the model would have used. The fix is usually a sentence, not a rewrite.

What you learn

You learn which endpoints are confusing in isolation. You learn which parameter names get mixed up with each other. You learn where examples disagree with the schema.

You also get a number that can move over time. Run it on every doc change and you can tell whether the docs are getting easier or harder for a model to use.
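One way to track that number, sketched under the assumption that each run is stored as a mapping from task to its three check results (the data shape and function names are illustrative):

```python
def case_passed(score: dict) -> bool:
    """A case passes only if every check passed."""
    return all(score.values())


def pass_rate(run: dict) -> float:
    """Fraction of cases in a run that passed every check."""
    if not run:
        return 0.0
    return sum(case_passed(score) for score in run.values()) / len(run)


def regressions(before: dict, after: dict) -> list:
    """Tasks that passed before a doc change and fail after it."""
    return sorted(
        task
        for task, score in before.items()
        if case_passed(score) and task in after and not case_passed(after[task])
    )


# Invented results from two runs, one before and one after a doc change.
before = {
    "Create a customer": {"endpoint": True, "required": True, "schema": True},
    "Refund a charge": {"endpoint": True, "required": False, "schema": True},
}
after = {
    "Create a customer": {"endpoint": False, "required": True, "schema": True},
    "Refund a charge": {"endpoint": True, "required": True, "schema": True},
}
```

In this made-up data the pass rate is the same before and after, yet one task regressed. Looking at the list of regressions alongside the headline number catches doc changes that fix one page and break another.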

A few things to watch:

  • Spec-in-prompt and retrieval produce different scores. Test both and pick the setup closest to how your users actually feed your API to models.
  • Score the first response, not the result after retries. Real callers do not always retry.
  • A 90% pass rate is not a victory. The 10% are usually your most-used endpoints.
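To see why the last point matters, it can help to weight each endpoint by how often it is actually called. The sketch below assumes you have per-endpoint call counts from your own analytics; the endpoint names and traffic figures are invented.

```python
from collections import defaultdict


def raw_pass_rate(results: list) -> float:
    """results is a list of (endpoint, passed) pairs, one per case."""
    if not results:
        return 0.0
    return sum(1 for _, passed in results if passed) / len(results)


def weighted_pass_rate(results: list, traffic: dict) -> float:
    """Pass rate per endpoint, weighted by that endpoint's share of calls."""
    per_endpoint = defaultdict(list)
    for endpoint, passed in results:
        per_endpoint[endpoint].append(passed)

    total_weight = 0.0
    weighted_sum = 0.0
    for endpoint, outcomes in per_endpoint.items():
        weight = traffic.get(endpoint, 0)  # endpoints without traffic data count as 0
        rate = sum(outcomes) / len(outcomes)
        weighted_sum += rate * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0


# Invented data: ten cases, nine pass, and the one failure is on the busiest endpoint.
results = (
    [("POST /v1/charges", False)]
    + [("GET /v1/customers", True)] * 5
    + [("POST /v1/refunds", True)] * 4
)
traffic = {"POST /v1/charges": 600, "GET /v1/customers": 300, "POST /v1/refunds": 100}
```

With these numbers the raw pass rate is 0.9, while the traffic-weighted rate is 0.4, because the single failing case sits on the endpoint that receives most of the calls.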