Using Your API Docs as an Evaluation Set for LLMs

Most API docs contain dozens of code examples. Each one is a small task: to do X, send Y. That structure makes them a usable evaluation set for measuring how well a language model can act on your docs.

The same set tells you something about the docs themselves. Where the model picks the wrong endpoint or invents a parameter, you probably have a gap a human would also miss on first read.

Extracting cases

Walk through every example in your reference and quickstart. For each one, write down:

The task in plain English, copied from the section heading.
The expected request: method, path, headers that matter, body.
The expected response shape, if it appears.

Save as a JSON file. A hundred cases is plenty to start with.

Running the eval

Pick a model. Give it your docs (or your spec) in context. For each case, prompt with just the task description and ask the model to produce a request.

Score on three things:

Did it pick the right endpoint?
Did it set the required parameters?
Did the request match the schema?

For each failure, look at the doc page the model would have used. The fix is usually a sentence, not a rewrite.

What you learn

You learn which endpoints are confusing in isolation. You learn which parameter names get mixed up with each other. You learn where examples disagree with the schema.

You also get a number that can move over time. Run it on every doc change and you can tell whether the docs are getting easier or harder for a model to use.

A few things to watch:

Spec-in-prompt and retrieval produce different scores. Test both and pick the setup closest to how your users actually feed your API to models.
Score the first response, not after retries. Real callers do not always retry.
A 90% pass rate is not a victory. The 10% are usually your most-used endpoints.

Using Your API Docs as an Evaluation Set for LLMs

Extracting cases

Running the eval

What you learn

What to read next

How Stale Documentation Affects Token Usage and Agent Behavior

What LLMs Tend to Hallucinate When API Docs Are Incomplete

Turning Support Tickets into Documentation Updates with an LLM in the Loop