Most API docs contain dozens of code examples. Each one is a small task: to do X, send Y. That structure makes them a usable evaluation set for measuring how well a language model can act on your docs.
The same set tells you something about the docs themselves. Where the model picks the wrong endpoint or invents a parameter, you probably have a gap that would also trip up a human on first read.
Extracting cases
Walk through every example in your reference and quickstart. For each one, write down:
- The task in plain English, copied from the section heading.
- The expected request: method, path, headers that matter, body.
- The expected response shape, if it appears.
Save the cases as a JSON file. A hundred cases is plenty to start with.
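One possible shape for a case, as a minimal sketch. The field names (`id`, `task`, `expected_request`, `expected_response`) and the `/v1/widgets` endpoint are made up for illustration; use whatever matches your own docs.

```python
import json

# Hypothetical case. The endpoint, header, and body fields are placeholders,
# not taken from any real API.
cases = [
    {
        "id": "create-widget",
        "task": "Create a widget",
        "expected_request": {
            "method": "POST",
            "path": "/v1/widgets",
            "headers": {"Content-Type": "application/json"},
            "body": {"name": "example"},
        },
        # Response shape is optional; include it only if the docs show one.
        "expected_response": {"id": "string", "name": "string"},
    }
]


def save_cases(cases, path):
    """Write the list of cases to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cases, f, indent=2)


def load_cases(path):
    """Read the list of cases back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Keeping one flat list in one file makes it easy to diff the cases alongside the docs they came from.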
Running the eval
Pick a model. Give it your docs (or your spec) in context. For each case, prompt with just the task description and ask the model to produce a request.
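A rough sketch of the prompt step might look like the following. The prompt wording, the `<docs>` wrapper, and the JSON reply format are assumptions, not a recommended template, and the model call itself is left out because it depends on your provider.

```python
import json


def build_prompt(docs, task):
    """Put the docs in context, then state only the task description."""
    return (
        "You are given API documentation.\n\n"
        f"<docs>\n{docs}\n</docs>\n\n"
        f"Task: {task}\n\n"
        "Reply with one JSON object with the keys "
        '"method", "path", "headers", and "body". No other text.'
    )


def parse_request(reply):
    """Return the request as a dict, or None if the reply is not a JSON object."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None
```

Asking for a fixed JSON shape keeps scoring mechanical. A reply that fails to parse can simply count as a failed case.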
Score on three things:
- Did it pick the right endpoint?
- Did it set the required parameters?
- Did the request match the schema?
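The three checks can be sketched as one function. This is a simplified stand-in: it assumes you can list the required and allowed body parameters for each endpoint, and it treats "matches the schema" as "body is an object with no unknown keys". A real run would likely validate against your actual schema instead.

```python
def score_request(expected, actual, required, allowed):
    """Score one model-produced request on the three checks.

    expected: dict with "method" and "path" from the case.
    actual:   the parsed request from the model, or None if parsing failed.
    required: set of body parameter names the endpoint requires.
    allowed:  set of all body parameter names the endpoint accepts.
    """
    if not isinstance(actual, dict):
        return {"endpoint": False, "required_params": False, "schema": False}

    # A missing or null body counts as an empty object.
    body = actual.get("body") or {}
    is_object = isinstance(body, dict)
    keys = set(body) if is_object else set()

    endpoint_ok = (
        str(actual.get("method", "")).upper() == expected["method"].upper()
        and actual.get("path") == expected["path"]
    )
    return {
        "endpoint": endpoint_ok,
        "required_params": required <= keys,        # every required name present
        "schema": is_object and keys <= allowed,    # no invented parameters
    }
```

Keeping the three results separate, rather than collapsing them into one pass/fail, shows which kind of mistake a doc page is causing.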
For each failure, look at the doc page the model would have used. The fix is usually a sentence, not a rewrite.
What you learn
You learn which endpoints are confusing in isolation. You learn which parameter names get mixed up with each other. You learn where examples disagree with the schema.
You also get a number you can track over time. Run the eval on every doc change and you can tell whether the docs are getting easier or harder for a model to use.
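The number to track can be as simple as the share of cases that pass every check. The sketch below assumes each result is a dict of check name to boolean, and that you keep a per-case pass/fail map for each run; both shapes are illustrative.

```python
def pass_rate(results):
    """Fraction of cases where every check passed.

    results: list of dicts mapping check name to bool, one dict per case.
    """
    if not results:
        return 0.0
    passed = sum(1 for r in results if all(r.values()))
    return passed / len(results)


def regressions(before, after):
    """Case ids that passed in the earlier run but fail in the later one.

    before, after: dicts mapping case id to bool (True means passed).
    A case missing from the later run counts as failing.
    """
    return sorted(
        case_id for case_id, ok in before.items()
        if ok and not after.get(case_id, False)
    )
```

The list of regressed case ids is often more useful than the rate itself, since it points at the pages a doc change made worse.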
A few things to watch:
- Spec-in-prompt and retrieval produce different scores. Test both and pick the setup closest to how your users actually feed your API to models.
- Score the first response, not the best of several retries. Real callers do not always retry.
- A 90% pass rate is not a victory. Check which cases make up the failing 10%; they often include your most-used endpoints.