AI Review Evaluation Loop
AI review tuning is easiest when you can replay the production path against a fixed task result, then choose whether to keep the original context or rebuild it from current task knowledge and loaders.
QDash now provides a small replay workflow under qdash.copilot.evals.ai_review for that purpose.
What It Solves
The production AI review path is triggered from two places:
- workflow-side automatic note attachment during calibration execution
- API-side bulk or manual AI review requests from the chip/task-result UI
Both paths now share the same rendering helpers in src/qdash/copilot/review.py. The evaluation tooling reuses those helpers instead of duplicating the LLM call path.
Capture A Snapshot
Capture a replayable snapshot from a real task result:
uv run python -m qdash.copilot.evals.ai_review capture \
--task-name CheckQubitSpectroscopy \
--chip-id chip-1 \
--qid 4 \
--task-id task-result-id \
--output /tmp/ai-review-check-qubit-spec.json
The snapshot stores:
- task identifiers (task_name, chip_id, qid, task_id)
- the resolved TaskAnalysisContext
- expected images
- the selected analysis model
- the user message that was sent to the LLM at capture time
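Before replaying, it can help to check what a captured file actually contains. The helper below is a minimal, hypothetical sketch and not part of QDash: it assumes the snapshot is a JSON object and that the four task identifiers are stored as top-level keys, which the list above does not confirm. If the real layout nests them, adjust the lookup accordingly.

```python
import json
from pathlib import Path

# Identifier names come from the list above; storing them as top-level
# keys is an assumption about the snapshot layout.
IDENTIFIER_KEYS = ("task_name", "chip_id", "qid", "task_id")


def load_snapshot(path: str) -> dict:
    """Read a snapshot file written by the capture command."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_snapshot(snapshot: dict) -> dict:
    """Report which identifiers are present, which are missing, and all other keys."""
    found = {key: snapshot[key] for key in IDENTIFIER_KEYS if key in snapshot}
    missing = [key for key in IDENTIFIER_KEYS if key not in snapshot]
    other = sorted(key for key in snapshot if key not in IDENTIFIER_KEYS)
    return {"identifiers": found, "missing": missing, "other_keys": other}
```

For example, summarize_snapshot(load_snapshot("/tmp/ai-review-check-qubit-spec.json")) gives a quick view of the file before it is used in a replay.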
Replay A Snapshot
Replay the saved snapshot with the current copilot/review.yaml and current code:
uv run python -m qdash.copilot.evals.ai_review run \
--snapshot /tmp/ai-review-check-qubit-spec.json \
--mode frozen \
--output-dir /tmp/ai-review-run-1 \
--print-markdown
Replay modes:
- frozen: reuse the stored analysis context. This is best for prompt-only or model-only tuning.
- rebuild: rebuild the context from the saved task identifiers using the current code and current task knowledge. This is best when editing docs/task-knowledge/*, context loaders, or context shaping logic.
Example rebuild run:
uv run python -m qdash.copilot.evals.ai_review run \
--snapshot /tmp/ai-review-check-qubit-spec.json \
--mode rebuild \
--output-dir /tmp/ai-review-run-2
Each replay writes:
- result.md: rendered AI review markdown
- context.json: the actual context used for that replay
- report.json: metadata including source task IDs, prompt text, and selected model
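When replaying the same snapshot repeatedly, the two commands above can be driven from a small script. The following is a sketch under stated assumptions: build_replay_command and replay_both_modes are hypothetical helper names, and only the flags already shown in this document are used. It also assumes uv is on the PATH and that the script runs from the repository root.

```python
import subprocess

VALID_MODES = ("frozen", "rebuild")


def build_replay_command(snapshot: str, mode: str, output_dir: str,
                         print_markdown: bool = False) -> list[str]:
    """Assemble the replay command from the flags shown in this document."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown replay mode: {mode!r}")
    cmd = [
        "uv", "run", "python", "-m", "qdash.copilot.evals.ai_review", "run",
        "--snapshot", snapshot,
        "--mode", mode,
        "--output-dir", output_dir,
    ]
    if print_markdown:
        cmd.append("--print-markdown")
    return cmd


def replay_both_modes(snapshot: str, base_dir: str) -> None:
    """Run one frozen and one rebuild replay into sibling directories."""
    for mode in VALID_MODES:
        # check=True stops the loop if a replay exits with a non-zero status.
        subprocess.run(build_replay_command(snapshot, mode, f"{base_dir}/{mode}"), check=True)
```

Writing each mode into its own directory keeps result.md, context.json, and report.json from one run from overwriting those of the other.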
Tuning Workflow
Use frozen mode when you are tuning:
- analysis.ai_review_message in config/copilot/review.yaml
- model choice or output-token settings
- deterministic formatting or guard behavior
Use rebuild mode when you are tuning:
- docs/task-knowledge/*
- TaskKnowledge.to_prompt() behavior
- Copilot runtime context loaders
- pruning or reshaping of AI review context
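The rule of thumb above can be written as a small check over the files touched in an edit. This is an illustrative sketch only: suggest_mode is a hypothetical helper, and docs/task-knowledge/ is the only path prefix taken from this document. The locations of context loaders and context-shaping code are not given here, so they would have to be passed in by the caller.

```python
# Path prefixes whose edits change the analysis context. Only
# docs/task-knowledge/ comes from this document; extend the tuple with
# your own loader or context-shaping paths.
REBUILD_PREFIXES = ("docs/task-knowledge/",)


def suggest_mode(changed_files: list[str],
                 rebuild_prefixes: tuple[str, ...] = REBUILD_PREFIXES) -> str:
    """Suggest 'rebuild' if any edited file affects the context, else 'frozen'."""
    for path in changed_files:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if path.startswith(rebuild_prefixes):
            return "rebuild"
    return "frozen"
```

Defaulting to frozen matches the list above, where prompt, model, and formatting changes do not require a new context.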
Model Overrides
You can override the replay model without editing copilot/review.yaml:
uv run python -m qdash.copilot.evals.ai_review run \
--snapshot /tmp/ai-review-check-qubit-spec.json \
--mode frozen \
--model-provider openai \
--model-name gpt-5.1 \
--max-output-tokens 4096 \
--output-dir /tmp/ai-review-run-3
Practical Loop
For short prompt/knowledge iteration cycles:
- Capture one representative snapshot per failure pattern you care about.
- Replay in frozen mode while tuning the review message and formatting.
- Replay in rebuild mode after editing task knowledge or context builders.
- Compare result.md and context.json across runs before retrying against the live chip page or workflow trigger.
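The comparison step in the loop above can be scripted with the standard library. The sketch below is one possible approach, not a QDash tool: changed_paths and compare_runs are hypothetical names, and the only assumption about the run directories is that each contains result.md and context.json as described earlier.

```python
import difflib
import json
from pathlib import Path


def changed_paths(old, new, prefix: str = "") -> list[str]:
    """List dotted JSON paths whose values differ between two parsed documents."""
    if isinstance(old, dict) and isinstance(new, dict):
        paths = []
        for key in sorted(set(old) | set(new)):
            child = f"{prefix}.{key}" if prefix else str(key)
            if key not in old or key not in new:
                paths.append(child)  # key added or removed
            else:
                paths.extend(changed_paths(old[key], new[key], child))
        return paths
    # Lists and scalars are compared as whole values.
    return [] if old == new else [prefix or "<root>"]


def compare_runs(run_a: str, run_b: str) -> tuple[str, list[str]]:
    """Return a unified diff of result.md and the changed context.json paths."""
    a, b = Path(run_a), Path(run_b)
    md_diff = "".join(difflib.unified_diff(
        (a / "result.md").read_text(encoding="utf-8").splitlines(keepends=True),
        (b / "result.md").read_text(encoding="utf-8").splitlines(keepends=True),
        fromfile=str(a / "result.md"),
        tofile=str(b / "result.md"),
    ))
    ctx_a = json.loads((a / "context.json").read_text(encoding="utf-8"))
    ctx_b = json.loads((b / "context.json").read_text(encoding="utf-8"))
    return md_diff, changed_paths(ctx_a, ctx_b)
```

For two frozen runs of the same snapshot, the list of changed context paths should be empty, so any difference in result.md can be attributed to the prompt, model, or formatting change under test.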